From: andrew cooke <andrew@...>
Date: Fri, 12 Oct 2012 05:42:00 -0300
Update: A higher-level, easier-to-read account is now available. You
can jump directly to Implementation or Lessons Learned if you don’t
want the background details.
Parts below are critical of the original design; since I had a chance
to comment on that design before it was implemented, this is largely
self-criticism.
The Original Cache
Background
The cache provides a local Python interface to a remote service
(called the “source” below) that speaks XMLRPC.
For confidentiality reasons I don’t want to use the original domain
model here, but it was analogous to the following: a set of workers that
are continuously measured for performance against various metrics; each
metric has ratings; each worker has a history of what periods they had
which ratings; the rules used to rate workers may vary from individual
to individual; several other worker attributes are stored.
So it’s a fairly complex, hierarchical set of data that includes time
series.
The source provides various views into these data. It has evolved
over time and duplicates information in various places. A typical call
might return ratings over some time range, for a given worker, or all
workers in a certain class. All data are structured in terms of lists
and maps (XMLRPC structs).
The original cache reified the model described above as Python
classes. The structure of these classes is pretty much as expected (a
Worker contains Rules and Metrics, Metrics contain time series of
Ratings, etc etc).
Problem 1 - Schizophrenic API
There are two different “visions” of the cache API in the description
above. One is a service that provides “fragments” of data (presumably
tailored to particular, historical needs). The other is a graph of
related objects.
Either vision could be used to provide an API, but the initial code
tried to provide both. So, for example, you could request a fragment of
a time series (and receive the appropriate objects). Or you could
request a Worker instance (say) and then use methods on that to locate
the related data (ie navigate the graph of connected objects).
Unfortunately, to provide all this in a consistent way, without
duplicating or omitting data, introduces complexity:
- To guarantee a consistent graph, the cache would need to maintain
  references to generated objects so that it could “patch them in” to
  new, related data, as they are requested.
- To support retrieval of related data, some backlinks were needed
  within the graph of data model instances.
And in practice, this was incompletely implemented (fragments were
disconnected). The result was inconsistencies and duplications in the
retrieved data.
Problem 2 - Multiple Data Models
The “object-graph” part of the API was new to the cache. Code in the
source (and other, related services) structured the data in a more
tabular form - as nested lists and maps. The XMLRPC calls returned
sections of this tabular structure.
This led to a mindset which considered the tabular form of the data
as fundamental. So the original cache used this as the internal data
format.
Since data within the cache were stored in tabular form, rather than
as instances of the model provided in the API, the cache was not able to
“patch together” a consistent graph (see previous section).
It was also unclear exactly what data had already been retrieved,
since the source provided a “patchwork” view over the table and the
format had no natural way to indicate where data were missing.
And, finally, the tabular format was undocumented and open to change.
The only reliable reference was the implementation used in the source,
but although that was available as a separate module, it was
re-implemented for the cache.
Problem 3 - Granularity And Efficiency
The main cost (in time spent) when using the source is constant
per-connection. So it makes much more sense to request a large amount of
data in a single call, rather than across multiple calls.
Clients that used the cache, therefore, would “pre-load” the cache by
making a very general call (eg current health for all Workers) before
requesting more specific data.
Because of the limitations of the internal tabular format (see above)
the cache was unable to detect what data had been pre-loaded. So the API
was extended to provide explicit control for when to use the internal
table and when to contact the source.
This, in turn, led to more complex clients.
Problem 4 - No Generations
Even though the source updated in definite steps, the protocol did not
include version data.
This, together with the fragmented view of data provided by the
source, led to overlapping, inconsistent data being retrieved over
time.
Also, the cache was implemented / used as a single instance, so any
“flush” operation would apply to all cache users. This effectively
forced single-threaded use.
Problem 5 - Not Distributed
Because the cache was in-memory (in Python code) it could not be
shared across machines. Clients on multiple machines produced repeated,
duplicate requests to the source.
The New Cache
Background
Work for the new cache was motivated by problem 5 (not distributed)
above. Multiple clients were over-taxing the server; a new
implementation that used memcache to support distributed clients was
requested.
The new cache implementation had to be generally compatible with the
initial version, but client code was available and could be modified to
accommodate small changes, if necessary.
Not all clients of the source use the cache; some call the protocol
“directly” (via Python’s XMLRPC libraries). The code for these clients
could not be modified, so modifications to existing XMLRPC procedures
were not acceptable.
Solution 5 - Distributed Cache
As described above, memcache was used to cache the data received from
calls to the source. The key is, essentially, the URL; the value is the
returned data.
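A minimal sketch of this, assuming Python 3’s xmlrpc.client and the
python-memcached client (the class and method names, and the exact key
scheme, are invented for illustration; the real code is not shown
here):

    import hashlib
    import xmlrpc.client

    import memcache


    class DumbCache:
        '''Sketch only: cache raw XMLRPC responses in memcache, keyed
        (essentially) by the source URL plus the call made to it.'''

        def __init__(self, source_url, mc_servers):
            self._source_url = source_url
            self._proxy = xmlrpc.client.ServerProxy(source_url)
            self._mc = memcache.Client(mc_servers)

        def _key(self, method, args):
            # the full source URL (host and port) is part of the key
            raw = '%s|%s|%r' % (self._source_url, method, args)
            return hashlib.sha1(raw.encode('utf-8')).hexdigest()

        def call(self, method, *args):
            key = self._key(method, args)
            value = self._mc.get(key)
            if value is None:
                # cache miss: ask the source, store the raw response
                value = getattr(self._proxy, method)(*args)
                self._mc.set(key, value)
            return value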
Solution 4 - Generationed Data
The XMLRPC protocol was extended to include information on the
current version of data in the server.
The source already contained two separate generations of data (one
current, one being constructed). It was relatively easy to extend this
to a configurable number of generations, providing access to each via a
new, different port (while preserving access to the “current” generation
through the original port). This allows cache instances to access a
particular generation without changing the existing protocol.
Memcache keys include the source address (with port, which is tied to
generation number). This allows multiple generations to be cached
without overwriting data.
To simplify management of generations in clients, the new cache
implementation has a method that returns a fresh instance for the latest
generation. This, combined with other changes (below) that remove the
need for explicit cache control, supports a much simpler pattern of use:
a cache is instantiated for a single, short task (eg generating a web
page) that requires consistent data from a single generation. Re-running
the task, or similar tasks in other threads, use a new instance of the
cache.
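In outline, the intended pattern of use looks something like this (a
sketch with a stub standing in for the real cache; all names here are
invented):

    class StubCache:
        '''Stands in for the real cache; only the shape matters here.'''
        def __init__(self, generation=0):
            self.generation = generation
        def latest(self):
            # the real method consults memcache for the newest
            # generation (see the Primer, below)
            return StubCache(self.generation + 1)
        def all_workers(self):
            return ['worker-1', 'worker-2']

    def render_page(cache):
        # a short task: every read sees one consistent generation
        for name in cache.all_workers():
            print(cache.generation, name)

    cache = StubCache()
    render_page(cache.latest())   # one fresh instance per task
    render_page(cache.latest())   # a re-run starts from a new one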
Finally, a new service was introduced to the system, called the
Primer. This monitors the source, detects a new generation, and
pre-loads commonly-used, generic data (see below) to memcache. It then
modifies an entry in memcache that describes the current generation;
this allows new cache instances to identify and access the latest
data.
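The Primer’s main loop might look roughly like this (a sketch: the
current_generation call, the memcache key, and the preload callback
are assumptions, not the real interface):

    import time
    import xmlrpc.client

    import memcache

    GENERATION_KEY = 'current-generation'   # hypothetical key name

    def prime(source_url, mc_servers, preload, poll_seconds=60):
        '''preload(generation) should make the generic, commonly-used
        calls through a cache instance for the given generation.'''
        mc = memcache.Client(mc_servers)
        proxy = xmlrpc.client.ServerProxy(source_url)
        known = None
        while True:
            generation = proxy.current_generation()  # hypothetical call
            if generation != known:
                preload(generation)                 # warm memcache first,
                mc.set(GENERATION_KEY, generation)  # then publish
                known = generation
            time.sleep(poll_seconds)

Publishing the generation key only after pre-loading means new cache
instances never see a generation whose generic data are still cold.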
Solution 3 - Expansion To Generic Requests
Although the original cache interface was very broad, the problems
outlined earlier led to it being used in a very restricted way - global
information was loaded (ie health for all Workers) and then, for each
Worker that required further data, all data was loaded for that
instance.
So, in practice, only two source calls were used to populate the
cache.
The new cache intercepts all calls and generates responses from the
internal model (described below). Because the internal model contains
sufficient information to identify incomplete data, the appropriate
calls to the source can be made automatically. Only the two calls
described above are used, so the implementation is quite simple.
For example: an initial query to retrieve a single Worker will trigger
the retrieval of all Workers (the “global information” described
above); the first request for detailed data for a given Worker will
load all data for that instance.
Solution 2 - Single Data Model
The data stored in memcache are direct snapshots of the source calls.
The new cache uses that data to construct an in-memory graph of
instances using the same classes exposed by the cache API.
As described above, the graph is constructed in two stages. First,
“global data” are loaded; second, calls to retrieve data for particular
Workers trigger retrieval of the data for those instances (only). It
may seem costly to keep these instances in memory but, since they are
also the results returned to clients, they are likely held in memory
by the caller anyway. And cache lifetimes tend to be short (a cache is
discarded for each new generation), so the in-memory store does not
have time to grow too large.
There is no attempt to reconstruct the “tabular” format used in the
source (cf the old client, as described above).
Solution 1 - Emphasise One Vision Of API
The new client prioritises the “graph of objects” API. Calls to
request fragments are still supported, but trigger the creation of the
appropriate section of the object graph before returning a filtered
subset of that model.
This guarantees (with generationed data) a consistent set of
results.
Implementation
The cache has four responsibilities:

- Expansion of specific calls (targeting a particular detail) to
  generic requests (targeting all workers, or all information for one
  worker).
- Caching responses from the source in a way that can be accessed by a
  distributed set of clients.
- Translation of (tabular) responses into data model instances.
- Management of generations.
It is implemented using a layered approach. The lowest level is a
“dumb” wrapper around memcache. For each method in the cache API (which
corresponds pretty closely to each XMLRPC procedure) memcache is
checked; if no data are available, a call is made to the source and the
results are stored in memcache and returned to the caller.
The data model is a separate module that includes translation from
the tabular format (as received from XMLRPC) into the graph of objects.
The lowest level cache translates each fragment separately, so the
client receives unconnected fragments on successive calls.
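A sketch of this translation, with invented class and field names (the
real data model is confidential, as noted above):

    class Rating:
        def __init__(self, start, end, value):
            self.start, self.end, self.value = start, end, value

    class Worker:
        def __init__(self, name, ratings=None):
            self.name = name
            self.ratings = ratings   # None marks "detail not loaded"

    def workers_from_table(table):
        '''Translate a tabular fragment (lists and structs, as received
        over XMLRPC) into data model instances.'''
        return [Worker(row['name']) for row in table]

    def add_ratings_from_table(worker, table):
        '''Attach detailed data to an existing Worker instance.'''
        worker.ratings = [Rating(r['start'], r['end'], r['value'])
                          for r in table]

Keeping ratings as None until loaded is one way to make “missing data”
identifiable in the model (see Lessons Learned, below).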
The middle level of the cache manages an in-memory set of Worker
instances. The first call for a Worker is converted to a call (via the
lower level) for all workers; the results (as data model
instances) are stored by name. Subsequent calls then pull Workers
directly from this in-memory store.
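Continuing the earlier sketches (DumbCache and workers_from_table are
defined above; the call name is invented):

    class MidCache(DumbCache):
        '''Sketch of the middle level: the first request for any Worker
        expands to the single, generic "all workers" call.'''

        def __init__(self, *args):
            super().__init__(*args)
            self._workers = None    # name -> Worker, until loaded

        def worker(self, name):
            if self._workers is None:
                table = self.call('all_workers')   # via the base level
                self._workers = {w.name: w
                                 for w in workers_from_table(table)}
            return self._workers[name]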
The top level of the cache is similar to the middle level, but
intercepts calls for detailed data. The parent Worker is retrieved from
the mid-level and, if that is missing the required information, a
request is made for all data for that Worker; the response is
used to construct the entire graph. Finally, the client receives the
appropriate fragment taken from the complete, consistent object graph.
In this way the client receives connected fragments.
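And a sketch of the top level in the same style (again with invented
call names):

    class TopCache(MidCache):
        '''Sketch of the top level: calls for detail check the parent
        Worker and, if data are missing, load everything for that
        Worker in a single request.'''

        def ratings(self, name, start, end):
            worker = self.worker(name)      # via the middle level
            if worker.ratings is None:      # detail not yet loaded
                table = self.call('worker_detail', name)
                add_ratings_from_table(worker, table)
            # return a fragment filtered from the consistent graph
            return [r for r in worker.ratings
                    if start <= r.start and r.end <= end]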
This is implemented using inheritance: the lowest level is the base
class; the middle and top levels are successive sub-classes.
The base level must be generation-aware as the memcache keys depend
on the generation used. This level also implements the method that
returns a new cache instance for the latest generation. By using
type(self)(...) (and the same constructor arguments for all classes),
sub-classes are automatically accommodated. The previous cache
implementation was modified to return itself when this call is
invoked. This allows it to be used as a direct replacement when
generations are not required.
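A sketch of how this might look (the constructor arguments and
memcache key are assumptions):

    import memcache

    class BaseCache:

        def __init__(self, host, mc_servers, generation):
            self._host = host              # generation selects the port
            self._mc_servers = mc_servers
            self._generation = generation

        def latest(self):
            mc = memcache.Client(self._mc_servers)
            generation = mc.get('current-generation')  # set by the Primer
            # the same constructor signature for all classes means that
            # type(self) builds whichever sub-class self happens to be
            return type(self)(self._host, self._mc_servers, generation)

    class LegacyCache:

        def latest(self):
            # the previous, non-generational cache returns itself, so
            # it remains a drop-in replacement
            return self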
Preserving the cache API and implementing functionality in
progressive layers made it easy to test for consistency during
development.
Lessons Learned
In no particular order:

- Identify common usage patterns and use those to simplify how data
  are loaded. It’s OK to load too much data, within reason (and in a
  sense it gets more OK as the number of clients increases, since they
  are all doing the same, and so hitting memcache, and not the
  source).
- Make generations (if you need them) an end-to-end property. In other
  words, think about this as early as possible in system design.
- Use a new instance of the cache for each generation (if this is
  returned from a method on the previous cache then a non-generational
  cache can return itself).
- Carefully consider the balance between distributed and in-memory
  stores. We keep a snapshot of the returned data in memcache, but
  save the “processed” object graph in memory - this reduces server
  load while keeping the caches responsive to simple queries.
- Try to develop the cache and source in tandem, so all clients can be
  isolated from the protocol details.
- Make memcache a direct copy of the wire data. This makes it easy to
  “read through” and supports alternative cache implementations (eg
  with a different data model implementation).
- Take care to separate the “direct cache” from any “translation
  layer” that converts the results into a more usable data model.
- Make sure that you can identify “missing data” in your data model.
Following these gave a replacement cache that is significantly
simpler than the original implementation, while leaving the cache API,
service protocol, and cache clients largely unchanged. It also performs
better with multiple, distributed clients and provides stronger
guarantees about the consistency of the results (both in time - via
generationed data - and “space” - as a single connected graph of data
model instances).