Following our discussion of causal consistency in COPS, this post examines TAO, Facebook's eventually consistent distributed data store for the social graph. The paper was presented at USENIX ATC 2013.
Let's begin with some basic concepts.
Consistency Models Revisited
We've previously explored linearizability and sequential consistency in ZooKeeper, as well as causal consistency in COPS. In this discussion, we'll examine eventual consistency models and provide an illustrative example to summarize them, within the context of single-key distributed key-value stores.
The phrase "consistency model" in the context of a single data item read-write system means the following:
- Say clients are observing all the writes happening in the store. A consistency model determines the order in which the clients see the writes.
- After all the writes are applied, the consistency model determines what would be the effective state of the objects in the store.
The consistency model hierarchy is:
Linearizable -> Sequential -> Causal -> PRAM -> Read-Your-Writes (RYW) -> Eventual
Read-Your-Writes (RYW)
This model guarantees that a client can observe the effect of its own writes.
Example
Consider the following example, with the initial state {X: 0, Y: 0}.
The real-time ordering of events is:
- W(X, 1) or W1 - Client 1 sets key X to value 1.
- W(X, 2) or W2 - Client 2 sets key X to value 2.
- W(Y, 3) or W3 - Client 1 sets key Y to value 3.
- W(X, 4) or W4 - Client 2 sets key X to value 4.
The causal relationships are:
- W1 -> W3 - both writes are issued by client 1 (session order).
- W2 -> W4 - both writes are issued by client 2 (session order).
Now we can define each consistency model.
Linearizability
- Order of writes: W1, W2, W3, W4. All clients will observe this and only this order.
- The final effective state will be {X: 4, Y: 3}.
Sequential Consistency
- Possible orders are all interleavings that preserve each client's program order (W1 before W3, and W2 before W4); only one of these orders will be chosen, and all clients will observe that same order:
- W1, W2, W3, W4
- W1, W2, W4, W3
- W1, W3, W2, W4
- and so on.
- The final effective state will be either {X: 1, Y: 3} or {X: 4, Y: 3}, based on which total order was chosen.
Causal Consistency
- Possible orders are listed below (all orders in which W1 appears before W3 and W2 appears before W4); clients can observe any of these orders, and different clients may observe different orders:
- W1, W2, W3, W4
- W1, W2, W4, W3
- W1, W3, W2, W4
- W2, W1, W3, W4
- W2, W1, W4, W3
- W2, W4, W1, W3
- The final effective states will be {X: 1, Y: 3} and {X: 4, Y: 3}. Both are valid and the system can return different values for X without affecting consistency.
PRAM (or FIFO) Consistency
- Possible orders are listed below (all orders in which W1 appears before W3 and W2 appears before W4); clients can observe any of these orders, and different clients may observe different orders:
- W1, W2, W3, W4
- W1, W2, W4, W3
- W1, W3, W2, W4
- W2, W1, W3, W4
- W2, W1, W4, W3
- W2, W4, W1, W3
- The final effective states will be {X: 1, Y: 3} and {X: 4, Y: 3}. Both are valid, and the system can return different values for X without violating the model. Note that PRAM coincides with causal consistency in this example: with no reads, the only causal dependencies are each client's session order.
RYW Consistency
- Possible orders for all clients except clients 1 and 2 are all 4! permutations; any ordering is fine.
- Possible orders for client 1 are those in which W1 is ordered before W3; client 1 will observe one of them.
- Similarly, possible orders for client 2 are those in which W2 is ordered before W4; client 2 will observe one of them.
- The final effective state for all other clients can be {X: 1, Y: 3}, {X: 2, Y: 3}, or {X: 4, Y: 3}. Different replicas in the data store may hold different values, and that is completely fine.
- The final effective state for client 1 will be {X: 1, Y: 3}, {X: 2, Y: 3}, or {X: 4, Y: 3}, depending on the order it observed.
- The final effective state for client 2 will be {X: 1, Y: 3} or {X: 4, Y: 3}, depending on the order it observed.
Eventual Consistency
- All orders are valid, and they can differ across clients; the only requirement is that, once writes stop, all replicas eventually converge.
- The final effective state will be one of {X: 1, Y: 3}, {X: 2, Y: 3}, or {X: 4, Y: 3}.
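To sanity-check these claims, here is a short Python sketch (illustrative, not from the paper) that enumerates the write orders each model permits and computes the resulting final states:

```python
from itertools import permutations

# Each write maps to (key, value); the initial state is {X: 0, Y: 0}.
WRITES = {"W1": ("X", 1), "W2": ("X", 2), "W3": ("Y", 3), "W4": ("X", 4)}

def respects_session_order(order):
    # Client 1 issued W1 then W3; client 2 issued W2 then W4.
    return order.index("W1") < order.index("W3") and \
           order.index("W2") < order.index("W4")

def final_state(order):
    state = {"X": 0, "Y": 0}
    for w in order:
        key, value = WRITES[w]
        state[key] = value
    return state

# Sequential, causal, and PRAM all admit exactly the six
# session-order-preserving interleavings in this example:
valid = [p for p in permutations(WRITES) if respects_session_order(p)]
print({tuple(final_state(p).items()) for p in valid})
# -> final states {X: 1, Y: 3} and {X: 4, Y: 3}

# Eventual consistency admits all 4! = 24 orders:
print({tuple(final_state(p).items()) for p in permutations(WRITES)})
# -> final states {X: 1, Y: 3}, {X: 2, Y: 3}, and {X: 4, Y: 3}
```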
Read-After-Write (RAW) Consistency
Read-After-Write (RAW) consistency originated primarily as an industry term, used by companies like Amazon and Facebook.
It ensures that a read issued after a write completes returns the written value or a later version. Stated this way, it sounds equivalent to linearizability, which guarantees immediate visibility of a committed write to all clients.
In reality, however, it is not linearizability. It is a descriptive label for a common industry technique for making writes visible to readers. To give a spoiler: readers are redirected to the same replica where the writer performed the write. That is weaker than linearizability, under which all replicas would reflect the write. We will explore the implications of RAW consistency in the context of TAO shortly.
TAO
Graph Model
Objects
- The key is the object's unique identifier (id).
- The value is a structure containing the object type (otype) and a set of key-value pairs representing the object's attributes: <otype, (key -> value)>. The otype determines how the internal key-value pairs within the value are interpreted.
Associations
- The key is composed of the source object identifier (id1), the association type (atype), and the destination object identifier (id2): <id1, atype, id2>.
- The value includes a timestamp (time) and a set of key-value pairs representing the association's attributes: <time, (key -> value)>. The atype allows for defining multiple relationship types between objects.
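To make the model concrete, here is a minimal Python sketch of the two record types. The field names follow the paper's notation, but the classes themselves are illustrative rather than TAO's actual code:

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class TaoObject:
    id: int                                   # unique object identifier (the key)
    otype: str                                # object type, e.g. "user" or "post"
    attrs: Dict[str, str] = field(default_factory=dict)  # (key -> value) pairs

@dataclass
class Assoc:
    id1: int                                  # source object id
    atype: str                                # association type, e.g. "friend"
    id2: int                                  # destination object id
    time: int                                 # timestamp, used for ordering
    attrs: Dict[str, str] = field(default_factory=dict)  # (key -> value) pairs

    def key(self) -> Tuple[int, str, int]:
        # The association key is the triple <id1, atype, id2>.
        return (self.id1, self.atype, self.id2)
```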
API
Write Operations
- assoc_add(id1, atype, id2, time, (k->v)*): Adds an association of type atype between objects id1 and id2.
- assoc_del(id1, atype, id2): Deletes the specified association, which corresponds to deleting the key <id1, atype, id2>.
- assoc_change_type(id1, atype, id2, newtype): Changes the association type. This deletes the existing key <id1, atype, id2> and adds a new key <id1, newtype, id2> in a single operation.
Read Operations
- assoc_get(id1, atype, id2set, high?, low?): Retrieves all associations of type atype between id1 and objects within the id2set. This requires retrieving multiple key-value pairs from the database. Optionally, the client can specify the time range (low to high) for the associations.
- assoc_count(id1, atype): Returns the number of associations of type atype originating from id1.
- assoc_range(id1, atype, pos, limit): Returns a range of associations of type atype originating from id1, starting at position pos (when sorted temporally), and returning a maximum of limit associations.
- assoc_time_range(id1, atype, high, low, limit): Returns associations of type atype originating from id1, within a specified time range (low to high), returning a maximum of limit associations.
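As an illustration of the read API's semantics, here is a small in-memory stand-in. The TaoClient class and its storage layout are hypothetical; only the method names and parameters mirror the API above:

```python
from typing import Dict, List, Optional, Set, Tuple

class TaoClient:
    """Hypothetical in-memory stand-in illustrating the read API's semantics."""

    def __init__(self) -> None:
        # (id1, atype) -> list of (time, id2, attrs), kept newest-first
        self._assocs: Dict[Tuple[int, str], List[Tuple[int, int, dict]]] = {}

    def assoc_add(self, id1: int, atype: str, id2: int, time: int,
                  attrs: Optional[dict] = None) -> None:
        lst = self._assocs.setdefault((id1, atype), [])
        lst.append((time, id2, attrs or {}))
        lst.sort(key=lambda a: a[0], reverse=True)   # newest first

    def assoc_get(self, id1, atype, id2set: Set[int],
                  high: Optional[int] = None, low: Optional[int] = None):
        return [a for a in self._assocs.get((id1, atype), [])
                if a[1] in id2set
                and (high is None or a[0] <= high)
                and (low is None or a[0] >= low)]

    def assoc_count(self, id1: int, atype: str) -> int:
        return len(self._assocs.get((id1, atype), []))

    def assoc_range(self, id1, atype, pos: int, limit: int):
        return self._assocs.get((id1, atype), [])[pos:pos + limit]

# Usage: the ten most recent comments on a post, plus the total count.
client = TaoClient()
client.assoc_add(1234, "comment", 111, time=5)
client.assoc_add(1234, "comment", 222, time=9)
recent = client.assoc_range(1234, "comment", pos=0, limit=10)
total = client.assoc_count(1234, "comment")
```

Keeping each list sorted by time is what makes the common "newest first" range queries cheap to serve.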
Architecture
TAO's architecture is complex, reflecting its diverse range of use cases and the system's evolutionary development. This complexity is managed through a layered design and strategic sharding.
Sharding
The entire dataset is divided into logical shards for distribution and scalability. The sharding is based on the object identifier. All associations from an object (id1) are part of the same shard.
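A minimal sketch of the shard mapping. The modulo function is a stand-in for illustration; TAO actually embeds the shard id in the 64-bit object id:

```python
NUM_SHARDS = 1024  # illustrative; the real deployment has far more shards

def shard_of_object(obj_id: int) -> int:
    # Stand-in mapping: TAO embeds the shard id inside the object id, but
    # any deterministic id -> shard function illustrates the idea.
    return obj_id % NUM_SHARDS

def shard_of_assoc(id1: int, atype: str, id2: int) -> int:
    # An association is stored on the shard of its source object (id1),
    # so all outgoing associations of an object live on one shard.
    return shard_of_object(id1)
```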
Regional Deployment and Replication
TAO is deployed across multiple regions, each acting as a leader for a subset of the logical shards. Notably, each region stores all shards, regardless of its leadership role. This ensures data availability even if a region becomes temporarily unavailable.
Each region comprises the two layers - storage layer and cache layer.
Storage Layer (MySQL Database)
All objects and associations are persisted in a sharded MySQL database, and the API operations are translated into corresponding SQL queries. The storage layer within each region maintains all logical shards, with objects and associations stored in separate tables.
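The paper does not spell out the schema, but here is a plausible sketch of the translation, using sqlite3 as a stand-in for a single MySQL shard (table layout assumed):

```python
import sqlite3

db = sqlite3.connect(":memory:")  # stand-in for one MySQL shard

# Objects and associations are kept in separate tables (schema assumed).
db.executescript("""
CREATE TABLE objects (id INTEGER PRIMARY KEY, otype TEXT, data BLOB);
CREATE TABLE assocs  (id1 INTEGER, atype TEXT, id2 INTEGER,
                      time INTEGER, data BLOB,
                      PRIMARY KEY (id1, atype, id2));
""")

# assoc_range(id1, atype, pos, limit) translates to an ordered range scan:
def assoc_range(id1, atype, pos, limit):
    return db.execute(
        "SELECT id2, time, data FROM assocs "
        "WHERE id1 = ? AND atype = ? "
        "ORDER BY time DESC LIMIT ? OFFSET ?",
        (id1, atype, limit, pos)).fetchall()

# assoc_count(id1, atype) translates to an aggregate:
def assoc_count(id1, atype):
    return db.execute(
        "SELECT COUNT(*) FROM assocs WHERE id1 = ? AND atype = ?",
        (id1, atype)).fetchone()[0]
```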
Cache Layer
This layer caches objects, association lists, and association counts, employing an LRU eviction policy. The cache is not simply passive storage; it understands the semantics of the stored key-value pairs and can execute application logic. The cache layer is organized into two tiers:
- Follower Tier: Clients interact with this tier; there are multiple follower tiers per region.
- Leader Tier: This tier communicates directly with the storage layer.
Within each cache tier, TAO utilizes a slab allocator to manage memory. RAM is divided into arenas, each dedicated to a specific object or association type. This partitioning provides isolation for LRU eviction policies, ensuring that different data types do not interfere with each other. Further optimizations exist for small, fixed-size items like association counts.
The cache is responsible for storing the following data:
- Objects: Cached individually, keyed by their unique identifier (<id>).
- Association Counts: The number of associations for each combination of object identifier and association type (<id, atype>). This is crucial for efficiently executing assoc_count queries.
- Association Lists: Lists of associations for each <id, atype> combination, ordered by time and typically limited to 6,000 entries. These lists answer range queries; queries that go beyond the cached limit fall through to the database.
It's important to note that these cached items directly reflect TAO's graph model. However, the underlying database operates as a simple key-value store.
Scalability
TAO's deployment model facilitates high scalability. Read operations are primarily handled by the local follower tier cache. Cache misses trigger queries to the region's storage layer (MySQL).
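A sketch of this read path at a follower tier (the classes and cache keys are illustrative; the real system uses an LRU cache and RPCs between tiers):

```python
class FollowerCache:
    """Illustrative follower-tier read path: serve from the local cache,
    fall back to the leader tier (and ultimately MySQL) on a miss."""

    def __init__(self, leader):
        self.leader = leader        # leader cache tier for this region
        self.cache = {}             # LRU in the real system; a dict here

    def get_object(self, obj_id):
        key = ("obj", obj_id)
        if key in self.cache:                   # cache hit: no network hop
            return self.cache[key]
        value = self.leader.get_object(obj_id)  # miss: ask the leader tier
        self.cache[key] = value
        return value

    def invalidate(self, key):
        # Applied when the leader tier sends an invalidation message.
        self.cache.pop(key, None)
```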
Write operations (described next) are propagated asynchronously to other regions.
Write Propagation
Write operations are propagated synchronously from the follower cache tier to the leader cache tier, which then updates the database in the storage layer.
Upon database write, a changeset is generated, encapsulating the modifications. For example, deleting an association results in a changeset that:
- Removes the association from the corresponding association list.
- Decrements the association count for the object and association type.
The changeset is applied synchronously along the path: client -> follower cache tier -> leader cache tier -> leader database. However, updates to other cache tiers and databases are performed asynchronously, using an invalidation-based approach.
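As a sketch, applying the association-delete changeset above at a cache tier might look like the following (the cache layout is assumed; the paper describes changesets only at this level of detail):

```python
def apply_assoc_delete(cache: dict, id1: int, atype: str, id2: int) -> None:
    """Apply the changeset for deleting association <id1, atype, id2>
    to one cache tier (a plain dict stands in for the LRU cache)."""
    list_key = ("assoc_list", id1, atype)
    count_key = ("assoc_count", id1, atype)

    # 1. Remove the association from the cached association list, if cached.
    if list_key in cache:
        cache[list_key] = [a for a in cache[list_key] if a["id2"] != id2]

    # 2. Decrement the cached association count, if cached.
    if count_key in cache:
        cache[count_key] = max(0, cache[count_key] - 1)
```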
Leader Cache Tier Invalidation
After applying the update, the leader cache tier sends invalidation messages to all follower cache tiers within its region for the affected object. It also sends refill messages for association lists.
Database Replication and Inter-Regional Invalidation
The leader database replicates updates to the databases in the non-leader regions. Each non-leader region's database then sends invalidation messages to its region's leader cache tier, which in turn forwards them to the follower cache tiers in that region.
Example
In the figure above, solid lines represent data flow, while dotted lines indicate control messages, specifically invalidations and refills. Consider a write operation initiated by a client in a non-leader region.
The write request goes from the non-leader follower cache to the non-leader region's leader cache, and then to the leader region's leader cache. Along this path, the write is applied synchronously at each component. The leader cache then performs the write against the leader database.
Following the database update, the leader cache sends invalidation messages to all follower caches within its own (leader) region.
Subsequently, the leader database replicates the write to the non-leader region's database. The leader database then sends an invalidation message to the leader cache within the non-leader region. This non-leader leader cache, in turn, propagates invalidation messages to all follower caches within its region.
Note: In the example above, the non-leader region's leader cache effectively learns of the update twice: once when the changeset is applied synchronously along the write path, and once via the invalidation that follows database replication.
Consistency Model
While TAO is often described as eventually consistent, a deeper examination reveals a more nuanced consistency model.
Per-Key Sequential Consistency
TAO leverages MySQL as its underlying storage. Each key-value pair is managed by a single MySQL database instance, which acts as the leader for that data. Although data is replicated across regions, a single leader is designated for each key-value pair at any given time. This design enables the serialization of writes for a specific key-value pair.
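Because all writes for a key funnel through that key's single leader database, a compound operation such as assoc_change_type can execute there as one atomic transaction. A minimal sketch with sqlite3 standing in for the leader MySQL instance (schema assumed, as before):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE assocs (id1 INTEGER, atype TEXT, id2 INTEGER, "
           "time INTEGER, data BLOB, PRIMARY KEY (id1, atype, id2))")

def assoc_change_type(id1: int, atype: str, id2: int, newtype: str) -> None:
    # Delete the old key and insert the new one in a single transaction,
    # so no reader ever observes the association half-moved.
    with db:  # the sqlite3 context manager commits or rolls back atomically
        row = db.execute(
            "SELECT time, data FROM assocs WHERE id1=? AND atype=? AND id2=?",
            (id1, atype, id2)).fetchone()
        if row is None:
            raise KeyError("association not found")
        db.execute("DELETE FROM assocs WHERE id1=? AND atype=? AND id2=?",
                   (id1, atype, id2))
        db.execute("INSERT INTO assocs VALUES (?,?,?,?,?)",
                   (id1, newtype, id2, row[0], row[1]))
```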
Because a single MySQL process acts as the leader for all operations on an object and its associations, atomic operations such as assoc_change_type can be implemented with SQL queries, as the sketch above illustrates. The same atomicity could also be achieved with a non-distributed key-value store like LevelDB, which provides atomic write batches.
Is TAO Sequentially Consistent?
Not as a whole. Writes to a given key are serialized by that key's leader database, but different follower tiers apply invalidations at different times, so clients reading from different tiers may observe writes to different keys in different orders. There is no single total order that all clients observe, which is what sequential consistency would require.
Read-Write Transactions
TAO does not support traditional read-write transactions. This aligns with systems like Dynamo and COPS, which prioritize availability over strong consistency. All writes are blind writes.
Read-After-Write (RAW) Consistency
Within a single cache tier (follower tier), TAO provides RAW consistency: when a write is successfully applied to the database, the local cache tier is updated. However, this guarantee does not hold across cache tiers; a client querying a different tier immediately after a write may not see the updated value.
Fault Tolerance
Given TAO's effective reliance on eventual consistency, its fault tolerance model is designed for simplicity and resilience.
Database Failures
- Leader Database Failure: A non-leader database takes over as the new leader.
- Non-Leader Database Failure: Read requests are redirected to the leader database.
Cache Failures
- Leader Cache Tier Failure:
- Follower cache misses are redirected to the database directly.
- Write requests are redirected to another available leader cache tier replica, which then enqueues invalidation messages for the original leader.
- Invalidation Failure: The leader cache tier enqueues messages and delivers them when the follower becomes reachable.
- Follower Cache Tier Failure: Client requests are redirected to another follower cache tier, which results in the loss of RAW consistency for the client.
Evaluation
- TAO demonstrates remarkably high availability in production environments, reporting a 99.999% uptime.
- A single follower cache tier is capable of handling up to 600k QPS during peak hit rates. However, throughput decreases as the hit rate declines, due to expensive downstream data fetches for cache misses.
- The overall cache hit rate is 96.4%.
- Write latency within a single region is 12.1 ms. For remote region writes, the latency increases to 74.4 ms, with 58.1 ms attributed to round-trip network latency.
- Despite its eventual consistency model, TAO exhibits low replication delays:
- p85: 1s
- p99: 3s
- p999: 10s