
Paper Insights #27 - TAO: Facebook's Distributed Data Store for the Social Graph

Following our discussion of causal consistency in COPS, this paper presents an eventually consistent data store designed for the social graph. It was presented at USENIX ATC 2013, a prestigious systems venue.

Paper Link

Let's begin with some basic concepts.

Consistency Models Revisited

We've previously explored linearizability and sequential consistency in ZooKeeper, as well as causal consistency in COPS. In this discussion, we'll examine the weaker consistency models, down to eventual consistency, and provide an illustrative example to summarize them, all within the context of single-key distributed key-value stores.

The phrase "consistency model" in the context of a single data item read-write system means the following:

  • Say clients are observing all the writes happening in the store. A consistency model determines the order in which the clients see the writes.
  • After all the writes are applied, the consistency model determines what would be the effective state of the objects in the store.

The consistency model hierarchy is:

Linearizable -> Sequential -> Causal -> PRAM -> Read-Your-Writes (RYW) -> Eventual

Read-Your-Writes (RYW)

This model guarantees that a client can observe the effect of its own writes.

Example

Consider the following example, with the initial state being {X: 0, Y: 0}:

The real-time ordering of events is:

  1. W(X, 1) or W1 - Client 1 sets key X to value 1.
  2. W(X, 2) or W2 - Client 2 sets key X to value 2.
  3. W(Y, 3) or W3 - Client 1 sets key Y to value 3.
  4. W(X, 4) or W4 - Client 2 sets key X to value 4.

The causal relationships are:

  1. W1 -> W3 - writes issued by the same client (client 1), i.e., in session order.
  2. W2 -> W4 - writes issued by the same client (client 2), i.e., in session order.

Now we can define each consistency model.

Linearizability

  • Order of writes: W1, W2, W3, W4. All clients will observe this and only this order.
  • The final effective state will be {X: 4, Y: 3}.

Sequential Consistency

  • Possible orders are all permutations that preserve each client's own write order (W1 before W3 and W2 before W4); only one of these orders will be chosen, and all clients will observe that same order:
    • W1, W2, W3, W4
    • W1, W2, W4, W3
    • W1, W3, W2, W4
    • and so on.
  • The final effective state will be either {X: 1, Y: 3} or {X: 4, Y: 3}, based on which total order was chosen.
Important Note: For models at least as strong as sequential consistency, the system maintains a single deterministic order of writes. There will only be one order and hence only one final state. Beyond this point, multiple values for the same key can exist without violating the consistency model. When such multiple values exist, the system can choose to present all the known values to the clients for the clients to decide.

Causal Consistency

  • Possible orders are listed below (all orders in which W1 appears before W3 and W2 appears before W4); clients can observe any of these orders, and different clients may observe different orders:
    • W1, W2, W3, W4
    • W1, W2, W4, W3
    • W1, W3, W2, W4
    • W2, W1, W3, W4
    • W2, W1, W4, W3
    • W2, W4, W1, W3
  • The final effective states will be {X: 1, Y: 3} and {X: 4, Y: 3}. Both are valid and the system can return different values for X without affecting consistency.

PRAM (or FIFO) Consistency

  • Possible orders are listed below (all orders in which W1 appears before W3 and W2 appears before W4); clients can observe any of these orders, and different clients may observe different orders:
    • W1, W2, W3, W4
    • W1, W2, W4, W3
    • W1, W3, W2, W4
    • W2, W1, W3, W4
    • W2, W1, W4, W3
    • W2, W4, W1, W3
  • The final effective states will be {X: 1, Y: 3} and {X: 4, Y: 3}. Both are valid and the system can return different values for X without affecting consistency.
For this example, PRAM consistency appears to be the same as causal consistency. To distinguish them, let's say there was a R(X) issued by client 2 before W2, which fetched value 1. Then we have the additional causal relationships W1 -> R(X) and R(X) -> W2. Under causal consistency, W1 must now precede W2 (and hence W4) in every observed order, so the final effective state will be {X: 4, Y: 3}.



For PRAM consistency, however, all orders mentioned above would still be valid, because PRAM only preserves each client's own write order and ignores the ordering introduced by the read.

RYW Consistency

This is where things start to differ slightly.
  • Possible orders for all clients except clients 1 and 2 are all 4! permutations; any ordering is fine.
  • Possible orders for client 1 are those in which W1 is ordered before W3; client 1 will observe one of them.
  • Similarly, possible orders for client 2 are those in which W2 is ordered before W4; client 2 will observe one of them.
    • The final effective states for all other clients will be {X: 1, Y: 3}, {X: 2, Y: 3}, or {X: 4, Y: 3}. Different replicas in the data store may have different values, and that is completely fine.
    • The final effective states for client 1 will be {X: 1, Y: 3}, {X: 2, Y: 3}, or {X: 4, Y: 3}; any one, based on the order that it observed.
    • The final effective states for client 2 will be {X: 1, Y: 3} or {X: 4, Y: 3}; any one, based on the order that it observed.

    Eventual Consistency

    • All orders are valid. They can be different for different clients.
    • The final effective states can be {X: 1, Y: 3}, {X: 2, Y: 3}, or {X: 4, Y: 3}.
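
To make the example concrete, here is a small Python sketch (my own illustration, not from the paper) that enumerates all interleavings of the four writes, keeps only those that preserve each client's own write order, and computes the resulting final states. It confirms the states listed above: {X: 1, Y: 3} or {X: 4, Y: 3} once per-client order is preserved, plus {X: 2, Y: 3} when any interleaving is allowed.

```python
from itertools import permutations

# The four writes from the example above, encoded as (client, key, value).
W1 = ("client1", "X", 1)
W2 = ("client2", "X", 2)
W3 = ("client1", "Y", 3)
W4 = ("client2", "X", 4)
WRITES = [W1, W2, W3, W4]

def respects_client_order(order):
    """True if each client's own writes appear in issue order (W1 < W3, W2 < W4)."""
    return order.index(W1) < order.index(W3) and order.index(W2) < order.index(W4)

def final_state(order):
    """Apply the writes in the given order to the initial state {X: 0, Y: 0}."""
    state = {"X": 0, "Y": 0}
    for _, key, value in order:
        state[key] = value
    return (state["X"], state["Y"])

# Orders allowed once per-client write order must be preserved
# (sequential, causal, and PRAM in this example):
ordered = {final_state(o) for o in permutations(WRITES) if respects_client_order(o)}
# Orders allowed under eventual consistency (any interleaving):
anything = {final_state(o) for o in permutations(WRITES)}

print(ordered)   # contains (1, 3) and (4, 3)
print(anything)  # additionally contains (2, 3)
```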

    Conflict Handling

    In all the examples above, we discussed scenarios involving causal, PRAM, RYW, and eventual consistency, where multiple values for the same key could exist. However, systems designed with these consistency models, such as Dynamo and COPS, typically implement a conflict handler. This conflict handler is associative and commutative, and helps converge multiple values to a single, consistent value.

    In practical applications, this convergence process usually occurs very rapidly, resulting in only one effective and valid state being retained.

    Read-After-Write (RAW) Consistency

    Read-After-Write (RAW) consistency primarily originated as an industry term used by companies like Amazon and Facebook.

    It ensures that reads from any client after any write operation return the written value or a later version. Stated this way, it sounds equivalent to linearizability, since it seems to guarantee immediate visibility of a committed write to all clients.

    However, in reality, this is not linearizability. It is just a descriptive way of stating a common technique used in industry to make writes visible to readers. To give a spoiler, the readers are redirected to the same replica where the writer performed the write. Thus, it is not exactly linearizability, in which all replicas would reflect the write. We will explore the implications of RAW consistency in the context of TAO shortly.
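
Here is a toy sketch of that redirection ("sticky routing") idea. The Replica and StickyClient classes below are hypothetical and exist only to illustrate the technique; they are not TAO's code.

```python
# After a write, the client keeps reading from the replica that accepted it,
# so it always observes its own update even though other replicas may be stale.

class Replica:
    """A single replica modeled as a local dict; replication is not modeled."""
    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

class StickyClient:
    def __init__(self, replicas):
        self.replicas = replicas
        self.pinned = None           # replica that served our last write

    def write(self, key, value):
        replica = self.pinned or self.replicas[0]
        replica.put(key, value)
        self.pinned = replica        # future reads stick to this replica

    def read(self, key):
        replica = self.pinned or self.replicas[0]
        return replica.get(key)

replicas = [Replica(), Replica()]
client = StickyClient(replicas)
client.write("x", 1)
assert client.read("x") == 1          # the writer sees its own write
assert replicas[1].get("x") is None   # another replica can still be stale
```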

    TAO

    TAO is a distributed graph data store widely deployed at Meta (formerly Facebook). It advertises read-after-write (RAW) consistency but, in reality, offers only eventual consistency. Like Dynamo and COPS, it is a NoSQL store optimized for availability and performance over strong consistency.

    Graph Model

    In TAO, the graph structure is represented using objects (nodes) and associations (edges). Both objects and associations are fundamentally stored as key-value pairs, effectively building the graph model on a key-value abstraction.

    Objects

    • The key is the object's unique identifier (id).
    • The value is a structure containing the object type (otype) and a set of key-value pairs representing the object's attributes: <otype, (key -> value)>. The otype determines how the internal key-value pairs within the value are interpreted.

    Associations

    • The key is composed of the source object identifier (id1), the association type (atype), and the destination object identifier (id2): <id1, atype, id2>.
    • The value includes a timestamp (time) and a set of key-value pairs representing the association's attributes: <time, (key -> value)>. The atype allows for defining multiple relationship types between objects.
    All objects and associations are stored as independent key-value pairs, with the interpretation of the value depending on the otype or atype. The system utilizes a limited set of object and association types.

    The time included in the association value facilitates time-based sorting, a crucial feature for social media platforms like Facebook.
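
As a rough illustration of this data model on top of a key-value abstraction, here is a Python sketch. The field names follow the paper's terminology (otype, atype, time), but the classes, ids, and example data are made up for this post.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class TaoObject:
    otype: str                                          # e.g. "user", "checkin"
    data: Dict[str, str] = field(default_factory=dict)  # attribute key -> value

@dataclass
class Assoc:
    time: int                                           # drives time-based sorting
    data: Dict[str, str] = field(default_factory=dict)

# Objects are keyed by id; associations are keyed by <id1, atype, id2>.
objects: Dict[int, TaoObject] = {
    105: TaoObject("user", {"name": "Alice"}),
    534: TaoObject("checkin", {"loc": "Golden Gate Bridge"}),
}
assocs: Dict[Tuple[int, str, int], Assoc] = {
    (105, "authored", 534): Assoc(time=1_370_000_000),
}
```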

    API

    The TAO API, while presented as a graph interface, ultimately operates on the underlying key-value store.

    Write Operations

    • assoc_add(id1, atype, id2, time, (k->v)*): Adds an association of type atype between objects id1 and id2.
    • assoc_del(id1, atype, id2): Deletes the specified association, which corresponds to deleting the key <id1, atype, id2>.
    • assoc_change_type(id1, atype, id2, newtype): Modifies the association type. This involves deleting the existing key <id1, atype, id2> and adding a new key <id1, newtype, id2> in a single atomic operation.

    Read Operations

    • assoc_get(id1, atype, id2set, high?, low?): Retrieves all associations of type atype between id1 and objects within the id2set. This requires retrieving multiple key-value pairs from the database. Optionally, the client can specify the time range (low to high) for the associations.
    • assoc_count(id1, atype): Returns the number of associations of type atype originating from id1.
    • assoc_range(id1, atype, pos, limit): Returns a range of associations of type atype originating from id1, starting at position pos (when sorted temporally), and returning a maximum of limit associations.
    • assoc_time_range(id1, atype, high, low, limit): Returns associations of type atype originating from id1, within a specified time range (low to high), returning a maximum of limit associations.
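
As a quick illustration of how these operations compose, here is a hypothetical Python wrapper. The tao_client object and its methods simply mirror the API listed above; this is not Facebook's client library.

```python
# Hypothetical helpers over a client object assumed to expose the operations above.

def add_comment(tao_client, post_id: int, comment_id: int, now: int) -> None:
    # One new edge in the graph: post -[comment]-> comment object.
    tao_client.assoc_add(post_id, "comment", comment_id, now, {})

def latest_comments(tao_client, post_id: int, limit: int = 50):
    # Range query over the time-sorted association list, newest first.
    return tao_client.assoc_range(post_id, "comment", pos=0, limit=limit)

def comment_count(tao_client, post_id: int) -> int:
    # Answered from the cached association count described later.
    return tao_client.assoc_count(post_id, "comment")
```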

    Architecture

    TAO's architecture is complex, reflecting its diverse range of use cases and the system's evolutionary development. This complexity is managed through a layered design and strategic sharding.

    Sharding

    The entire dataset is divided into logical shards for distribution and scalability. The sharding is based on the object identifier. All associations from an object (id1) are part of the same shard.
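
A minimal sketch of that mapping, assuming a simple modulo scheme (the actual id-to-shard mapping in TAO is an implementation detail not covered above):

```python
NUM_SHARDS = 1024   # hypothetical number of logical shards

def shard_for(object_id: int) -> int:
    """Map an object id to its logical shard (simple modulo for illustration).

    All associations whose source is id1 live on shard_for(id1), so a query
    such as assoc_range(id1, atype, ...) touches exactly one shard.
    """
    return object_id % NUM_SHARDS
```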

    Regional Deployment and Replication

    TAO is deployed across multiple regions, each acting as a leader for a subset of the logical shards. Notably, each region stores all shards, regardless of its leadership role. This ensures data availability even if a region becomes temporarily unavailable.

    Each region comprises two layers: a storage layer and a cache layer.

    Storage Layer (MySQL Database)

    All objects and associations are persisted in a sharded MySQL database. The API operations are translated into corresponding SQL queries. 

    The storage layer within each region maintains all logical shards, with objects and associations stored in separate tables.


    All objects are stored in one table, and all associations in another. The tables act as a key-value store where the value is serialized into a single column. The associations table has an additional index on <id1, atype, time> to support range queries.
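
To make this concrete, here is a rough sketch of what the two tables could look like, written as Python string constants. This is a guess at the shape of the schema based on the description above, not TAO's actual table definitions.

```python
OBJECTS_DDL = """
CREATE TABLE objects (
    id    BIGINT PRIMARY KEY,
    otype INT NOT NULL,
    data  BLOB               -- serialized (key -> value) attributes
);
"""

ASSOCS_DDL = """
CREATE TABLE associations (
    id1   BIGINT NOT NULL,
    atype INT NOT NULL,
    id2   BIGINT NOT NULL,
    time  BIGINT NOT NULL,
    data  BLOB,              -- serialized (key -> value) attributes
    PRIMARY KEY (id1, atype, id2),
    KEY by_time (id1, atype, time)   -- supports time-ordered range queries
);
"""
```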

    Cache Layer

    This layer caches objects, association lists, and association counts, employing an LRU eviction policy. The cache is not simply a passive storage; it understands the semantics of the stored key-value pairs and can execute application logic.

    The cache layer is organized hierarchically:
    • Follower Tier: Client interactions occur at this tier, with multiple follower tiers per region.
    • Leader Tier: This tier communicates directly with the storage layer.
    Each tier (follower and leader) maintains all shards. Consistent hashing distributes shards across servers within a tier. To manage high load on specific shards, those shards may be cloned and assigned to multiple servers.
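
For readers unfamiliar with consistent hashing, here is a minimal hash-ring sketch showing how shards could be mapped to cache servers within a tier; it is purely illustrative and does not model TAO's actual placement or the cloning of hot shards.

```python
import hashlib
from bisect import bisect_right

class HashRing:
    """Shards are placed on a hash ring of servers, so adding or removing a
    server only remaps a small fraction of the shards."""

    def __init__(self, servers, points_per_server=64):
        self.ring = sorted(
            (self._hash(f"{server}:{i}"), server)
            for server in servers
            for i in range(points_per_server)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def server_for_shard(self, shard_id: int) -> str:
        idx = bisect_right(self.keys, self._hash(str(shard_id))) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["cache-1", "cache-2", "cache-3"])
print(ring.server_for_shard(42))   # one of the three servers
```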


    Within each cache tier, TAO utilizes a slab allocator to manage memory. RAM is divided into arenas, each dedicated to a specific object or association type. This partitioning provides isolation for LRU eviction policies, ensuring that different data types do not interfere with each other. Further optimizations exist for small, fixed-size items like association counts.

    The cache is responsible for storing the following data:

    1. Objects: Each object, keyed by its unique identifier (<id>).
    2. Association Counts: The number of associations for each combination of object identifier and association type (<id, atype>). This is crucial for efficiently executing assoc_count queries.
    3. Association Lists: Lists of associations for each <id, atype> combination, ordered by time and typically limited to 6,000 entries. These lists are used to answer range queries; queries that reach beyond this cached limit go to the database, as sketched in the example below.

    It's important to note that these cached items directly reflect TAO's graph model. However, the underlying database operates as a simple key-value store.
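
Here is an illustrative sketch of these cache entries and of a range query falling through to the database when it reaches beyond the cached prefix. The CachedAssoc/CachedShard classes and the db fallback object are hypothetical, not TAO's real data structures.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

ASSOC_LIST_LIMIT = 6000   # cached association lists are bounded, as noted above

@dataclass
class CachedAssoc:
    id2: int
    time: int

@dataclass
class CachedShard:
    """Illustrative follower-cache state for one shard."""
    objects: Dict[int, dict] = field(default_factory=dict)
    assoc_counts: Dict[Tuple[int, str], int] = field(default_factory=dict)
    # Association lists are kept sorted by time, newest first.
    assoc_lists: Dict[Tuple[int, str], List[CachedAssoc]] = field(default_factory=dict)

    def assoc_range(self, id1: int, atype: str, pos: int, limit: int, db):
        cached = self.assoc_lists.get((id1, atype))
        if cached is not None and pos + limit <= len(cached):
            return cached[pos:pos + limit]      # served entirely from the cache
        # The requested range reaches beyond the cached prefix (at most
        # ASSOC_LIST_LIMIT newest associations); fall through to the database.
        return db.assoc_range(id1, atype, pos, limit)
```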

    Scalability

    TAO's deployment model facilitates high scalability. Read operations are primarily handled by the local follower tier cache. Cache misses trigger queries to the region's storage layer (MySQL).

    Write operations (described next) are propagated asynchronously to other regions.

    Write Propagation

    Write operations are propagated synchronously from the follower cache tier to the leader cache tier, which then updates the database in the storage layer.

    Upon database write, a changeset is generated, encapsulating the modifications. For example, deleting an association results in a changeset that:

    • Removes the association from the corresponding association list.
    • Decrements the association count for the object and association type.

    The changeset is applied synchronously along the path: client -> follower cache tier -> leader cache tier -> leader database. However, updates to other cache tiers and databases are performed asynchronously, using an invalidation-based approach.
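
To illustrate what applying such a changeset to a cache might look like, here is a small sketch that reuses the illustrative CachedShard structure from the cache-layer example above. It is an assumption-laden illustration, not TAO's actual changeset format.

```python
def apply_delete_changeset(cache: "CachedShard", id1: int, atype: str, id2: int) -> None:
    """Apply the changeset produced by deleting association <id1, atype, id2>."""
    key = (id1, atype)
    # Remove the association from the cached, time-ordered association list.
    if key in cache.assoc_lists:
        cache.assoc_lists[key] = [a for a in cache.assoc_lists[key] if a.id2 != id2]
    # Decrement the cached association count for this (id1, atype) pair.
    if key in cache.assoc_counts:
        cache.assoc_counts[key] = max(0, cache.assoc_counts[key] - 1)
```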

    Leader Cache Tier Invalidation

    After applying the update, the leader cache tier sends invalidation messages to all follower cache tiers within its region for the affected object. It also sends refill messages for association lists.

    Importantly, a leader cache tier only sends invalidation messages within its own region.

    Database Replication and Inter-Regional Invalidation

    The database replicates the updates to the non-leader regions. The replication stream then delivers invalidation messages to the leader cache tier of each non-leader region, and those leader cache tiers forward the invalidations to the follower cache tiers in their respective regions.

    Example

    Let's run through the paper's example to understand how writes work.


    In the figure above, solid lines represent data flow, while dotted lines indicate control messages, specifically invalidations and refills. Consider a write operation initiated by a client in a non-leader region.

    The write request goes from the follower cache in the non-leader region to that region's leader cache, and then on to the leader region's leader cache. Along this path, the write is applied synchronously at each component. The leader cache then performs the write on the leader database.

    Following the database update, the leader cache sends invalidation messages to all follower caches within its own (leader) region.

    Subsequently, the leader database replicates the write to the non-leader region's database. The leader database then sends an invalidation message to the leader cache within the non-leader region. This non-leader leader cache, in turn, propagates invalidation messages to all follower caches within its region.

    Note: In the example above, the non-leader region's leader cache receives the invalidation message twice.

    Consistency Model

    While TAO is often described as eventually consistent, a deeper examination reveals a more nuanced consistency model.

    Per-Key Sequential Consistency

    TAO leverages MySQL as its underlying storage. Each key-value pair is managed by a single MySQL database instance, which acts as the leader for that data. Although data is replicated across regions, a single leader is designated for each key-value pair at any given time. This design enables the serialization of writes for a specific key-value pair.

    Because a single MySQL process acts as the leader for all operations on an object and its associations, atomic operations, such as assoc_change_type, can be implemented using SQL queries. This atomicity could also be achieved with a non-distributed key-value store like LevelDB, which provides per-row atomic transactions.
    Additionally, every write is tagged with a version number. Follower replicas use these version numbers to apply updates in the same order as the database, ensuring that all replicas observe the same order of writes for a given key. Unlike Dynamo and COPS, TAO does not need explicit conflict handlers; conflicts are resolved automatically by the single per-key leader.
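
A minimal sketch of that version-gating idea, assuming each write carries a monotonically increasing per-key version assigned by the leader database (an illustration, not TAO's actual protocol):

```python
class VersionedCache:
    def __init__(self):
        self.entries = {}   # key -> (version, value)

    def apply_update(self, key, version, value):
        current_version, _ = self.entries.get(key, (0, None))
        if version > current_version:
            self.entries[key] = (version, value)
        # else: an older update arrived late; ignore it rather than regress.

cache = VersionedCache()
cache.apply_update("obj:105", version=2, value={"name": "Alice"})
cache.apply_update("obj:105", version=1, value={"name": "Al"})   # late, ignored
assert cache.entries["obj:105"][0] == 2
```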

    Is TAO Sequentially Consistent?

    NO!

    While writes for a single key-value pair are serialized, TAO does not guarantee sequential consistency across all key-value pairs. The single leader approach prevents conflicts within a specific key-value pair, but clients may observe updates to different keys in varying orders.

    Read-Write Transactions

    TAO does not support traditional read-write transactions. This aligns with systems like Dynamo and COPS, which prioritize availability over strong consistency. All writes are blind-writes.

    Read-After-Write (RAW) Consistency

    Within a single cache tier (follower tier), TAO provides RAW consistency. When a write is successfully applied to the database, the local cache tier is updated. However, this RAW consistency is not guaranteed across different cache tiers. A client querying a different tier immediately after a write may not see the updated value. 

    Fault Tolerance

    Given TAO's effective reliance on eventual consistency, its fault tolerance model is designed for simplicity and resilience.

    Database Failures

    • Leader Database Failure: A non-leader database takes over as the new leader.
    • Non-Leader Database Failure: Read requests are redirected to the leader database.

    Cache Failures

    Cache invalidation messages are delivered on a best-effort basis. The consistency model is not compromised if these messages are lost; updates will still eventually propagate, and best-effort delivery simply keeps the caches as fresh as possible.
    • Leader Cache Tier Failure:
      • Follower cache misses are redirected to the database directly.
      • Write requests are redirected to another available leader cache tier replica, which then enqueues invalidation messages for the original leader.
    • Invalidation Failure: The leader cache tier enqueues messages and delivers them when the follower becomes reachable.
    • Follower Cache Tier Failure: Client requests are redirected to another follower cache tier, which results in the loss of RAW consistency for the client.

    Evaluation

    • TAO demonstrates remarkably high availability in production environments, reporting a 99.999% uptime.
    • A single follower cache tier is capable of handling up to 600k QPS during peak hit rates. However, throughput decreases as the hit rate declines, due to expensive downstream data fetches for cache misses.
    • The overall cache hit rate is 96.4%.
    • Write latency within a single region is 12.1 ms. For remote region writes, the latency increases to 74.4 ms, with 58.1 ms attributed to round-trip network latency.
    • Despite its eventual consistency model, TAO exhibits low replication delays:
      • p85: 1s
      • p99: 3s
      • p999: 10s

    Paper Review

    This paper is a great read, offering a clear API specification and numerous practical lessons. A notable contrast exists between this industry paper and academic works like the COPS paper. This difference arises from their distinct motivations: academic papers prioritize theoretical excellence, while industry papers focus on delivering practical solutions. The TAO paper exemplifies this industry-driven approach, highlighting the real-world considerations that shape large-scale system design.
