
Paper Insights #27 - TAO: Facebook's Distributed Data Store for the Social Graph

Following our discussion of causal consistency in COPS, this paper presents an eventually consistent data store designed for the social graph. It was presented at USENIX ATC 2013, a prestigious systems venue.

Paper Link

Let's begin with some basic concepts.

Consistency Models Revisited

We've previously explored linearizability and sequential consistency in ZooKeeper, as well as causal consistency in COPS. In this discussion, we'll examine the weaker consistency models, down to eventual consistency, and provide an illustrative example to summarize them, all within the context of single-key distributed key-value stores.

The phrase "consistency model" in the context of a single data item read-write system means the following:

  • Say clients are observing all the writes happening in the store. A consistency model determines the order in which the clients see the writes.
  • After all the writes are applied, the consistency model determines what would be the effective state of the objects in the store.

The consistency model hierarchy is:

Linearizable -> Sequential -> Causal -> PRAM -> Read-Your-Writes (RYW) -> Eventual

Read-Your-Writes (RYW)

This model guarantees that a client can observe the effect of its own writes.

Example

Consider the following example, with the initial state being {X: 0, Y: 0}:

The real-time ordering of events is:

  1. W(X, 1) or W1 - Client 1 sets key X to value 1.
  2. W(X, 2) or W2 - Client 2 sets key X to value 2.
  3. W(Y, 3) or W3 - Client 1 sets key Y to value 3.
  4. W(X, 4) or W4 - Client 2 sets key X to value 4.

The causal relationships are:

  1. W1 -> W3 - writes issued by the same client (client 1), i.e., in session order.
  2. W2 -> W4 - writes issued by the same client (client 2), i.e., in session order.

Now we can define each consistency model.

Linearizability

  • Order of writes: W1, W2, W3, W4. All clients will observe this and only this order.
  • The final effective state will be {X: 4, Y: 3}.

Sequential Consistency

  • Possible orders are all permutations that preserve each client's own write order (W1 before W3 and W2 before W4); only one of these orders will be chosen, and all clients will observe that same order:
    • W1, W2, W3, W4
    • W1, W2, W4, W3
    • W1, W3, W2, W4
    • and so on.
  • The final effective state will be either {X: 1, Y: 3} or {X: 4, Y: 3}, based on which total order was chosen.
Important Note: For models at least as strong as sequential consistency, the system maintains a single deterministic order of writes. There will only be one order and hence only one final state. Beyond this point, multiple values for the same key can exist without violating the consistency model. When such multiple values exist, the system can choose to present all the known values to the clients for the clients to decide.

Causal Consistency

  • Possible orders are listed below (all orders in which W1 appears before W3 and W2 appears before W4); clients can observe any of these orders, and different clients may observe different orders:
    • W1, W2, W3, W4
    • W1, W2, W4, W3
    • W1, W3, W2, W4
    • W2, W1, W3, W4
    • W2, W1, W4, W3
    • W2, W4, W1, W3
  • The final effective states will be {X: 1, Y: 3} and {X: 4, Y: 3}. Both are valid and the system can return different values for X without affecting consistency.

PRAM (or FIFO) Consistency

  • Possible orders are listed below (all orders in which W1 appears before W3 and W2 appears before W4); clients can observe any of these orders, and different clients may observe different orders:
    • W1, W2, W3, W4
    • W1, W2, W4, W3
    • W1, W3, W2, W4
    • W2, W1, W3, W4
    • W2, W1, W4, W3
    • W2, W4, W1, W3
  • The final effective states will be {X: 1, Y: 3} and {X: 4, Y: 3}. Both are valid and the system can return different values for X without affecting consistency.
For this example, PRAM consistency appears to be the same as causal consistency. To distinguish them, let's say there was a R(X) issued by client 2 before W2, which fetched value 1. Then we have the additional causal relationships W1 -> R(X) and R(X) -> W2. Under causal consistency, W1 must now precede W2 (and hence W4) in every observed order, so the final effective state will be {X: 4, Y: 3}.



For PRAM consistency, however, all orders mentioned above would still be valid, because PRAM only preserves each client's own write order and ignores the ordering introduced by the read.

RYW Consistency

This is where things start to differ slightly.
  • Possible orders for all clients except clients 1 and 2 are all 4! permutations; any ordering is fine.
  • Possible orders for client 1 are those in which W1 is ordered before W3; client 1 will observe one of them.
  • Similarly, possible orders for client 2 are those in which W2 is ordered before W4; client 2 will observe one of them.
    • The final effective states for all other clients will be {X: 1, Y: 3}, {X: 2, Y: 3}, or {X: 4, Y: 3}. Different replicas in the data store may have different values, and that is completely fine.
    • The final effective states for client 1 will be {X: 1, Y: 3}, {X: 2, Y: 3}, or {X: 4, Y: 3}; any one, based on the order that it observed.
    • The final effective states for client 2 will be {X: 1, Y: 3} or {X: 4, Y: 3}; any one, based on the order that it observed.

    Eventual Consistency

    • All orders are valid. They can be different for different clients.
    • The final effective states can be {X: 1, Y: 3}, {X: 2, Y: 3}, or {X: 4, Y: 3}.
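
To make the example concrete, here is a small Python sketch (my own illustration, not from the paper) that enumerates all interleavings of the four writes, keeps only those that preserve each client's own write order, and computes the resulting final states. It confirms the states listed above: {X: 1, Y: 3} or {X: 4, Y: 3} once per-client order is preserved, plus {X: 2, Y: 3} when any interleaving is allowed.

```python
from itertools import permutations

# The four writes from the example above, encoded as (client, key, value).
W1 = ("client1", "X", 1)
W2 = ("client2", "X", 2)
W3 = ("client1", "Y", 3)
W4 = ("client2", "X", 4)
WRITES = [W1, W2, W3, W4]

def respects_client_order(order):
    """True if each client's own writes appear in issue order (W1 < W3, W2 < W4)."""
    return order.index(W1) < order.index(W3) and order.index(W2) < order.index(W4)

def final_state(order):
    """Apply the writes in the given order to the initial state {X: 0, Y: 0}."""
    state = {"X": 0, "Y": 0}
    for _, key, value in order:
        state[key] = value
    return (state["X"], state["Y"])

# Orders allowed once per-client write order must be preserved
# (sequential, causal, and PRAM in this example):
ordered = {final_state(o) for o in permutations(WRITES) if respects_client_order(o)}
# Orders allowed under eventual consistency (any interleaving):
anything = {final_state(o) for o in permutations(WRITES)}

print(ordered)   # contains (1, 3) and (4, 3)
print(anything)  # additionally contains (2, 3)
```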

    Conflict Handling

    In all the examples above, we discussed scenarios involving causal, PRAM, RYW, and eventual consistency, where multiple values for the same key could exist. However, systems designed with these consistency models, such as Dynamo and COPS, typically implement a conflict handler. This conflict handler is associative and commutative, and helps converge multiple values to a single, consistent value.

    In practical applications, this convergence process usually occurs very rapidly, resulting in only one effective and valid state being retained.

    Read-After-Write (RAW) Consistency

    Read-After-Write (RAW) consistency primarily originated as an industry term used by companies like Amazon and Facebook.

    It ensures that reads from any client after any write operation return the written value or a later version. Stated this way, it sounds equivalent to linearizability, since it seems to guarantee immediate visibility of a committed write to all clients.

    However, in reality, this is not linearizability. It is just a descriptive way of stating a common technique used in industry to make writes visible to readers. To give a spoiler, the readers are redirected to the same replica where the writer performed the write. Thus, it is not exactly linearizability, in which all replicas would reflect the write. We will explore the implications of RAW consistency in the context of TAO shortly.
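
Here is a toy sketch of that redirection ("sticky routing") idea. The Replica and StickyClient classes below are hypothetical and exist only to illustrate the technique; they are not TAO's code.

```python
# After a write, the client keeps reading from the replica that accepted it,
# so it always observes its own update even though other replicas may be stale.

class Replica:
    """A single replica modeled as a local dict; replication is not modeled."""
    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

class StickyClient:
    def __init__(self, replicas):
        self.replicas = replicas
        self.pinned = None           # replica that served our last write

    def write(self, key, value):
        replica = self.pinned or self.replicas[0]
        replica.put(key, value)
        self.pinned = replica        # future reads stick to this replica

    def read(self, key):
        replica = self.pinned or self.replicas[0]
        return replica.get(key)

replicas = [Replica(), Replica()]
client = StickyClient(replicas)
client.write("x", 1)
assert client.read("x") == 1          # the writer sees its own write
assert replicas[1].get("x") is None   # another replica can still be stale
```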

    TAO

    TAO is a distributed graph data store widely deployed at Meta (formerly Facebook). It advertises read-after-write (RAW) consistency but, in reality, offers only eventual consistency. Like Dynamo and COPS, it is a NoSQL store optimized for availability and performance over strong consistency.

    Graph Model

    In TAO, the graph structure is represented using objects (nodes) and associations (edges). Both objects and associations are fundamentally stored as key-value pairs, effectively building the graph model on a key-value abstraction.

    Objects

    • The key is the object's unique identifier (id).
    • The value is a structure containing the object type (otype) and a set of key-value pairs representing the object's attributes: <otype, (key -> value)>. The otype determines how the internal key-value pairs within the value are interpreted.

    Associations

    • The key is composed of the source object identifier (id1), the association type (atype), and the destination object identifier (id2): <id1, atype, id2>.
    • The value includes a timestamp (time) and a set of key-value pairs representing the association's attributes: <time, (key -> value)>. The atype allows for defining multiple relationship types between objects.
    All objects and associations are stored as independent key-value pairs, with the interpretation of the value depending on the otype or atype. The system utilizes a limited set of object and association types.

    The time included in the association value facilitates time-based sorting, a crucial feature for social media platforms like Facebook.
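
As a rough illustration of this data model on top of a key-value abstraction, here is a Python sketch. The field names follow the paper's terminology (otype, atype, time), but the classes, ids, and example data are made up for this post.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class TaoObject:
    otype: str                                          # e.g. "user", "checkin"
    data: Dict[str, str] = field(default_factory=dict)  # attribute key -> value

@dataclass
class Assoc:
    time: int                                           # drives time-based sorting
    data: Dict[str, str] = field(default_factory=dict)

# Objects are keyed by id; associations are keyed by <id1, atype, id2>.
objects: Dict[int, TaoObject] = {
    105: TaoObject("user", {"name": "Alice"}),
    534: TaoObject("checkin", {"loc": "Golden Gate Bridge"}),
}
assocs: Dict[Tuple[int, str, int], Assoc] = {
    (105, "authored", 534): Assoc(time=1_370_000_000),
}
```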

    API

    The TAO API, while presented as a graph interface, ultimately operates on the underlying key-value store.

    Write Operations

    • assoc_add(id1, atype, id2, time, (k->v)*): Adds an association of type atype between objects id1 and id2.
    • assoc_del(id1, atype, id2): Deletes the specified association, which corresponds to deleting the key <id1, atype, id2>.
    • assoc_change_type(id1, atype, id2, newtype): Modifies the association type. This involves deleting the existing key <id1, atype, id2> and adding a new key <id1, newtype, id2> in a single atomic operation.

    Read Operations

    • assoc_get(id1, atype, id2set, high?, low?): Retrieves all associations of type atype between id1 and objects within the id2set. This requires retrieving multiple key-value pairs from the database. Optionally, the client can specify the time range (low to high) for the associations.
    • assoc_count(id1, atype): Returns the number of associations of type atype originating from id1.
    • assoc_range(id1, atype, pos, limit): Returns a range of associations of type atype originating from id1, starting at position pos (when sorted temporally), and returning a maximum of limit associations.
    • assoc_time_range(id1, atype, high, low, limit): Returns associations of type atype originating from id1, within a specified time range (low to high), returning a maximum of limit associations.
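
As a quick illustration of how these operations compose, here is a hypothetical Python wrapper. The tao_client object and its methods simply mirror the API listed above; this is not Facebook's client library.

```python
# Hypothetical helpers over a client object assumed to expose the operations above.

def add_comment(tao_client, post_id: int, comment_id: int, now: int) -> None:
    # One new edge in the graph: post -[comment]-> comment object.
    tao_client.assoc_add(post_id, "comment", comment_id, now, {})

def latest_comments(tao_client, post_id: int, limit: int = 50):
    # Range query over the time-sorted association list, newest first.
    return tao_client.assoc_range(post_id, "comment", pos=0, limit=limit)

def comment_count(tao_client, post_id: int) -> int:
    # Answered from the cached association count described later.
    return tao_client.assoc_count(post_id, "comment")
```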

    Architecture

    TAO's architecture is complex, reflecting its diverse range of use cases and the system's evolutionary development. This complexity is managed through a layered design and strategic sharding.

    Sharding

    The entire dataset is divided into logical shards for distribution and scalability. The sharding is based on the object identifier. All associations from an object (id1) are part of the same shard.
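
A minimal sketch of that mapping, assuming a simple modulo scheme (the actual id-to-shard mapping in TAO is an implementation detail not covered above):

```python
NUM_SHARDS = 1024   # hypothetical number of logical shards

def shard_for(object_id: int) -> int:
    """Map an object id to its logical shard (simple modulo for illustration).

    All associations whose source is id1 live on shard_for(id1), so a query
    such as assoc_range(id1, atype, ...) touches exactly one shard.
    """
    return object_id % NUM_SHARDS
```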

    Regional Deployment and Replication

    TAO is deployed across multiple regions, each acting as a leader for a subset of the logical shards. Notably, each region stores all shards, regardless of its leadership role. This ensures data availability even if a region becomes temporarily unavailable.

    Each region comprises two layers: a storage layer and a cache layer.

    Storage Layer (MySQL Database)

    All objects and associations are persisted in a sharded MySQL database. The API operations are translated into corresponding SQL queries. 

    The storage layer within each region maintains all logical shards, with objects and associations stored in separate tables.


    All objects are stored in one table, and all associations in another. The tables act as a key-value store where the value is serialized into a single column. The associations table has an additional index on <id1, atype, time> to support range queries.
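
To make this concrete, here is a rough sketch of what the two tables could look like, written as Python string constants. This is a guess at the shape of the schema based on the description above, not TAO's actual table definitions.

```python
OBJECTS_DDL = """
CREATE TABLE objects (
    id    BIGINT PRIMARY KEY,
    otype INT NOT NULL,
    data  BLOB               -- serialized (key -> value) attributes
);
"""

ASSOCS_DDL = """
CREATE TABLE associations (
    id1   BIGINT NOT NULL,
    atype INT NOT NULL,
    id2   BIGINT NOT NULL,
    time  BIGINT NOT NULL,
    data  BLOB,              -- serialized (key -> value) attributes
    PRIMARY KEY (id1, atype, id2),
    KEY by_time (id1, atype, time)   -- supports time-ordered range queries
);
"""
```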

    Cache Layer

    This layer caches objects, association lists, and association counts, employing an LRU eviction policy. The cache is not simply a passive storage; it understands the semantics of the stored key-value pairs and can execute application logic.

    The cache layer is organized hierarchically:
    • Follower Tier: Client interactions occur at this tier, with multiple follower tiers per region.
    • Leader Tier: This tier communicates directly with the storage layer.
    Each tier (follower and leader) maintains all shards. Consistent hashing distributes shards across servers within a tier. To manage high load on specific shards, those shards may be cloned and assigned to multiple servers.
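
For readers unfamiliar with consistent hashing, here is a minimal hash-ring sketch showing how shards could be mapped to cache servers within a tier; it is purely illustrative and does not model TAO's actual placement or the cloning of hot shards.

```python
import hashlib
from bisect import bisect_right

class HashRing:
    """Shards are placed on a hash ring of servers, so adding or removing a
    server only remaps a small fraction of the shards."""

    def __init__(self, servers, points_per_server=64):
        self.ring = sorted(
            (self._hash(f"{server}:{i}"), server)
            for server in servers
            for i in range(points_per_server)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def server_for_shard(self, shard_id: int) -> str:
        idx = bisect_right(self.keys, self._hash(str(shard_id))) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["cache-1", "cache-2", "cache-3"])
print(ring.server_for_shard(42))   # one of the three servers
```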


    Within each cache tier, TAO utilizes a slab allocator to manage memory. RAM is divided into arenas, each dedicated to a specific object or association type. This partitioning provides isolation for LRU eviction policies, ensuring that different data types do not interfere with each other. Further optimizations exist for small, fixed-size items like association counts.

    The cache is responsible for storing the following data:

    1. Objects: Each object, keyed by its unique identifier (<id>).
    2. Association Counts: The number of associations for each combination of object identifier and association type (<id, atype>). This is crucial for efficiently executing assoc_count queries.
    3. Association Lists: Lists of associations for each <id, atype> combination, ordered by time and typically limited to 6,000 entries. These lists are used to answer range queries; queries that reach beyond this cached limit go to the database, as sketched in the example below.

    It's important to note that these cached items directly reflect TAO's graph model. However, the underlying database operates as a simple key-value store.
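
Here is an illustrative sketch of these cache entries and of a range query falling through to the database when it reaches beyond the cached prefix. The CachedAssoc/CachedShard classes and the db fallback object are hypothetical, not TAO's real data structures.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

ASSOC_LIST_LIMIT = 6000   # cached association lists are bounded, as noted above

@dataclass
class CachedAssoc:
    id2: int
    time: int

@dataclass
class CachedShard:
    """Illustrative follower-cache state for one shard."""
    objects: Dict[int, dict] = field(default_factory=dict)
    assoc_counts: Dict[Tuple[int, str], int] = field(default_factory=dict)
    # Association lists are kept sorted by time, newest first.
    assoc_lists: Dict[Tuple[int, str], List[CachedAssoc]] = field(default_factory=dict)

    def assoc_range(self, id1: int, atype: str, pos: int, limit: int, db):
        cached = self.assoc_lists.get((id1, atype))
        if cached is not None and pos + limit <= len(cached):
            return cached[pos:pos + limit]      # served entirely from the cache
        # The requested range reaches beyond the cached prefix (at most
        # ASSOC_LIST_LIMIT newest associations); fall through to the database.
        return db.assoc_range(id1, atype, pos, limit)
```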

    Scalability

    TAO's deployment model facilitates high scalability. Read operations are primarily handled by the local follower tier cache. Cache misses trigger queries to the region's storage layer (MySQL).

    Write operations (described next) are propagated asynchronously to other regions.

    Write Propagation

    Write operations are propagated synchronously from the follower cache tier to the leader cache tier, which then updates the database in the storage layer.

    Upon database write, a changeset is generated, encapsulating the modifications. For example, deleting an association results in a changeset that:

    • Removes the association from the corresponding association list.
    • Decrements the association count for the object and association type.

    The changeset is applied synchronously along the path: client -> follower cache tier -> leader cache tier -> leader database. However, updates to other cache tiers and databases are performed asynchronously, using an invalidation-based approach.
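
To illustrate what applying such a changeset to a cache might look like, here is a small sketch that reuses the illustrative CachedShard structure from the cache-layer example above. It is an assumption-laden illustration, not TAO's actual changeset format.

```python
def apply_delete_changeset(cache: "CachedShard", id1: int, atype: str, id2: int) -> None:
    """Apply the changeset produced by deleting association <id1, atype, id2>."""
    key = (id1, atype)
    # Remove the association from the cached, time-ordered association list.
    if key in cache.assoc_lists:
        cache.assoc_lists[key] = [a for a in cache.assoc_lists[key] if a.id2 != id2]
    # Decrement the cached association count for this (id1, atype) pair.
    if key in cache.assoc_counts:
        cache.assoc_counts[key] = max(0, cache.assoc_counts[key] - 1)
```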

    Leader Cache Tier Invalidation

    After applying the update, the leader cache tier sends invalidation messages to all follower cache tiers within its region for the affected object. It also sends refill messages for association lists.

    Importantly, a leader cache tier only sends invalidation messages within its own region.

    Database Replication and Inter-Regional Invalidation

    The database replicates the updates to the non-leader regions. The replication stream then delivers invalidation messages to the leader cache tier of each non-leader region, and those leader cache tiers forward the invalidations to the follower cache tiers in their respective regions.

    Example

    Let's run through the paper's example to understand how writes work.


    In the figure above, solid lines represent data flow, while dotted lines indicate control messages, specifically invalidations and refills. Consider a write operation initiated by a client in a non-leader region.

    The write request goes from the follower cache in the non-leader region to that region's leader cache, and then on to the leader region's leader cache. Along this path, the write is applied synchronously at each component. The leader cache then performs the write on the leader database.

    Following the database update, the leader cache sends invalidation messages to all follower caches within its own (leader) region.

    Subsequently, the leader database replicates the write to the non-leader region's database. The leader database then sends an invalidation message to the leader cache within the non-leader region. This non-leader leader cache, in turn, propagates invalidation messages to all follower caches within its region.

    Note: In the example above, the non-leader region's leader cache receives the invalidation message twice.

    Consistency Model

    While TAO is often described as eventually consistent, a deeper examination reveals a more nuanced consistency model.

    Per-Key Sequential Consistency

    TAO leverages MySQL as its underlying storage. Each key-value pair is managed by a single MySQL database instance, which acts as the leader for that data. Although data is replicated across regions, a single leader is designated for each key-value pair at any given time. This design enables the serialization of writes for a specific key-value pair.

    Because a single MySQL process acts as the leader for all operations on an object and its associations, atomic operations, such as assoc_change_type, can be implemented using SQL queries. This atomicity could also be achieved with a non-distributed key-value store like LevelDB, which provides per-row atomic transactions.
    Additionally, every write is tagged with a version number. Follower replicas use these version numbers to apply updates in the same order as the database, ensuring that all replicas observe the same order of writes for a given key. Unlike Dynamo and COPS, TAO does not need explicit conflict handlers; conflicts are resolved automatically by the single per-key leader.
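
A minimal sketch of that version-gating idea, assuming each write carries a monotonically increasing per-key version assigned by the leader database (an illustration, not TAO's actual protocol):

```python
class VersionedCache:
    def __init__(self):
        self.entries = {}   # key -> (version, value)

    def apply_update(self, key, version, value):
        current_version, _ = self.entries.get(key, (0, None))
        if version > current_version:
            self.entries[key] = (version, value)
        # else: an older update arrived late; ignore it rather than regress.

cache = VersionedCache()
cache.apply_update("obj:105", version=2, value={"name": "Alice"})
cache.apply_update("obj:105", version=1, value={"name": "Al"})   # late, ignored
assert cache.entries["obj:105"][0] == 2
```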

    Is TAO Sequentially Consistent?

    NO!

    While writes for a single key-value pair are serialized, TAO does not guarantee sequential consistency across all key-value pairs. The single leader approach prevents conflicts within a specific key-value pair, but clients may observe updates to different keys in varying orders.

    Read-Write Transactions

    TAO does not support traditional read-write transactions. This aligns with systems like Dynamo and COPS, which prioritize availability over strong consistency. All writes are blind-writes.

    Read-After-Write (RAW) Consistency

    Within a single cache tier (follower tier), TAO provides RAW consistency. When a write is successfully applied to the database, the local cache tier is updated. However, this RAW consistency is not guaranteed across different cache tiers. A client querying a different tier immediately after a write may not see the updated value. 

    Fault Tolerance

    Given TAO's effective reliance on eventual consistency, its fault tolerance model is designed for simplicity and resilience.

    Database Failures

    • Leader Database Failure: A non-leader database takes over as the new leader.
    • Non-Leader Database Failure: Read requests are redirected to the leader database.

    Cache Failures

    Cache invalidation messages are delivered on a best-effort basis. The consistency model is not compromised if these messages are lost; updates will still eventually propagate, and best-effort delivery simply keeps the caches as fresh as possible.
    • Leader Cache Tier Failure:
      • Follower cache misses are redirected to the database directly.
      • Write requests are redirected to another available leader cache tier replica, which then enqueues invalidation messages for the original leader.
    • Invalidation Failure: The leader cache tier enqueues messages and delivers them when the follower becomes reachable.
    • Follower Cache Tier Failure: Client requests are redirected to another follower cache tier, which results in the loss of RAW consistency for the client.

    Evaluation

    • TAO demonstrates remarkably high availability in production environments, reporting a 99.999% uptime.
    • A single follower cache tier is capable of handling up to 600k QPS during peak hit rates. However, throughput decreases as the hit rate declines, due to expensive downstream data fetches for cache misses.
    • The overall cache hit rate is 96.4%.
    • Write latency within a single region is 12.1 ms. For remote region writes, the latency increases to 74.4 ms, with 58.1 ms attributed to round-trip network latency.
    • Despite its eventual consistency model, TAO exhibits low replication delays:
      • p85: 1s
      • p99: 3s
      • p999: 10s

    Paper Review

    This paper is a great read, offering a clear API specification and numerous practical lessons. A notable contrast exists between this industry paper and academic works like the COPS paper. This difference arises from their distinct motivations: academic papers prioritize theoretical excellence, while industry papers focus on delivering practical solutions. The TAO paper exemplifies this industry-driven approach, highlighting the real-world considerations that shape large-scale system design.
