
Paper Insights #25 - CliqueMap: Productionizing an RMA-Based Distributed Caching System

Memcached is a popular in-memory cache, but I'd like to discuss CliqueMap, Google's caching solution. Having worked closely with CliqueMap, I have a deep understanding of its architecture. One major difference from Memcached is CliqueMap's use of RMA for reads. We'll also take a closer look at RDMA, a crucial cloud technology that emerged in the 2010s.

Paper Link

Let's begin with some basic concepts.

Network Interface Card (NIC)

The NIC facilitates data reception and transmission. Understanding its operation requires examining the fundamental interaction between the CPU and memory.

CPU <-> Memory Communication

In the von Neumann architecture, the CPU and memory are the core components that together enable general-purpose (Turing-complete) computation. They communicate over the system bus (e.g., PCIe), a set of electrical pathways connecting the CPU, memory, and I/O devices. The system bus comprises three primary logical components:

  • Data Bus: Bidirectional, carrying the actual data being transferred.
  • Address Bus: Unidirectional, carrying memory addresses for data access.
  • Control Bus: Carrying control signals for synchronization and coordination.



Modern systems employ hierarchical bus architectures to enhance communication speed, moving beyond the simple single-bus model.

NIC Interactions

The NIC interfaces directly with the system bus and incorporates an internal buffer memory.


Receive (RX)

  • The NIC buffers incoming data into its internal memory.
  • When its buffer fills up, the NIC generates a maskable interrupt to the CPU.
  • The kernel (i.e., operating-system code running on the CPU) responds to the interrupt by copying the data from the NIC's buffer into RAM.
  • Higher-layer protocols then process the data (e.g., TCP/IP copying data to socket buffers), and the application is notified.

Transmit (TX)

  • The application writes data to a memory buffer.
  • The kernel copies the data to the NIC's buffer.
  • The NIC transmits the data over the network.

Direct Memory Access (DMA)

Not to be confused with RDMA, discussed next.

To improve efficiency, modern NICs, along with other hardware such as disk controllers and graphics cards, utilize Direct Memory Access (DMA).

This allows the NIC to directly read and write to system memory without CPU intervention. The NIC has an internal processor (DMA controller) that can manage these memory transfers. While DMA is in progress, the system bus is in use by the DMA controller, briefly making it inaccessible to the CPU. This offloads data transfer tasks from the CPU, allowing it to perform other operations concurrently. 

Remote Direct Memory Access (RDMA)

Remote Direct Memory Access (RDMA) is one of the most important cloud technologies today. Many systems use it internally to speed up their processing.

A Brief History

The conceptual groundwork for RDMA was laid in the early 1990s. InfiniBand emerged as a prominent technology that leveraged RDMA to achieve very high throughput and low latency.

Mellanox Technologies played a crucial role in driving the adoption of InfiniBand and RDMA, particularly in high-performance computing environments.

Efforts began to extend the benefits of RDMA to Ethernet networks, leading to the development of RDMA over Converged Ethernet (RoCE). This technology made RDMA more accessible by enabling its use on standard Ethernet infrastructure.

RDMA is now essential in various applications, including:
  • High-Performance Computing
  • Data centers
  • Cloud computing
  • Storage systems (e.g., NVMe over Fabrics)
  • Machine Learning and AI.
RDMA technologies continue to evolve, with advancements like GPUDirect RDMA, which enables direct data transfer between GPUs and RDMA-capable network adapters.

Hardware RDMA

Hardware RDMA allows the NIC to directly access registered memory regions without CPU intervention. This requires an RDMA-capable NIC: while all NICs support DMA, only some support RDMA. Mellanox NICs are a well-known example of RDMA-capable hardware.

Architecture

In a traditional networking stack, when an application wants to retrieve data from an application on a remote machine, it issues a Remote Procedure Call (RPC). The request traverses the kernel and the NIC on the sender's side, then the NIC and the kernel on the receiver's side. An application thread on the receiver is then invoked to read the requested data, and the response is transmitted back along the same path.
  

With hardware RDMA, by contrast, the NIC is granted direct access to designated memory regions. Remote reads no longer require CPU intervention on the receiver: neither the kernel nor any application thread is involved. The NIC itself reads the requested memory locations and transmits the response back.


     

Implications

Hardware RDMA enables one-sided communication, one of the most valuable primitives in cloud networking.

While hardware RDMA delivers a substantial reduction in network latency, bypassing the operating system kernel and application layers on the receiver side introduces several significant implications:

Increased Complexity in Client Protocol Implementation

Although RDMA-enabled NICs gain direct access to memory regions, they lack the computational capabilities to execute logic. Specifically, they cannot parse and interpret the content of request messages. Consequently, the client application bears the responsibility of translating the request message into specific memory addresses that the NIC can directly read. This necessitates a more intricate client-side protocol, where the client must precisely format its requests to correspond to the target memory layout.
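
To make this concrete, here is a minimal sketch of that client-side responsibility. The table layout, slot format, and the `rdma_read` helper are all assumptions for illustration; the remote region is simulated with a local buffer so the address arithmetic is runnable.

```python
import struct

# Assumed layout: a fixed-size table of (key_hash, value_addr, value_len) slots.
NUM_SLOTS = 1024
SLOT_FMT = "QQQ"                       # three unsigned 64-bit fields
SLOT_SIZE = struct.calcsize(SLOT_FMT)  # 24 bytes

# Stand-in for the server's registered memory region; a real client would
# issue one-sided RDMA reads against remote addresses instead.
remote_memory = bytearray(1 << 20)

def rdma_read(addr, length):
    # Hypothetical one-sided read: no server CPU is involved.
    return bytes(remote_memory[addr:addr + length])

def remote_get(key: bytes):
    # The client, not the server, turns "get(key)" into raw addresses.
    key_hash = hash(key) & (2**64 - 1)
    slot_addr = (key_hash % NUM_SLOTS) * SLOT_SIZE

    stored_hash, value_addr, value_len = struct.unpack(
        SLOT_FMT, rdma_read(slot_addr, SLOT_SIZE))
    if stored_hash != key_hash:
        return None                    # miss (collisions handled elsewhere)
    return rdma_read(value_addr, value_len)
```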

Security Vulnerabilities

Directly exposing memory regions significantly expands the attack surface. The absence of kernel intervention precludes traditional security measures, such as encrypting request and response data with SSL/TLS. This lack of encryption can severely restrict RDMA's applicability, confining it primarily to intra-cluster communication.

To mitigate these security concerns, many RDMA-capable NICs incorporate hardware-based encryption protocols, such as Google's PSP. PSP, in particular, represents a crucial technological advancement beyond its role in RDMA, as it facilitates encryption without requiring CPU involvement. This significantly reduces CPU computational overhead.

Software RDMA

Software RDMA emerged as a response to security concerns associated with hardware RDMA. In contrast to hardware RDMA, software RDMA employs dedicated kernel threads. These threads mediate communication between the NIC and application-registered memory regions.

    

This approach maintains the application thread bypass, eliminating user-space thread scheduling overhead. Additionally, the kernel threads possess the capability to execute logic and transmit data, enabling programmable network operations.

Software RDMA offers a sweet spot between hardware RDMA and the traditional network stack.

Pony Express

Google's Snap microkernel networking architecture leverages a custom scheduler for CPU-NIC interactions, replacing interrupt-driven methods. This allows dedicated CPU cores to fully utilize NIC capacity, removing CPU bottlenecks. Snap's extensibility permits user-defined networking code (engines) to handle network requests on these dedicated cores. Pony Express is one such engine.

Pony Express features a two-layered architecture: an application-facing layer that manages state and interacts with applications, including shared memory registration, and a lower layer that handles message processing and flow control.

Pony Express supports one-sided operations, such as RDMA reads, that extend beyond basic memory access. For example, it can perform hash table lookups if the application registers a hash map data structure.

Applications can utilize dedicated or shared Pony engine cores, communicating via shared command queues.

Performance

While traditional RPCs achieve around 100,000 IOPS per core, Pony threads can reach 5 million IOPS per core.

The throughput of Pony threads is influenced by the network's maximum transmission unit (MTU). Larger MTUs can improve NIC saturation but increase retransmission costs. Evaluations indicate that a single Pony thread can saturate 67 Gbps with a 5 KB MTU and 38 Gbps with a 1500B MTU. Considering modern 100 Gbps NICs, a few Pony threads can effectively saturate the NIC. 

For conservative estimations, applications can assume 10 Gbps per Pony thread, translating to 250,000 IOPS per thread with a 5 KB MTU.
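
As a quick sanity check on that figure, assuming each operation carries one 5 KB MTU-sized payload over the 10 Gbps per-thread budget:

```python
# Back-of-the-envelope check of the 250,000 IOPS/thread estimate above.
link_bps = 10e9                # conservative per-Pony-thread bandwidth
mtu_bytes = 5 * 1000           # 5 KB MTU

ops_per_sec = link_bps / (mtu_bytes * 8)
print(ops_per_sec)             # 250000.0 operations per second
```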

Hardware vs. Software RDMA Comparison

  • Software RDMA can achieve higher user-defined IOPS because the CPU can execute custom logic. For instance, software RDMA can implement custom flow control mechanisms to mitigate fabric pauses, a limitation of hardware RDMA. This is one of those cases where software outperforms hardware, much as in virtualization.
  • Software RDMA offers significantly greater flexibility compared to hardware RDMA. Its reprogrammable nature allows for continuous innovation and adaptation even after deployment.
  • Hardware RDMA, however, eliminates CPU overhead entirely. Software RDMA does incur CPU costs, but in practice they are minimal.

NIC - The Limiting Factor of Modern Networking

In cloud environments, where machines commonly possess over 100 CPU cores, RDMA facilitates rapid NIC saturation. Consequently, the NIC has emerged as the primary bottleneck for network-intensive applications, especially in shared environments. 

This limitation has spurred the development of Terabit Ethernet to significantly increase NIC bandwidth. While 400 Gbps is achievable with current technology, reaching 1+ Tbps necessitates substantial architectural innovations, with Google and Meta heavily invested in this advancement.

Caching

Caching plays a crucial role in optimizing serving systems, which often rely on replication to handle high client request volumes. Caches are essentially volatile key-value stores where both keys and values are stored in memory only. There are no persistence guarantees, so one must not rely on them for data correctness. They should only be used for performance improvements.

Cache Strategies

Two primary caching strategies exist:

In-Memory Cache

This approach embeds the cache directly within each serving process.


Drawbacks:
  • Each replica incurs the full RAM cost of its local cache.
  • Cached data is not shared across replicas, leading to redundant database fetches (e.g., if replica 1 caches key X, replica 2 must still retrieve it independently).

Distributed Cache

A separate, cluster-deployed caching layer is utilized.



Advantages:
  • Reduces memory overhead by storing each key-value pair only once.
  • Enables data sharing across serving replicas.
Disadvantage:
  • Introduces network latency as serving processes must communicate with the cache cluster. Numerous optimization techniques exist to mitigate this latency.

Cache Write Policies

There are three different policies for writing to a cache.

Write-Through

In a write-through cache, every write operation updates both the cache and the backing data store (e.g., database) simultaneously. This ensures data consistency, as the cache always reflects the latest state of the data.

However, write operations can be slower because both the cache and the database must be updated. This policy works well for read-heavy workloads.

Write-Back

With a write-back cache, write operations are initially performed only on the cache. Updates to the backing data store are delayed and performed asynchronously, typically based on a timer or when a cache line is evicted.

This significantly improves write performance. However, it introduces the risk of data loss if the cache fails before the updates are written to the database. This policy works well for write-heavy workloads.

Write-Around

Data is written directly to the backing store, bypassing the cache. This is used to prevent cache pollution, where infrequently written data fills the cache.
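
A minimal sketch of the three policies, assuming a dict-backed cache and a hypothetical `db` object exposing a `put(key, value)` method:

```python
class Cache:
    def __init__(self, db):
        self.db = db          # hypothetical backing store with put(key, value)
        self.data = {}
        self.dirty = set()

    def write_through(self, key, value):
        # Update cache and backing store together; reads stay consistent.
        self.data[key] = value
        self.db.put(key, value)

    def write_back(self, key, value):
        # Update the cache only; flush to the store later (risk of loss).
        self.data[key] = value
        self.dirty.add(key)

    def write_around(self, key, value):
        # Bypass the cache entirely; avoids polluting it with cold writes.
        self.db.put(key, value)

    def flush(self):
        # Asynchronous flush for write-back (e.g., on a timer or eviction).
        for key in self.dirty:
            self.db.put(key, self.data[key])
        self.dirty.clear()
```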

CliqueMap

CliqueMap, a widely used key-value store at Google, builds upon the foundational principles of Pilaf, an RMA-based storage system developed in academia. CliqueMap has become a critical component of Google's infrastructure.

Key architectural choices include:

  • Reads: CliqueMap leverages RDMA for read operations, significantly reducing latency compared to traditional RPC. However, it falls back to RPC in certain scenarios.
  • Writes: All write operations in CliqueMap are handled via RPC.

CliqueMap vs. Memcached

CliqueMap offers several advantages over the industry-standard Memcached:

RDMA-Powered Reads

Unlike Memcached, which relies solely on RPC, CliqueMap utilizes RDMA for reads. At scale, RPC introduces considerable overhead (authentication, flow control, etc.), impacting performance. RDMA provides a more efficient approach, especially for read-heavy workloads, common in caching scenarios.

Native Replication

Memcached lacks built-in replication, requiring external solutions (e.g., Redis wrappers) for redundancy. CliqueMap is designed with native replication capabilities, enhancing data availability and fault tolerance.

Data Organization

A CliqueMap cluster employs multiple replicas. Each replica organizes data into distinct regions: a data region and an index region.

Data Region

The data region is segmented into slabs, and a slab-based allocator manages memory for value storage. Slab sizes are configurable to optimize for specific workloads. The memory allocated for slabs can be dynamically resized by the CliqueMap server threads (via RPC writes) to accommodate growing data.

Each data entry in the data region comprises:
  • Format: A versioning identifier.
  • Key Length: Specifies the size of the key.
  • Key: The key itself.
  • Data Length: Specifies the size of the value.
  • Data: The value associated with the key.
  • Metadata: Includes the version number, ensuring data consistency.
  • Checksum: A checksum of the entire entry, crucial for data integrity.

Index Region

The index region organizes key-value pairs into buckets, forming a set-associative structure.

Each index entry contains:
  • KeyHash: The hash of the key, used for bucket assignment.
  • VersionNumber: A unique, monotonically increasing version number that's updated on every write.
  • Pointer: The memory address of the corresponding value in the data region.
The KeyHash determines the bucket to which a key-value pair belongs, enabling efficient lookups.
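
A rough sketch of the two entry layouts described above (field names and types are assumptions; the real on-disk/on-wire format is identified by the Format field):

```python
from dataclasses import dataclass

@dataclass
class DataEntry:        # stored in a slab in the data region
    fmt: int            # Format: versioning identifier for the layout
    key_len: int
    key: bytes
    data_len: int
    data: bytes
    version: int        # metadata: version number for consistency checks
    checksum: int       # covers the whole entry; verified by clients

@dataclass
class IndexEntry:       # stored in a set-associative bucket in the index region
    key_hash: int       # selects the bucket
    version: int        # monotonically increasing, bumped on every write
    pointer: int        # remote address of the DataEntry in the data region
```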

Consistent Hashing

Key-value pairs are partitioned across the replicas based on the hash of the key. This is a natural fit for consistent hashing, although the paper doesn't specify which partitioning algorithm CliqueMap uses internally.
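
For illustration, here is a minimal consistent-hashing ring (not CliqueMap's actual scheme): each backend owns the arc of the ring up to its position, so adding or removing a backend only remaps keys on the neighbouring arc.

```python
import bisect
import hashlib

def _h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=100):
        # Each node gets many virtual points to smooth the key distribution.
        self.points = sorted((_h(f"{n}#{i}"), n)
                             for n in nodes for i in range(vnodes))
        self.hashes = [p[0] for p in self.points]

    def owner(self, key: str) -> str:
        # The first point clockwise from the key's hash owns the key.
        i = bisect.bisect(self.hashes, _h(key)) % len(self.points)
        return self.points[i][1]

ring = Ring(["backend-0", "backend-1", "backend-2"])
print(ring.owner("user:42"))
```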

Replication

CliqueMap utilizes quorum replication, supporting:

  • R=1: No replication (single replica).
  • R=2: Replication across two replicas (for immutable key-value pairs).
  • R=3.2: Replication across three replicas, requiring a majority of two for operations.
Given its complexity and reliability, the remainder of this article will focus exclusively on the R=3.2 mode.

Reading a Key-Value Pair (GET)

Reading a key-value pair in CliqueMap involves the following steps:

  • The key is hashed. The corresponding replica backends are identified based on the hash.
  • The key-value pair is retrieved from those replicas using RDMA.
  • Version numbers of the retrieved values are compared to ensure consistency.
    • In R=2 mode, versions must match.
    • In R=3.2 mode, a majority (2 out of 3) must match. The majority version number is the consistent version number.
  • Finally, the client verifies the checksum of the fetched data. The retrieved key is compared to the original key to mitigate potential (though rare) hash collisions.
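
A client-side sketch of these steps under R=3.2. Here `read_index` and `read_value` stand in for one-sided reads (e.g., Pony Express operations) and are passed in as callables; the returned objects are assumed to carry `version`, `pointer`, `key`, `data`, and a `checksum_ok()` helper.

```python
from collections import Counter

def quorum_get(key, replicas, read_index, read_value):
    # Step 1: read the index entry for `key` on every replica.
    entries = {r: read_index(r, key) for r in replicas}

    # Step 2: a majority (2 of 3) must agree on the version number.
    counts = Counter(e.version for e in entries.values() if e)
    if not counts or counts.most_common(1)[0][1] < 2:
        return None                     # no quorum: fall back to RPC
    version = counts.most_common(1)[0][0]

    # Step 3: fetch the value from a replica holding the quorum version, then
    # verify the checksum and that the stored key matches (collisions are rare
    # but possible).
    for replica, entry in entries.items():
        if entry and entry.version == version:
            value = read_value(replica, entry.pointer, entry)
            if value.checksum_ok() and value.key == key:
                return value.data
    return None
```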

Read Strategies

CliqueMap employs several strategies for reads:

2xR (Two Round Trips)

Requires two RDMA round trips:
  • Retrieving the bucket from the index region of all replicas using the key hash.
  • Fetching the actual value from the preferred backend's data region.
The index entries are used to compare version numbers, and the value is then fetched from the data region of the preferred backend. This is optimal for large value sizes.

Pony Express is used for one-sided communication in both steps. 

SCAR (Scan and Read)

Performs a single RDMA round trip. Pony Express reads the index region, retrieves the data region pointer, and returns the data entry. The Pony thread needs to execute some custom logic to make this possible. This consumes more server-side CPU but reduces latency. However, the value will be returned by all the replicas. This is preferable for small value sizes.

RPC Fallback

If RDMA reads cannot provide the value (as in the case of overflow buckets, described later), the client falls back to a traditional RPC. This is significantly slower than 2xR and SCAR.

Relaxed Reads

CliqueMap also offers relaxed reads, which prioritize performance over consistency. In this mode, the value is fetched from a single replica. This saves both server-side computation and client-side bandwidth consumption. However, the retrieved value may be outdated. 

Essentially, this mode trades linearizability for reduced latency.

Writing to a Key-Value Pair (SET)

Total Order on Mutations

CliqueMap ensures total ordering of key-value pair updates, leveraging Google's TrueTime, a distributed time synchronization system. Each mutation is assigned a unique, monotonically increasing version number in the format <TrueTime, Client ID, Sequence Number>.

This versioning system means that mutation order is determined by the version number, not necessarily the arrival time of requests. While this might seem to violate consistency, TrueTime guarantees that clients eventually generate higher version numbers for subsequent requests, maintaining consistency.
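
A small sketch of version generation and comparison under the assumed <TrueTime, Client ID, Sequence Number> format. `time.time_ns()` stands in for a TrueTime read; the real system would use TrueTime's bounded-error clock.

```python
from dataclasses import dataclass
import itertools
import time

@dataclass(order=True, frozen=True)
class Version:
    truetime_ns: int    # dominant component: timestamp
    client_id: int      # breaks ties between clients
    sequence: int       # breaks ties within a single client

_seq = itertools.count()

def next_version(client_id: int) -> Version:
    return Version(time.time_ns(), client_id, next(_seq))

# Lexicographic (tuple) comparison yields a total order over all mutations:
assert Version(10, 1, 0) < Version(10, 2, 0) < Version(11, 1, 5)
```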

Erase Operations

The erase operation removes a key-value pair from the cache. To prevent conflicts with concurrent mutations, erase also uses version numbers. After an erase, the key-value pair is moved to a fully-associative tombstone cache. This ensures that reads for older versions of the key-value pair can still be served.

Compare-And-Swap (CAS)

CAS is a crucial primitive for CliqueMap. It performs a mutation only if the current version number of the key-value pair matches a provided version number. This is essential for implementing transactional behavior on single key-value pairs (CliqueMap, like many key-value stores, doesn't support multi-key transactions).

The client reads the current value and its version number, and the subsequent mutation succeeds only if the version numbers match. This enables atomic updates and prevents race conditions.
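
A sketch of an atomic read-modify-write built on CAS. `cache.get(key)` is assumed to return `(value, version)` and `cache.cas(...)` to apply the write only if the stored version still matches; both are hypothetical client-library calls.

```python
def increment_counter(cache, key, retries=10):
    for _ in range(retries):
        value, version = cache.get(key)
        if cache.cas(key, value + 1, expected_version=version):
            return value + 1            # our write won the race
        # Another client mutated the key first; re-read and retry.
    raise RuntimeError(f"too much contention on {key!r}")
```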

Cache Evictions

As a volatile key-value store, CliqueMap can lose data without impacting system correctness. Eviction, the process of removing key-value pairs, is essential for accommodating new key-value pairs.

Eviction Types

CliqueMap employs two primary eviction types:

  • Capacity Evictions: These occur when the cache reaches its memory capacity and needs to free space for new key-value pairs.
  • Associativity Evictions: Due to the set-associative bucket structure, each bucket has a limited number of index entries. If numerous keys map to the same bucket, some keys must be evicted to make room for new entries.

Overflow Buckets

It is possible to address associativity evictions using overflow buckets. When a bucket overflows, new key-value pairs are stored in these overflow buckets. While this avoids immediate eviction, accessing data in overflow buckets incurs an RPC overhead.

Eviction Policies

CliqueMap supports standard eviction policies, such as Least Recently Used (LRU). However, the RDMA-based read operations present a challenge for tracking key access times.

To address this, clients communicate key access times to the server backends via separate batch RPCs. This allows the server to approximate LRU behavior despite the use of RDMA for reads.
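
A sketch of how such client-side batching might look. The flush RPC (`report_access_times`) is a hypothetical endpoint; the point is that one RPC is amortized over many RDMA reads.

```python
import time

class AccessReporter:
    def __init__(self, backend, flush_every=1000):
        self.backend = backend
        self.flush_every = flush_every
        self.pending = {}                     # key -> last access timestamp

    def record(self, key):
        self.pending[key] = time.time()       # called on every RDMA read
        if len(self.pending) >= self.flush_every:
            self.flush()

    def flush(self):
        # One batch RPC instead of one RPC per read, so the server can still
        # approximate LRU ordering.
        self.backend.report_access_times(self.pending)
        self.pending = {}
```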

Quorum Repairs

In R=3.2 mode, each key is replicated across three replicas. The loss of one replica results in a dirty quorum, while the loss of two leads to an inquorate state. CliqueMap implements on-demand quorum repair to recover key-value pairs in these scenarios.

Quorum repair is also triggered when a replica restarts.

The repair process involves the following steps:

  1. Replicas scan their cohorts (the other replicas holding the same key) to determine the current version of the key-value pairs.
  2. If version numbers don't match, replicas fetch the latest version of the key.

This mechanism is also valuable for updating stale replicas, which might have missed a mutation (since R=3.2 only requires two successful writes out of three).
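
A sketch of on-demand repair for a single key across a three-replica cohort. `index_version`, `read_entry`, and `write_entry` are assumed replica-side helpers, with the actual data movement happening over RPC.

```python
def repair_key(cohort, key):
    # Ask every replica in the cohort for its current version of `key`.
    versions = {r: r.index_version(key) for r in cohort}
    known = [v for v in versions.values() if v is not None]
    if not known:
        return                          # the key is gone everywhere

    latest = max(known)
    source = next(r for r, v in versions.items() if v == latest)

    # Any replica that missed a mutation copies the newest entry.
    for replica, version in versions.items():
        if version != latest:
            replica.write_entry(key, source.read_entry(key))
```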

Evaluation

The authors conducted a comprehensive evaluation of the caching system, focusing on key performance metrics, particularly tail latency (p95, p99, p999). Minimizing tail latency is crucial for caching systems, as they underpin downstream applications, directly impacting overall system performance.

Key evaluation findings include:

  • At 3 million operations per second (ops/sec), the p999 latency was ~10 ms.
  • R=3.2 mode demonstrated superior performance compared to R=1, particularly in tail latency. This is attributed to its ability to select the optimal server for read operations.
  • 2xR outperformed SCAR for large value sizes.
  • Data migration during quorum repair, performed via RPCs, introduced latency overhead.
  • CliqueMap is optimized for read-heavy workloads. Higher read percentages resulted in lower observed latencies.
  • Hardware vs. Software RDMA:
    • Pony Express demonstrated robust throughput and low latency even under high client load.
    • Hardware RDMA significantly outperformed software RDMA, even when limited to the 2xR strategy (hardware RDMA only supported 2xR).

Paper Review

This paper, from Google (with academic collaborators), describes a highly compelling caching system. It stands out as one of the most promising solutions I've encountered in academic literature. I'm hopeful it will be offered as a Google Cloud product. Beyond its caching innovations, the paper provides an accessible introduction to RDMA, a technology with significant cloud implications. I found it particularly insightful, even compared to similar works from companies like Facebook and Twitter.
