
Paper Insights #25 - CliqueMap: Productionizing an RMA-Based Distributed Caching System

Memcached is a popular in-memory cache, but I'd like to discuss CliqueMap, Google's caching solution. Having worked closely with CliqueMap, I have a deep understanding of its architecture. One major difference from Memcached is CliqueMap's use of RMA for reads. We'll also take a closer look at RDMA, a crucial cloud technology that emerged in the 2010s.

Paper Link

Let's begin with some basic concepts.

Network Interface Card (NIC)

The NIC facilitates data reception and transmission. Understanding its operation requires examining the fundamental interaction between the CPU and memory.

CPU <-> Memory Communication

In the von Neumann architecture, the CPU and memory are the core components that together enable general-purpose (Turing-complete) computation. They communicate over the system bus (e.g., PCIe), a set of electrical pathways connecting the CPU, memory, and I/O devices. The system bus comprises three primary logical components:

  • Data Bus: Bidirectional, carrying the actual data being transferred.
  • Address Bus: Unidirectional, carrying memory addresses for data access.
  • Control Bus: Carrying control signals for synchronization and coordination.



Modern systems employ hierarchical bus architectures to enhance communication speed, moving beyond the simple single-bus model.

NIC Interactions

The NIC interfaces directly with the system bus and incorporates an internal buffer memory.


Receive (RX)

  • The NIC buffers incoming data into its internal memory.
  • When its buffer fills up, the NIC generates a maskable interrupt to the CPU.
  • The kernel (i.e., operating-system code running on the CPU) responds to the interrupt by copying the data from the NIC's buffer into RAM.
  • Higher-layer protocols then process the data (e.g., TCP/IP copying data to socket buffers), and the application is notified.

Transmit (TX)

  • The application writes data to a memory buffer.
  • The kernel copies the data to the NIC's buffer.
  • The NIC transmits the data over the network.

Direct Memory Access (DMA)

Not to be confused with RDMA, discussed next.

To improve efficiency, modern NICs, along with other hardware such as disk controllers and graphics cards, utilize Direct Memory Access (DMA).

This allows the NIC to directly read and write to system memory without CPU intervention. The NIC has an internal processor (DMA controller) that can manage these memory transfers. While DMA is in progress, the system bus is in use by the DMA controller, briefly making it inaccessible to the CPU. This offloads data transfer tasks from the CPU, allowing it to perform other operations concurrently. 

Remote Direct Memory Access (RDMA)

Remote Direct Memory Access (RDMA) is one of the most important cloud technologies today. Many systems use it internally to speed up their processing.

A Brief History

The conceptual groundwork for RDMA was laid in the early 1990s. InfiniBand emerged as a prominent technology that leveraged RDMA to achieve very high throughput and low latency.

Mellanox Technologies played a crucial role in driving the adoption of InfiniBand and RDMA, particularly in high-performance computing environments.

Efforts began to extend the benefits of RDMA to Ethernet networks, leading to the development of RDMA over Converged Ethernet (RoCE). This technology made RDMA more accessible by enabling its use on standard Ethernet infrastructure.

RDMA is now essential in various applications, including:
  • High-Performance Computing
  • Data centers
  • Cloud computing
  • Storage systems (e.g., NVMe over Fabrics)
  • Machine Learning and AI.
RDMA technologies continue to evolve, with advancements like GPUDirect RDMA, which enables direct data transfer between GPUs and RDMA-capable network adapters.

Hardware RDMA

Hardware RDMA allows the NIC to directly access registered memory regions without CPU intervention. This requires an RDMA-capable NIC: while all NICs support DMA, only some support RDMA. Mellanox NICs are a well-known example of RDMA-capable hardware.

Architecture

In a traditional networking stack, when an application wants to retrieve data from an application on a remote machine, it issues a Remote Procedure Call (RPC). The request traverses the kernel and the NIC on the sender's side, then the NIC and the kernel on the receiver's side. An application thread on the receiver is then invoked to read the requested data, and the response is transmitted back along the same path.
  

With hardware RDMA, by contrast, the NIC is granted direct access to designated memory regions. Remote reads no longer require CPU intervention on the receiver: neither the kernel nor any application thread is involved. The NIC itself reads the requested memory locations and transmits the response back.


     

Implications

Hardware RDMA enables one-sided communication, one of the most valuable primitives in cloud networking.

While hardware RDMA delivers a substantial reduction in network latency, bypassing the operating system kernel and application layers on the receiver side introduces several significant implications:

Increased Complexity in Client Protocol Implementation

Although RDMA-enabled NICs gain direct access to memory regions, they lack the computational capabilities to execute logic. Specifically, they cannot parse and interpret the content of request messages. Consequently, the client application bears the responsibility of translating the request message into specific memory addresses that the NIC can directly read. This necessitates a more intricate client-side protocol, where the client must precisely format its requests to correspond to the target memory layout.
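
To make this concrete, here is a minimal sketch of that client-side responsibility. The table layout, slot format, and the `rdma_read` helper are all assumptions for illustration; the remote region is simulated with a local buffer so the address arithmetic is runnable.

```python
import struct

# Assumed layout: a fixed-size table of (key_hash, value_addr, value_len) slots.
NUM_SLOTS = 1024
SLOT_FMT = "QQQ"                       # three unsigned 64-bit fields
SLOT_SIZE = struct.calcsize(SLOT_FMT)  # 24 bytes

# Stand-in for the server's registered memory region; a real client would
# issue one-sided RDMA reads against remote addresses instead.
remote_memory = bytearray(1 << 20)

def rdma_read(addr, length):
    # Hypothetical one-sided read: no server CPU is involved.
    return bytes(remote_memory[addr:addr + length])

def remote_get(key: bytes):
    # The client, not the server, turns "get(key)" into raw addresses.
    key_hash = hash(key) & (2**64 - 1)
    slot_addr = (key_hash % NUM_SLOTS) * SLOT_SIZE

    stored_hash, value_addr, value_len = struct.unpack(
        SLOT_FMT, rdma_read(slot_addr, SLOT_SIZE))
    if stored_hash != key_hash:
        return None                    # miss (collisions handled elsewhere)
    return rdma_read(value_addr, value_len)
```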

Security Vulnerabilities

Directly exposing memory regions significantly expands the attack surface. The absence of kernel intervention precludes traditional security measures, such as encrypting request and response data with SSL/TLS. This lack of encryption can severely restrict RDMA's applicability, confining it primarily to intra-cluster communication.

To mitigate these security concerns, many RDMA-capable NICs incorporate hardware-based encryption protocols, such as Google's PSP. PSP, in particular, represents a crucial technological advancement beyond its role in RDMA, as it facilitates encryption without requiring CPU involvement. This significantly reduces CPU computational overhead.

Software RDMA

Software RDMA emerged as a response to security concerns associated with hardware RDMA. In contrast to hardware RDMA, software RDMA employs dedicated kernel threads. These threads mediate communication between the NIC and application-registered memory regions.

    

This approach maintains the application thread bypass, eliminating user-space thread scheduling overhead. Additionally, the kernel threads possess the capability to execute logic and transmit data, enabling programmable network operations.

Software RDMA offers a sweet spot between hardware RDMA and the traditional network stack.

Pony Express

Google's Snap microkernel networking architecture leverages a custom scheduler for CPU-NIC interactions, replacing interrupt-driven methods. This allows dedicated CPU cores to fully utilize NIC capacity, removing CPU bottlenecks. Snap's extensibility permits user-defined networking code (engines) to handle network requests on these dedicated cores. Pony Express is one such engine.

Pony Express features a two-layered architecture: an application-facing layer that manages state and interacts with applications, including shared memory registration, and a lower layer that handles message processing and flow control.

Pony Express supports one-sided operations, such as RDMA reads, that extend beyond basic memory access. For example, it can perform hash table lookups if the application registers a hash map data structure.

Applications can utilize dedicated or shared Pony engine cores, communicating via shared command queues.

Performance

While traditional RPCs achieve around 100,000 IOPS per core, Pony threads can reach 5 million IOPS per core.

The throughput of Pony threads is influenced by the network's maximum transmission unit (MTU). Larger MTUs can improve NIC saturation but increase retransmission costs. Evaluations indicate that a single Pony thread can saturate 67 Gbps with a 5 KB MTU and 38 Gbps with a 1500B MTU. Considering modern 100 Gbps NICs, a few Pony threads can effectively saturate the NIC. 

For conservative estimations, applications can assume 10 Gbps per Pony thread, translating to 250,000 IOPS per thread with a 5 KB MTU.
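
As a quick sanity check on that figure, assuming each operation carries one 5 KB MTU-sized payload over the 10 Gbps per-thread budget:

```python
# Back-of-the-envelope check of the 250,000 IOPS/thread estimate above.
link_bps = 10e9                # conservative per-Pony-thread bandwidth
mtu_bytes = 5 * 1000           # 5 KB MTU

ops_per_sec = link_bps / (mtu_bytes * 8)
print(ops_per_sec)             # 250000.0 operations per second
```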

Hardware vs. Software RDMA Comparison

  • Software RDMA can achieve higher user-defined IOPS because the CPU can execute custom logic. For instance, software RDMA can implement custom flow control mechanisms to mitigate fabric pauses, a limitation of hardware RDMA. This is one of those cases where software outperforms hardware, much as in virtualization.
  • Software RDMA offers significantly greater flexibility compared to hardware RDMA. Its reprogrammable nature allows for continuous innovation and adaptation even after deployment.
  • Hardware RDMA, however, eliminates CPU overhead entirely. Software RDMA does incur CPU costs, but in practice they are minimal.

NIC - The Limiting Factor of Modern Networking

In cloud environments, where machines commonly possess over 100 CPU cores, RDMA facilitates rapid NIC saturation. Consequently, the NIC has emerged as the primary bottleneck for network-intensive applications, especially in shared environments. 

This limitation has spurred the development of Terabit Ethernet to significantly increase NIC bandwidth. While 400 Gbps is achievable with current technology, reaching 1+ Tbps necessitates substantial architectural innovations, with Google and Meta heavily invested in this advancement.

Caching

Caching plays a crucial role in optimizing serving systems, which often rely on replication to handle high client request volumes. Caches are essentially volatile key-value stores where both keys and values are stored in memory only. There are no persistence guarantees, so one must not rely on them for data correctness. They should only be used for performance improvements.

Cache Strategies

Two primary caching strategies exist:

In-Memory Cache

This approach embeds the cache directly within each serving process.


Drawbacks:
  • Each replica incurs the full RAM cost of its local cache.
  • Cached data is not shared across replicas, leading to redundant database fetches (e.g., if replica 1 caches key X, replica 2 must still retrieve it independently).

Distributed Cache

A separate, cluster-deployed caching layer is utilized.



Advantages:
  • Reduces memory overhead by storing each key-value pair only once.
  • Enables data sharing across serving replicas.
Disadvantage:
  • Introduces network latency as serving processes must communicate with the cache cluster. Numerous optimization techniques exist to mitigate this latency.

Cache Write Policies

There are three different policies for writing to a cache.

Write-Through

In a write-through cache, every write operation updates both the cache and the backing data store (e.g., database) simultaneously. This ensures data consistency, as the cache always reflects the latest state of the data.

However, write operations can be slower because both the cache and the database must be updated. This policy works well for read-heavy workloads.

Write-Back

With a write-back cache, write operations are initially performed only on the cache. Updates to the backing data store are delayed and performed asynchronously, typically based on a timer or when a cache line is evicted.

This significantly improves write performance. However, it introduces the risk of data loss if the cache fails before the updates are written to the database. This policy works well for write-heavy workloads.

Write-Around

Data is written directly to the backing store, bypassing the cache. This is used to prevent cache pollution, where infrequently written data fills the cache.
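
A minimal sketch of the three policies, assuming a dict-backed cache and a hypothetical `db` object exposing a `put(key, value)` method:

```python
class Cache:
    def __init__(self, db):
        self.db = db          # hypothetical backing store with put(key, value)
        self.data = {}
        self.dirty = set()

    def write_through(self, key, value):
        # Update cache and backing store together; reads stay consistent.
        self.data[key] = value
        self.db.put(key, value)

    def write_back(self, key, value):
        # Update the cache only; flush to the store later (risk of loss).
        self.data[key] = value
        self.dirty.add(key)

    def write_around(self, key, value):
        # Bypass the cache entirely; avoids polluting it with cold writes.
        self.db.put(key, value)

    def flush(self):
        # Asynchronous flush for write-back (e.g., on a timer or eviction).
        for key in self.dirty:
            self.db.put(key, self.data[key])
        self.dirty.clear()
```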

CliqueMap

CliqueMap, a widely used key-value store at Google, builds upon the foundational principles of Pilaf, an RMA-based storage system developed in academia. CliqueMap has become a critical component of Google's infrastructure.

Key architectural choices include:

  • Reads: CliqueMap leverages RDMA for read operations, significantly reducing latency compared to traditional RPC. However, it falls back to RPC in certain scenarios.
  • Writes: All write operations in CliqueMap are handled via RPC.

CliqueMap vs. Memcached

CliqueMap offers several advantages over the industry-standard Memcached:

RDMA-Powered Reads

Unlike Memcached, which relies solely on RPC, CliqueMap utilizes RDMA for reads. At scale, RPC introduces considerable overhead (authentication, flow control, etc.), impacting performance. RDMA provides a more efficient approach, especially for read-heavy workloads, common in caching scenarios.

Native Replication

Memcached lacks built-in replication, requiring external solutions (e.g., Redis wrappers) for redundancy. CliqueMap is designed with native replication capabilities, enhancing data availability and fault tolerance.

Data Organization

A CliqueMap cluster employs multiple replicas. Each replica organizes data into distinct regions: a data region and an index region.

Data Region

The data region is segmented into slabs, and a slab-based allocator manages memory for value storage. Slab sizes are configurable to optimize for specific workloads. The memory allocated for slabs can be dynamically resized by the CliqueMap server threads (via RPC writes) to accommodate growing data.

Each data entry in the data region comprises:
  • Format: A versioning identifier.
  • Key Length: Specifies the size of the key.
  • Key: The key itself.
  • Data Length: Specifies the size of the value.
  • Data: The value associated with the key.
  • Metadata: Includes the version number, ensuring data consistency.
  • Checksum: A checksum of the entire entry, crucial for data integrity.

Index Region

The index region organizes key-value pairs into buckets, forming a set-associative structure.

Each index entry contains:
  • KeyHash: The hash of the key, used for bucket assignment.
  • VersionNumber: A unique, monotonically increasing version number that's updated on every write.
  • Pointer: The memory address of the corresponding value in the data region.
The KeyHash determines the bucket to which a key-value pair belongs, enabling efficient lookups.
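
A rough sketch of the two entry layouts described above (field names and types are assumptions; the real on-disk/on-wire format is identified by the Format field):

```python
from dataclasses import dataclass

@dataclass
class DataEntry:        # stored in a slab in the data region
    fmt: int            # Format: versioning identifier for the layout
    key_len: int
    key: bytes
    data_len: int
    data: bytes
    version: int        # metadata: version number for consistency checks
    checksum: int       # covers the whole entry; verified by clients

@dataclass
class IndexEntry:       # stored in a set-associative bucket in the index region
    key_hash: int       # selects the bucket
    version: int        # monotonically increasing, bumped on every write
    pointer: int        # remote address of the DataEntry in the data region
```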

Consistent Hashing

Key-value pairs are partitioned across the replicas based on the hash of the key. This is a natural fit for consistent hashing, although the paper doesn't specify which partitioning algorithm CliqueMap uses internally.
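
For illustration, here is a minimal consistent-hashing ring (not CliqueMap's actual scheme): each backend owns the arc of the ring up to its position, so adding or removing a backend only remaps keys on the neighbouring arc.

```python
import bisect
import hashlib

def _h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=100):
        # Each node gets many virtual points to smooth the key distribution.
        self.points = sorted((_h(f"{n}#{i}"), n)
                             for n in nodes for i in range(vnodes))
        self.hashes = [p[0] for p in self.points]

    def owner(self, key: str) -> str:
        # The first point clockwise from the key's hash owns the key.
        i = bisect.bisect(self.hashes, _h(key)) % len(self.points)
        return self.points[i][1]

ring = Ring(["backend-0", "backend-1", "backend-2"])
print(ring.owner("user:42"))
```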

Replication

CliqueMap utilizes quorum replication, supporting:

  • R=1: No replication (single replica).
  • R=2: Replication across two replicas (for immutable key-value pairs).
  • R=3.2: Replication across three replicas, requiring a majority of two for operations.
Given its complexity and reliability, the remainder of this article will focus exclusively on the R=3.2 mode.

Reading a Key-Value Pair (GET)

Reading a key-value pair in CliqueMap involves the following steps:

  • The key is hashed. The corresponding replica backends are identified based on the hash.
  • The key-value pair is retrieved from those replicas using RDMA.
  • Version numbers of the retrieved values are compared to ensure consistency.
    • In R=2 mode, versions must match.
    • In R=3.2 mode, a majority (2 out of 3) must match. The majority version number is the consistent version number.
  • Finally, the client verifies the checksum of the fetched data. The retrieved key is compared to the original key to mitigate potential (though rare) hash collisions.
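
A client-side sketch of these steps under R=3.2. Here `read_index` and `read_value` stand in for one-sided reads (e.g., Pony Express operations) and are passed in as callables; the returned objects are assumed to carry `version`, `pointer`, `key`, `data`, and a `checksum_ok()` helper.

```python
from collections import Counter

def quorum_get(key, replicas, read_index, read_value):
    # Step 1: read the index entry for `key` on every replica.
    entries = {r: read_index(r, key) for r in replicas}

    # Step 2: a majority (2 of 3) must agree on the version number.
    counts = Counter(e.version for e in entries.values() if e)
    if not counts or counts.most_common(1)[0][1] < 2:
        return None                     # no quorum: fall back to RPC
    version = counts.most_common(1)[0][0]

    # Step 3: fetch the value from a replica holding the quorum version, then
    # verify the checksum and that the stored key matches (collisions are rare
    # but possible).
    for replica, entry in entries.items():
        if entry and entry.version == version:
            value = read_value(replica, entry.pointer, entry)
            if value.checksum_ok() and value.key == key:
                return value.data
    return None
```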

Read Strategies

CliqueMap employs several strategies for reads:

2xR (Two Round Trips)

Requires two RDMA round trips:
  • Retrieving the bucket from the index region of all replicas using the key hash.
  • Fetching the actual value from the preferred backend's data region.
The index entries are used to compare version numbers, and the value is then fetched from the data region of the preferred backend. This is optimal for large value sizes.

Pony Express is used for one-sided communication in both steps. 

SCAR (Scan and Read)

Performs a single RDMA round trip. Pony Express reads the index region, retrieves the data region pointer, and returns the data entry. The Pony thread needs to execute some custom logic to make this possible. This consumes more server-side CPU but reduces latency. However, the value will be returned by all the replicas. This is preferable for small value sizes.

RPC Fallback

If RDMA reads cannot provide the value (as in the case of overflow buckets, described later), the client falls back to a traditional RPC. This is significantly slower than 2xR and SCAR.

Relaxed Reads

CliqueMap also offers relaxed reads, which prioritize performance over consistency. In this mode, the value is fetched from a single replica. This saves both server-side computation and client-side bandwidth consumption. However, the retrieved value may be outdated. 

Essentially, this mode trades linearizability for reduced latency.

Writing to a Key-Value Pair (SET)

Total Order on Mutations

CliqueMap ensures total ordering of key-value pair updates, leveraging Google's TrueTime, a distributed time synchronization system. Each mutation is assigned a unique, monotonically increasing version number in the format <TrueTime, Client ID, Sequence Number>.

This versioning system means that mutation order is determined by the version number, not necessarily the arrival time of requests. While this might seem to violate consistency, TrueTime guarantees that clients eventually generate higher version numbers for subsequent requests, maintaining consistency.
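
A small sketch of version generation and comparison under the assumed <TrueTime, Client ID, Sequence Number> format. `time.time_ns()` stands in for a TrueTime read; the real system would use TrueTime's bounded-error clock.

```python
from dataclasses import dataclass
import itertools
import time

@dataclass(order=True, frozen=True)
class Version:
    truetime_ns: int    # dominant component: timestamp
    client_id: int      # breaks ties between clients
    sequence: int       # breaks ties within a single client

_seq = itertools.count()

def next_version(client_id: int) -> Version:
    return Version(time.time_ns(), client_id, next(_seq))

# Lexicographic (tuple) comparison yields a total order over all mutations:
assert Version(10, 1, 0) < Version(10, 2, 0) < Version(11, 1, 5)
```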

Erase Operations

The erase operation removes a key-value pair from the cache. To prevent conflicts with concurrent mutations, erase also uses version numbers. After an erase, the key-value pair is moved to a fully-associative tombstone cache. This ensures that reads for older versions of the key-value pair can still be served.

Compare-And-Swap (CAS)

CAS is a crucial primitive for CliqueMap. It performs a mutation only if the current version number of the key-value pair matches a provided version number. This is essential for implementing transactional behavior on single key-value pairs (CliqueMap, like many key-value stores, doesn't support multi-key transactions).

The client reads the current value and its version number, and the subsequent mutation succeeds only if the version numbers match. This enables atomic updates and prevents race conditions.
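
A sketch of an atomic read-modify-write built on CAS. `cache.get(key)` is assumed to return `(value, version)` and `cache.cas(...)` to apply the write only if the stored version still matches; both are hypothetical client-library calls.

```python
def increment_counter(cache, key, retries=10):
    for _ in range(retries):
        value, version = cache.get(key)
        if cache.cas(key, value + 1, expected_version=version):
            return value + 1            # our write won the race
        # Another client mutated the key first; re-read and retry.
    raise RuntimeError(f"too much contention on {key!r}")
```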

Cache Evictions

As a volatile key-value store, CliqueMap can lose data without impacting system correctness. Eviction, the process of removing key-value pairs, is essential for accommodating new key-value pairs.

Eviction Types

CliqueMap employs two primary eviction types:

  • Capacity Evictions: These occur when the cache reaches its memory capacity and needs to free space for new key-value pairs.
  • Associativity Evictions: Due to the set-associative bucket structure, each bucket has a limited number of index entries. If numerous keys map to the same bucket, some keys must be evicted to make room for new entries.

Overflow Buckets

It is possible to address associativity evictions using overflow buckets. When a bucket overflows, new key-value pairs are stored in these overflow buckets. While this avoids immediate eviction, accessing data in overflow buckets incurs an RPC overhead.

Eviction Policies

CliqueMap supports standard eviction policies, such as Least Recently Used (LRU). However, the RDMA-based read operations present a challenge for tracking key access times.

To address this, clients communicate key access times to the server backends via separate batch RPCs. This allows the server to approximate LRU behavior despite the use of RDMA for reads.
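
A sketch of how such client-side batching might look. The flush RPC (`report_access_times`) is a hypothetical endpoint; the point is that one RPC is amortized over many RDMA reads.

```python
import time

class AccessReporter:
    def __init__(self, backend, flush_every=1000):
        self.backend = backend
        self.flush_every = flush_every
        self.pending = {}                     # key -> last access timestamp

    def record(self, key):
        self.pending[key] = time.time()       # called on every RDMA read
        if len(self.pending) >= self.flush_every:
            self.flush()

    def flush(self):
        # One batch RPC instead of one RPC per read, so the server can still
        # approximate LRU ordering.
        self.backend.report_access_times(self.pending)
        self.pending = {}
```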

Quorum Repairs

In R=3.2 mode, each key is replicated across three replicas. The loss of one replica results in a dirty quorum, while the loss of two leads to an inquorate state. CliqueMap implements on-demand quorum repair to recover key-value pairs in these scenarios.

Quorum repair is also triggered when a replica restarts.

The repair process involves the following steps:

  1. Replicas scan their cohorts (the other replicas holding the same key) to determine the current version of the key-value pairs.
  2. If version numbers don't match, replicas fetch the latest version of the key.

This mechanism is also valuable for updating stale replicas, which might have missed a mutation (since R=3.2 only requires two successful writes out of three).
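
A sketch of on-demand repair for a single key across a three-replica cohort. `index_version`, `read_entry`, and `write_entry` are assumed replica-side helpers, with the actual data movement happening over RPC.

```python
def repair_key(cohort, key):
    # Ask every replica in the cohort for its current version of `key`.
    versions = {r: r.index_version(key) for r in cohort}
    known = [v for v in versions.values() if v is not None]
    if not known:
        return                          # the key is gone everywhere

    latest = max(known)
    source = next(r for r, v in versions.items() if v == latest)

    # Any replica that missed a mutation copies the newest entry.
    for replica, version in versions.items():
        if version != latest:
            replica.write_entry(key, source.read_entry(key))
```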

Evaluation

The authors conducted a comprehensive evaluation of the caching system, focusing on key performance metrics, particularly tail latency (p95, p99, p999). Minimizing tail latency is crucial for caching systems, as they underpin downstream applications, directly impacting overall system performance.

Key evaluation findings include:

  • At 3 million operations per second (ops/sec), the p999 latency was ~10 ms.
  • R=3.2 mode demonstrated superior performance compared to R=1, particularly in tail latency. This is attributed to its ability to select the optimal server for read operations.
  • 2xR outperformed SCAR for large value sizes.
  • Data migration during quorum repair, performed via RPCs, introduced latency overhead.
  • CliqueMap is optimized for read-heavy workloads. Higher read percentages resulted in lower observed latencies.
  • Hardware vs. Software RDMA:
    • Pony Express demonstrated robust throughput and low latency even under high client load.
    • Hardware RDMA significantly outperformed software RDMA, even when limited to the 2xR strategy (hardware RDMA only supported 2xR).

Paper Review

This paper, from Google (with academic collaborators), describes a highly compelling caching system. It stands out as one of the most promising solutions I've encountered in academic literature. I'm hopeful it will be offered as a Google Cloud product. Beyond its caching innovations, the paper provides an accessible introduction to RDMA, a technology with significant cloud implications. I found it particularly insightful, even compared to similar works from companies like Facebook and Twitter.
