Memcached is a popular in-memory cache, but I'd like to discuss CliqueMap, Google's caching solution. Having worked closely with CliqueMap, I have a deep understanding of its architecture. One major difference from Memcached is CliqueMap's use of RMA for reads. We'll also take a closer look at RDMA, a technology that became crucial to cloud data centers in the 2010s.
Let's begin with some basic concepts.
Network Interface Card (NIC)
CPU <-> Memory Communication
- Data Bus: Bidirectional, carrying the actual data being transferred.
- Address Bus: Unidirectional, carrying memory addresses for data access.
- Control Bus: Carrying control signals for synchronization and coordination.
NIC Interactions
Receive (RX)
- The NIC buffers incoming data into its internal memory.
- When its buffer fills, the NIC raises a maskable interrupt to the CPU.
- The kernel (i.e., software running on the CPU), in response to the interrupt, copies the data from the NIC's buffer into RAM.
- Higher-layer protocols then process the data (e.g., TCP/IP copying data to socket buffers), and the application is notified.
Transmit (TX)
- The application writes data to a memory buffer.
- The kernel copies the data to the NIC's buffer.
- The NIC transmits the data over the network.
Direct Memory Access (DMA)
Remote Direct Memory Access (RDMA)
A Brief History
- High-Performance Computing
- Data centers
- Cloud computing
- Storage systems (e.g., NVMe over Fabrics)
- Machine Learning and AI.
Hardware RDMA
Architecture
Implications
While hardware RDMA delivers a substantial reduction in network latency, bypassing the operating system kernel and application layers on the receiver side introduces several significant implications:
Increased Complexity in Client Protocol Implementation
Security Vulnerabilities
Software RDMA
Software RDMA emerged as a response to security concerns associated with hardware RDMA. In contrast to hardware RDMA, software RDMA employs dedicated kernel threads. These threads mediate communication between the NIC and application-registered memory regions.
This approach preserves the application-thread bypass, eliminating user-space thread scheduling overhead. Additionally, the kernel threads can execute custom logic and transmit data, enabling programmable network operations.
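As a rough illustration of this model (not Google's implementation), the sketch below uses a dedicated worker that serves one-sided reads out of application-registered memory regions without scheduling any application thread; all names are invented.

```go
package main

import (
	"fmt"
	"sync"
)

// readRequest models a one-sided read arriving from the network: copy
// `length` bytes out of a registered memory region into a reply buffer.
type readRequest struct {
	regionID int
	offset   int
	length   int
	reply    chan []byte
}

func main() {
	// Memory regions the application has registered with the RDMA stack.
	regions := map[int][]byte{
		1: []byte("hello, remote memory"),
	}

	nicQueue := make(chan readRequest, 64) // stands in for the NIC's request queue
	var wg sync.WaitGroup
	wg.Add(1)

	// Dedicated "kernel" thread: services requests against registered regions
	// directly, so no application thread runs on the read path.
	go func() {
		defer wg.Done()
		for req := range nicQueue {
			region := regions[req.regionID]
			out := make([]byte, req.length)
			copy(out, region[req.offset:req.offset+req.length])
			req.reply <- out
		}
	}()

	// A remote peer's read request arrives via the NIC.
	reply := make(chan []byte, 1)
	nicQueue <- readRequest{regionID: 1, offset: 7, length: 13, reply: reply}
	fmt.Printf("remote read returned: %q\n", <-reply)

	close(nicQueue)
	wg.Wait()
}
```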
Software RDMA offers a sweet spot between hardware RDMA and the traditional network stack.
Pony Express
Google's Snap microkernel networking architecture leverages a custom scheduler for CPU-NIC interactions, replacing interrupt-driven methods. This allows dedicated CPU cores to fully utilize NIC capacity, removing CPU bottlenecks. Snap's extensibility permits user-defined code (engines) to handle network requests on these dedicated cores. Pony Express is one such engine.
Pony Express features a two-layered architecture: an application-facing layer that manages state and interacts with applications, including shared memory registration, and a lower layer that handles message processing and flow control.
Pony Express supports one-sided operations, such as RDMA reads, that extend beyond basic memory access. For example, it can perform hash table lookups if the application registers a hash map data structure.
Applications can utilize dedicated or shared Pony engine cores, communicating via shared command queues.
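As a simplified mental model (all names invented), an application thread posts operations onto a shared command queue while a dedicated engine core spins on it and runs custom logic, such as a hash table lookup on behalf of a remote peer:

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// op is one command an application posts to its Pony-style engine.
type op struct {
	kind string
	key  string
	done chan string
}

func main() {
	cmdQueue := make(chan op, 1024) // shared command queue in "shared memory"
	var stop atomic.Bool

	// Dedicated engine core: spins on the queue instead of sleeping on
	// interrupts, and can run custom per-op logic (here, a map lookup).
	go func() {
		table := map[string]string{"user:42": "alice"}
		for !stop.Load() {
			select {
			case o := <-cmdQueue:
				o.done <- table[o.key] // one-sided lookup served without the app thread
			default:
				runtime.Gosched() // keep polling
			}
		}
	}()

	// Application thread posts an operation and waits for completion.
	done := make(chan string, 1)
	cmdQueue <- op{kind: "lookup", key: "user:42", done: done}
	fmt.Println("result:", <-done)
	stop.Store(true)
}
```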
Performance
While traditional RPCs achieve around 100,000 IOPS per core, Pony threads can reach 5 million IOPS per core.
The throughput of Pony threads is influenced by the network's maximum transmission unit (MTU). Larger MTUs can improve NIC saturation but increase retransmission costs. Evaluations indicate that a single Pony thread can saturate 67 Gbps with a 5 KB MTU and 38 Gbps with a 1500B MTU. Considering modern 100 Gbps NICs, a few Pony threads can effectively saturate the NIC.
For conservative estimations, applications can assume 10 Gbps per Pony thread, translating to 250,000 IOPS per thread with a 5 KB MTU.
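The back-of-the-envelope arithmetic behind that estimate:

```go
package main

import "fmt"

func main() {
	const bitsPerSec = 10e9 // conservative budget: 10 Gbps per Pony thread
	const mtuBytes = 5000   // 5 KB MTU (treated as 5000 bytes)
	iops := bitsPerSec / 8 / mtuBytes
	fmt.Printf("%v IOPS per thread\n", iops) // 250000
}
```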
Hardware vs. Software RDMA Comparison
- Software RDMA can achieve higher user-defined IOPS because the CPU can execute custom logic alongside data movement. For instance, software RDMA can implement custom flow control mechanisms to mitigate fabric pauses, a limitation of hardware RDMA. This is one of those cases where software outperforms hardware, much like in virtualization.
- Software RDMA offers significantly greater flexibility compared to hardware RDMA. Its reprogrammable nature allows for continuous innovation and adaptation even after deployment.
- Hardware RDMA, however, eliminates CPU overhead entirely. Software RDMA does incur CPU costs, but they are small enough to be nearly negligible.
NIC - The Limiting Factor of Modern Networking
Caching
Caching plays a crucial role in optimizing serving systems, which often rely on replication to handle high client request volumes. Caches are essentially volatile key-value stores where both keys and values are stored in memory only. There are no persistence guarantees, so one must not rely on them for data correctness. They should only be used for performance improvements.
Cache Strategies
Two primary caching strategies exist:
In-Memory Cache
This approach embeds the cache directly within each serving process.
- Each replica incurs the full RAM cost of its local cache.
- Cached data is not shared across replicas, leading to redundant database fetches (e.g., if replica 1 caches key X, replica 2 must still retrieve it independently).
Distributed Cache
A separate, cluster-deployed caching layer is utilized.
- Reduces memory overhead by storing each key-value pair only once.
- Enables data sharing across serving replicas.
- Introduces network latency as serving processes must communicate with the cache cluster. Numerous optimization techniques exist to mitigate this latency.
Cache Write Policies
Write-Through
With a write-through cache, every write is applied to the cache and propagated synchronously to the backing store, so the two never diverge.
Write-Back
With a write-back cache, write operations are initially performed only on the cache. Updates to the backing data store are delayed and performed asynchronously, typically based on a timer or when a cache line is evicted.
Write-Around
Data is written directly to the backing store, bypassing the cache. This is used to prevent cache pollution, where infrequently written data fills the cache.
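To make the three policies concrete, here is a minimal sketch of the write paths; the cache, store, and dirty-set types are invented for illustration.

```go
package main

import "fmt"

type store map[string]string // stands in for the backing database
type cache map[string]string // stands in for the in-memory cache

// writeThrough updates the cache and the backing store synchronously.
func writeThrough(c cache, s store, k, v string) {
	c[k] = v
	s[k] = v
}

// writeBack updates only the cache; dirty keys are flushed to the store
// later (on a timer or on eviction), which is elided here.
func writeBack(c cache, dirty map[string]bool, k, v string) {
	c[k] = v
	dirty[k] = true
}

// writeAround bypasses the cache to avoid polluting it with rarely-read
// data; any stale cached copy is dropped.
func writeAround(c cache, s store, k, v string) {
	s[k] = v
	delete(c, k)
}

func main() {
	c, s, dirty := cache{}, store{}, map[string]bool{}
	writeThrough(c, s, "a", "1")
	writeBack(c, dirty, "b", "2")
	writeAround(c, s, "c", "3")
	fmt.Println(c, s, dirty)
}
```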
CliqueMap
CliqueMap, a widely used key-value store at Google, builds upon the foundational principles of Pilaf, an RMA-based storage system. While its underlying ideas did not originate at Google, CliqueMap has become a critical component of Google's infrastructure.
Key architectural choices include:
- Reads: CliqueMap leverages RDMA for read operations, significantly reducing latency compared to traditional RPC. However, it falls back to RPC in certain scenarios.
- Writes: All write operations in CliqueMap are handled via RPC.
CliqueMap vs. Memcached
CliqueMap offers several advantages over the industry-standard Memcached:
RDMA-Powered Reads
Unlike Memcached, which relies solely on RPC, CliqueMap utilizes RDMA for reads. At scale, RPC introduces considerable overhead (authentication, flow control, etc.), impacting performance. RDMA provides a more efficient approach, especially for read-heavy workloads, common in caching scenarios.
Native Replication
Memcached lacks built-in replication, requiring external solutions (e.g., Redis wrappers) for redundancy. CliqueMap is designed with native replication capabilities, enhancing data availability and fault tolerance.
Data Organization
A CliqueMap cluster employs multiple replicas. Each replica organizes data into distinct regions: a data region and an index region.
Data Region
The data region is segmented into slabs, and a slab-based allocator manages memory for value storage. Slab sizes are configurable to optimize for specific workloads. The memory allocated for slabs can be dynamically resized by the CliqueMap server threads (via RPC writes) to accommodate growing data.
Each data entry in the data region comprises the following fields (a sketch of this layout follows the list):
- Format: A versioning identifier.
- Key Length: Specifies the size of the key.
- Key: The key itself.
- Data Length: Specifies the size of the value.
- Data: The value associated with the key.
- Metadata: Includes the version number, ensuring data consistency.
- Checksum: A checksum of the entire entry, crucial for data integrity.
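A rough Go rendering of that layout might look like the following; the field widths, encoding, and checksum function are assumptions rather than CliqueMap's actual wire format.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

// dataEntry is an illustrative layout for an entry in the data region.
type dataEntry struct {
	Format  uint16 // versioning identifier for the entry layout
	Key     []byte
	Value   []byte
	Version uint64 // metadata: version number of this key-value pair
}

// encode serializes format, key length, key, data length, data, and
// metadata, then appends a checksum over the whole record.
func (e dataEntry) encode() []byte {
	var buf bytes.Buffer
	binary.Write(&buf, binary.LittleEndian, e.Format)
	binary.Write(&buf, binary.LittleEndian, uint32(len(e.Key)))
	buf.Write(e.Key)
	binary.Write(&buf, binary.LittleEndian, uint32(len(e.Value)))
	buf.Write(e.Value)
	binary.Write(&buf, binary.LittleEndian, e.Version)
	sum := crc32.ChecksumIEEE(buf.Bytes()) // integrity check for readers
	binary.Write(&buf, binary.LittleEndian, sum)
	return buf.Bytes()
}

func main() {
	e := dataEntry{Format: 1, Key: []byte("user:42"), Value: []byte("alice"), Version: 7}
	fmt.Printf("encoded entry: %d bytes\n", len(e.encode()))
}
```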
Index Region
The index region is organized into set-associative buckets, each holding a fixed number of index entries; a key's hash determines its bucket. Each index entry contains (a sketch follows the list):
- KeyHash: The hash of the key, used for bucket assignment.
- VersionNumber: A unique, monotonically increasing version number that's updated on every write.
- Pointer: The memory address of the corresponding value in the data region.
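These fixed-size entries could be modeled as below; the field widths and the bucket associativity of 8 are illustrative assumptions, not CliqueMap's actual parameters.

```go
package main

import "fmt"

// indexEntry is an illustrative fixed-size entry in the index region.
type indexEntry struct {
	KeyHash uint64 // hash of the key, used for bucket assignment
	Version uint64 // monotonically increasing, bumped on every write
	Pointer uint64 // address/offset of the entry in the data region
}

// bucket is a set-associative group of index entries; fixed size means a
// client can fetch a whole bucket with a single RMA read.
type bucket [8]indexEntry

func main() {
	var b bucket
	b[0] = indexEntry{KeyHash: 0xdeadbeef, Version: 3, Pointer: 4096}
	fmt.Printf("bucket holds %d entries, first: %+v\n", len(b), b[0])
}
```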
Consistent Hashing
Keys are mapped to backends via consistent hashing, so adding or removing a backend remaps only a small fraction of the key space rather than reshuffling every key.
Replication
CliqueMap utilizes quorum replication, supporting:
- R=1: No replication (single replica).
- R=2: Replication across two replicas (for immutable key-value pairs).
- R=3.2: Replication across three replicas, requiring a majority of two for operations.
Reading a Key-Value Pair (GET)
Reading a key-value pair in CliqueMap involves the following steps (a client-side sketch follows the list):
- The key is hashed. The corresponding replica backends are identified based on the hash.
- The key-value pair is retrieved from those replicas using RDMA.
- Version numbers of the retrieved values are compared to ensure consistency.
- In R=2 mode, versions must match.
- In R=3.2 mode, a majority (2 out of 3) must match. The majority version number is the consistent version number.
- Finally, the client verifies the checksum of the fetched data. The retrieved key is compared to the original key to mitigate potential (though rare) hash collisions.
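Putting those steps together, a client-side GET in R=3.2 mode might look roughly like the sketch below; the replica type, rmaRead helper, and quorum logic are simplified stand-ins, and the checksum verification is elided.

```go
package main

import (
	"errors"
	"fmt"
	"hash/fnv"
)

// readResult is what one replica returns for an RMA read.
type readResult struct {
	Version uint64
	Key     string
	Value   string
}

// replica stands in for one CliqueMap backend reachable via RMA.
type replica map[string]readResult

func rmaRead(r replica, key string) (readResult, bool) {
	res, ok := r[key]
	return res, ok
}

// get hashes the key, reads from all replicas, and requires a majority
// (2 of 3) to agree on the version before accepting the value.
func get(replicas []replica, key string) (string, error) {
	h := fnv.New64a()
	h.Write([]byte(key))
	_ = h.Sum64() // in the real system this selects the backends/bucket

	counts := map[uint64]int{}
	values := map[uint64]readResult{}
	for _, r := range replicas {
		if res, ok := rmaRead(r, key); ok {
			counts[res.Version]++
			values[res.Version] = res
		}
	}
	for v, n := range counts {
		if n >= 2 { // quorum of 2 out of 3
			res := values[v]
			if res.Key != key { // guard against rare hash collisions
				return "", errors.New("key mismatch")
			}
			return res.Value, nil
		}
	}
	return "", errors.New("no quorum: fall back to RPC or trigger repair")
}

func main() {
	entry := readResult{Version: 9, Key: "user:42", Value: "alice"}
	replicas := []replica{
		{"user:42": entry},
		{"user:42": entry},
		{"user:42": {Version: 8, Key: "user:42", Value: "stale"}},
	}
	v, err := get(replicas, "user:42")
	fmt.Println(v, err)
}
```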
Read Strategies
CliqueMap employs several strategies for reads:
2xR (Two Round Trips)
- Retrieving the bucket from the index region of all replicas using the key hash.
- Fetching the actual value from the preferred backend's data region.
SCAR (Scan and Read)
Performs a single RDMA round trip. Pony Express reads the index region, follows the data region pointer, and returns the data entry. The Pony thread needs to execute some custom logic to make this possible, which consumes more server-side CPU but reduces latency. However, the value is returned by all the replicas, so SCAR is preferable for small value sizes.
RPC Fallback
If RDMA reads cannot provide the value (as in the case of overflow buckets, described later), a fallback to traditional RPC is used. This is significantly slower than 2xR and SCAR.
Relaxed Reads
CliqueMap also offers relaxed reads, which prioritize performance over consistency. In this mode, the value is fetched from a single replica. This saves both server-side computation and client-side bandwidth consumption. However, the retrieved value may be outdated.
Writing to a Key-Value Pair (SET)
Total Order on Mutations
CliqueMap ensures total ordering of key-value pair updates, leveraging Google's TrueTime, a distributed time synchronization system. Each mutation is assigned a unique, monotonically increasing version number of the form <TrueTime, Client ID, Sequence Number>.
This versioning system means that mutation order is determined by the version number, not necessarily the arrival time of requests. While this might seem to violate consistency, TrueTime guarantees that clients eventually generate higher version numbers for subsequent requests, maintaining consistency.
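Under that scheme, ordering two mutations reduces to a lexicographic comparison of the three components; here is a minimal sketch with invented field widths.

```go
package main

import "fmt"

// version is an illustrative <TrueTime, ClientID, SequenceNumber> tuple.
type version struct {
	TrueTime uint64
	ClientID uint64
	Sequence uint64
}

// less orders mutations lexicographically by the three components, so the
// winner of two concurrent writes is well-defined regardless of arrival order.
func less(a, b version) bool {
	if a.TrueTime != b.TrueTime {
		return a.TrueTime < b.TrueTime
	}
	if a.ClientID != b.ClientID {
		return a.ClientID < b.ClientID
	}
	return a.Sequence < b.Sequence
}

func main() {
	a := version{TrueTime: 100, ClientID: 7, Sequence: 1}
	b := version{TrueTime: 100, ClientID: 7, Sequence: 2}
	fmt.Println(less(a, b)) // true: b supersedes a
}
```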
Erase Operations
The erase operation removes a key-value pair from the cache. To prevent conflicts with concurrent mutations, erase also uses version numbers. After an erase, the key-value pair is moved to a fully-associative tombstone cache. This ensures that reads for older versions of the key-value pair can still be served.
Compare-And-Swap (CAS)
CAS is a crucial primitive for CliqueMap. It performs a mutation only if the current version number of the key-value pair matches a provided version number. This is essential for implementing transactional behavior on single key-value pairs (CliqueMap, like many key-value stores, doesn't support multi-key transactions).
The client reads the current value and its version number, and the subsequent mutation succeeds only if the version numbers match. This enables atomic updates and prevents race conditions.
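A read-modify-write built on this primitive is typically a retry loop; the sketch below uses a toy in-process server rather than CliqueMap's actual client API.

```go
package main

import (
	"fmt"
	"sync"
)

// entry pairs a value with the version number assigned by the last mutation.
type entry struct {
	value   int
	version uint64
}

// server is a toy single-key backend offering read and compare-and-swap.
type server struct {
	mu sync.Mutex
	e  entry
}

func (s *server) read() entry {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.e
}

// cas applies the mutation only if the stored version still matches.
func (s *server) cas(expected uint64, newValue int) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.e.version != expected {
		return false // someone else mutated the key first
	}
	s.e = entry{value: newValue, version: expected + 1}
	return true
}

// increment retries read + CAS until no concurrent writer interferes.
func increment(s *server) {
	for {
		cur := s.read()
		if s.cas(cur.version, cur.value+1) {
			return
		}
	}
}

func main() {
	s := &server{}
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); increment(s) }()
	}
	wg.Wait()
	fmt.Println(s.read().value) // 10: no update was lost to a race
}
```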
Cache Evictions
As a volatile key-value store, CliqueMap can lose data without impacting system correctness. Eviction, the process of removing key-value pairs, is essential for accommodating new key-value pairs.
Eviction Types
CliqueMap employs two primary eviction types:
- Capacity Evictions: These occur when the cache reaches its memory capacity and needs to free space for new key-value pairs.
- Associativity Evictions: Due to the set-associative bucket structure, each bucket has a limited number of index entries. If numerous keys map to the same bucket, some keys must be evicted to make room for new entries.
It is possible to address associativity evictions using overflow buckets. When a bucket overflows, new key-value pairs are stored in these overflow buckets. While this avoids immediate eviction, accessing data in overflow buckets incurs an RPC overhead.
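To make the eviction-versus-overflow decision concrete, here is a toy insert path for a set-associative bucket; the bucket width, victim policy, and overflow handling are all assumptions.

```go
package main

import "fmt"

const bucketWidth = 4 // assumed associativity

type slot struct {
	keyHash uint64
	used    bool
}

type bucket struct {
	slots    [bucketWidth]slot
	overflow []slot // entries here require an RPC fallback instead of RMA reads
}

// insert places a key hash in the bucket, either spilling into the overflow
// bucket or evicting an existing entry (associativity eviction).
func (b *bucket) insert(h uint64, useOverflow bool) {
	for i := range b.slots {
		if !b.slots[i].used {
			b.slots[i] = slot{keyHash: h, used: true}
			return
		}
	}
	if useOverflow {
		b.overflow = append(b.overflow, slot{keyHash: h, used: true})
		return
	}
	b.slots[0] = slot{keyHash: h, used: true} // evict a victim (policy elided)
}

func main() {
	var b bucket
	for h := uint64(1); h <= 6; h++ {
		b.insert(h, true)
	}
	fmt.Printf("in-bucket: %d, overflow: %d\n", bucketWidth, len(b.overflow))
}
```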
Eviction Policies
CliqueMap supports standard eviction policies, such as Least Recently Used (LRU). However, the RDMA-based read operations present a challenge for tracking key access times.
To address this, clients communicate key access times to the server backends via separate batch RPCs. This allows the server to approximate LRU behavior despite the use of RDMA for reads.
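One way to picture this is a client that buffers access timestamps and flushes them in periodic batch RPCs; the reporter below is a sketch, not CliqueMap's actual protocol, and the flush threshold is made up.

```go
package main

import (
	"fmt"
	"time"
)

// accessRecord is one client-observed read of a key.
type accessRecord struct {
	Key  string
	When time.Time
}

// accessReporter buffers access times and flushes them as one batch RPC,
// so RMA reads still feed the server's approximate-LRU bookkeeping.
type accessReporter struct {
	pending []accessRecord
	flush   func(batch []accessRecord) // stands in for the batch RPC
}

func (r *accessReporter) note(key string) {
	r.pending = append(r.pending, accessRecord{Key: key, When: time.Now()})
	if len(r.pending) >= 128 { // flush threshold is an assumption
		r.flush(r.pending)
		r.pending = nil
	}
}

func main() {
	r := &accessReporter{flush: func(batch []accessRecord) {
		fmt.Printf("batch RPC with %d access records\n", len(batch))
	}}
	for i := 0; i < 300; i++ {
		r.note("user:42")
	}
	fmt.Printf("%d records still buffered\n", len(r.pending))
}
```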
Quorum Repairs
In R=3.2 mode, each key is replicated across three replicas. The loss of one replica results in a dirty quorum, while the loss of two leads to an inquorate state. CliqueMap implements on-demand quorum repair to recover key-value pairs in these scenarios.
Quorum repair is also triggered when a replica restarts.
The repair process involves the following steps:
- Replicas scan their cohorts (the other replicas holding the same key) to determine the current version of the key-value pairs.
- If version numbers don't match, replicas fetch the latest version of the key.
This mechanism is also valuable for updating stale replicas, which might have missed a mutation (since R=3.2 only requires two successful writes out of three).
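The heart of the repair pass is a version comparison across cohorts, roughly as sketched below; the cohort scan and the RPC that transfers the newer entry are elided.

```go
package main

import "fmt"

// replicaView maps keys to the version each replica currently holds.
type replicaView map[string]uint64

// repair brings `self` up to the highest version seen across its cohorts,
// which fixes both dirty quorums and replicas that missed a mutation.
func repair(self replicaView, cohorts []replicaView, fetch func(key string, v uint64)) {
	for key, myVersion := range self {
		latest := myVersion
		for _, c := range cohorts {
			if v, ok := c[key]; ok && v > latest {
				latest = v
			}
		}
		if latest > myVersion {
			fetch(key, latest) // pull the newer entry via RPC
		}
	}
}

func main() {
	self := replicaView{"user:42": 3, "user:7": 9}
	cohorts := []replicaView{{"user:42": 5, "user:7": 9}, {"user:42": 5}}
	repair(self, cohorts, func(key string, v uint64) {
		fmt.Printf("repairing %s to version %d\n", key, v)
	})
}
```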
Evaluation
The authors conducted a comprehensive evaluation of the caching system, focusing on key performance metrics, particularly tail latency (p95, p99, p999). Minimizing tail latency is crucial for caching systems, as they underpin downstream applications, directly impacting overall system performance.
Key evaluation findings include:
- At 3 million operations per second (ops/sec), the p999 latency was ~10 ms.
- R=3.2 mode demonstrated superior performance compared to R=1, particularly in tail latency. This is attributed to its ability to select the optimal server for read operations.
- 2xR outperformed SCAR for large value sizes.
- Data migration during quorum repair, performed via RPCs, introduced latency overhead.
- CliqueMap is optimized for read-heavy workloads. Higher read percentages resulted in lower observed latencies.
- Hardware vs. Software RDMA:
- Pony Express demonstrated robust throughput and low latency even under high client load.
- Hardware RDMA significantly outperformed software RDMA, even though it was limited to the 2xR strategy (the only strategy hardware RDMA supports).