Presented at USENIX OSDI '20, this influential paper from VMware Research, with contributions from EPFL (Switzerland), has since received significant attention within the distributed systems community. Its lead author is Marcos Aguilera, a researcher at VMware.
Paper Link
Microservices Latency
Almost all backend systems are composed of a multi-level microservices architecture.
Even if 99% of requests to each backend are fast, the overall application can still be slow, because any single backend can introduce latency on a given request. To keep the system's latency low, the application needs each backend to be fast to four or five nines, i.e., only when the p99.99 or p99.999 latency is low will the overall latency be good. For example, consider a service that depends on five downstream services, each fast for 99.9% of requests. The probability that a random request is slow is 1 - (0.999 ^ 5), which is approximately 0.005, or 0.5%. In other words, there is still a 0.5% chance of a slow request, or a 99.5% chance of not being slow (a short numeric check follows the list below). This calculation still doesn't account for the upstream service itself running into issues. For example:
CPU: The thread processing responses from the downstream services may not be scheduled on time by the CPU.
Page fault: The thread processing the response may run into a page fault and get descheduled by the operating system.
Network congestion: The network path used by the thread to communicate with downstream services or clients might experience congestion.
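For the skeptical, here is a quick numeric check of the fan-out arithmetic above, assuming five independent dependencies, each fast for 99.9% of requests:

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // Five downstream services, each fast for 99.9% of requests.
    const double p_fast = 0.999;
    const int deps = 5;

    // A request is fast only if every dependency is fast.
    double p_all_fast = std::pow(p_fast, deps);  // ~0.995
    double p_slow     = 1.0 - p_all_fast;        // ~0.005, i.e., ~0.5%

    std::printf("P(slow request) = %.4f\n", p_slow);
    return 0;
}
```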
Jeff Dean has an excellent article, The Tail at Scale, on what it takes to optimize for tail latency in a cloud-like environment where thousands of jobs compete for resources. Researchers have explored various approaches to optimizing the tail latency of applications, including:
Steering interrupts to different CPU cores to improve processing efficiency.
Implementing custom kernel scheduling policies, such as the ghOSt scheduler introduced by Google for Linux, to prioritize certain tasks.
Using custom paging mechanisms, like huge pages, to reduce Translation Lookaside Buffer (TLB) misses and improve memory access times.
Microsecond Applications
Within the scope of this discussion, a pertinent question arises: is microsecond-level latency truly necessary? While human perception is generally limited to delays exceeding 10 milliseconds, the focus on microsecond-level latency is driven by several factors:
- As previously explained, latency exhibits an additive characteristic in layered systems. Higher-level systems depend heavily on the performance of underlying components. Even small latencies at each stage accumulate, eventually resulting in an overall latency that becomes noticeable to humans.
- Furthermore, this paper highlights additional examples of systems where low latency is critical. These include robot control systems, certain key-value stores designed for rapid access, and high-frequency trading platforms (although achieving success in HFT often necessitates even finer-grained latency at the nanosecond level).
Direct Memory Access
Recommended Read: Paper Insights - Eliminating Receive Livelock in an Interrupt-driven Kernel where I introduced how networking works.
All input/output (I/O) operations, whether to Network Interface Cards (NICs), Solid State Drives (SSDs), or other devices, ultimately involve moving data to or from Dynamic Random-Access Memory (DRAM).
Transmission (TX):
In a non-direct access scenario, the Central Processing Unit (CPU) would typically read the data packets from DRAM and then write them to the memory-mapped I/O device.
This approach is inefficient as it is slow and consumes significant CPU cycles. To mitigate this overhead, a Direct Memory Access (DMA) engine is employed as a hardware accelerator.
For transmission, the CPU or another controller provides the DMA engine with the starting memory address in DRAM and the length of the data to be transferred to the I/O device. The DMA engine then handles the transfer directly, freeing up the CPU.
Reception (RX):
For reception, the CPU or another controller provides the DMA engine with the destination memory address in DRAM where the incoming bytes from the I/O device should be written.
Consequently, the CPU is largely freed from involvement in the data reception path, allowing it to perform other tasks concurrently.
Virtually all modern Network Interface Cards (NICs) are equipped with integrated DMA engines to offload the burden of data transfer from the CPU, thereby improving network performance and reducing CPU utilization.
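As an illustration of the TX path, here is a hypothetical, highly simplified descriptor format and posting routine. Real NICs define their own descriptor layouts and doorbell mechanisms, so treat the names and fields here as assumptions, not any particular device's interface:

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical, simplified TX descriptor. The key idea: the CPU only hands the
// DMA engine a DRAM address and a length, and the NIC moves the bytes itself.
struct TxDescriptor {
    uint64_t dram_addr;  // physical (or IOMMU-translated) address of the packet in DRAM
    uint32_t length;     // number of bytes to transfer
    uint32_t flags;      // e.g., "descriptor owned by NIC", "last fragment"
};

// The driver fills a descriptor in a ring shared with the NIC, then rings a
// doorbell register; the NIC DMAs the packet out with no further CPU work.
void post_tx(TxDescriptor* ring, size_t idx, uint64_t pkt_addr, uint32_t len) {
    ring[idx].dram_addr = pkt_addr;
    ring[idx].length    = len;
    ring[idx].flags     = 0x1;   // hand ownership to the NIC (illustrative bit)
    // write_doorbell(idx);      // MMIO write telling the NIC a new descriptor is ready
}
```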
Remote Direct Memory Access (RDMA)
Within traditional networking stacks, when an application intends to retrieve data from a counterpart application operating on a remote machine, it initiates a Remote Procedure Call (RPC). This RPC execution requires a pathway through the operating system kernel and the Network Interface Card (NIC) on the sending host. Subsequently, the data request traverses the NIC and the kernel of the receiving host. Following this kernel-level processing, the appropriate application thread on the receiving machine is invoked to access and read the requested data from its memory. Finally, this retrieved data is transmitted back through the identical network stack, effectively retracing the original path of the request.
In contrast, when employing Remote Direct Memory Access (RDMA), the Network Interface Card (NIC) is granted direct access to specifically designated memory regions on the remote machine. This capability for remote direct access fundamentally eliminates the necessity for central processing unit (CPU) intervention and, as a consequence, completely bypasses the involvement of both the operating system kernel and application threads in the data transfer process. The NIC itself possesses the inherent ability to directly read the required memory locations on the remote system and transmit the corresponding response back to the requesting application.
Let's take a detailed look at the mechanics behind RDMA.
Memory Registration
It's important to understand that not all of the host's memory is available for remote access via RDMA. Specific portions of the host's memory must be explicitly registered with the Network Interface Card (NIC) to enable remote operations. This registration process makes the designated memory regions accessible to the NIC's RDMA controller, which can then perform read and write operations directly on those memory locations without CPU intervention.
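A minimal sketch of memory registration using the libibverbs API (the common user-space RDMA interface). Device discovery and protection-domain allocation are omitted; the helper name and parameters are my own:

```cpp
#include <infiniband/verbs.h>
#include <cstddef>

// Register `len` bytes starting at `buf` with the NIC so a remote peer can
// read/write them directly. `pd` is an already-allocated protection domain.
ibv_mr* register_region(ibv_pd* pd, void* buf, size_t len) {
    int access = IBV_ACCESS_LOCAL_WRITE |
                 IBV_ACCESS_REMOTE_READ |
                 IBV_ACCESS_REMOTE_WRITE;  // permissions enforced by the NIC, not the CPU
    ibv_mr* mr = ibv_reg_mr(pd, buf, len, access);
    // mr->lkey is used in local work requests; mr->rkey is handed to the
    // remote side so its NIC can name this region in RDMA reads/writes.
    return mr;
}
```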
Queue Pairs
A Queue Pair (QP) represents the fundamental communication endpoint of an RDMA connection. Any work request submitted to one end of a QP can be processed and received at the other end of the paired QP on the remote machine. Conceptually, a QP comprises two distinct queues: a Send Queue for initiating outgoing messages and a Receive Queue for handling incoming messages.
Messages placed onto the Send Queue are essentially commands directed towards the NIC on the local host. The client's CPU encodes these commands and places them into the Send Queue of the QP. Subsequently, the host's NIC hardware takes over and executes these commands. These commands can encompass fundamental memory operations such as:
Read: Retrieve data from a specified memory address on the remote host.
Write: Transfer and store data to a designated memory address on the remote host.
Atomic: Perform atomic memory operations like Compare and Swap (CAS) on a given memory address or Fetch and Increment a numerical value at a specific location. These operations are performed indivisibly by the remote NIC.
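To make the Send Queue commands concrete, here is a hedged sketch of posting an RDMA READ with libibverbs. The helper name is mine, and the remote address and rkey are assumed to have been exchanged out of band during connection setup:

```cpp
#include <infiniband/verbs.h>
#include <cstdint>

// Post an RDMA READ on the send queue of `qp`: fetch `len` bytes from
// `remote_addr` (named by the peer's rkey) into a local registered buffer.
// The remote CPU is never involved; its NIC serves the read.
int post_rdma_read(ibv_qp* qp, void* local_buf, uint32_t lkey,
                   uint64_t remote_addr, uint32_t rkey, uint32_t len) {
    ibv_sge sge{};
    sge.addr   = reinterpret_cast<uint64_t>(local_buf);
    sge.length = len;
    sge.lkey   = lkey;

    ibv_send_wr wr{};
    wr.opcode              = IBV_WR_RDMA_READ;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  // generate a completion we can poll for
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    ibv_send_wr* bad = nullptr;
    return ibv_post_send(qp, &wr, &bad);         // 0 on success
}
```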
RDMA Connection Types
RDMA supports several variations of connection semantics, offering different trade-offs in terms of reliability and connection establishment:
Reliable Connection (RC)
This type of RDMA connection prioritizes reliable data delivery. It leverages acknowledgements generated by the NIC hardware to ensure that commands are successfully received by the remote peer. Furthermore, RC guarantees the ordering of message delivery. To achieve this reliability, the underlying hardware implements mechanisms for retransmissions (retries), acknowledgements, and sequence number tracking, all managed directly by the NIC. The Reliable Connection (RC) in RDMA is often considered the functional equivalent of TCP in traditional networking.
Unreliable Connection (UC): In contrast to RC, Unreliable Connection does not provide guarantees for reliable delivery or ordering of messages.
Connection-oriented vs. Connectionless: RDMA also distinguishes between connection-oriented and connectionless communication. Both Reliable Connection (RC) and Unreliable Connection (UC) establish a dedicated communication channel between the QPs before data transfer can occur; the datagram transport described next is connectionless.
Unreliable Datagram (UD): This is a connectionless mode of communication, similar to UDP in traditional networking. No dedicated channel is established before communication. UD offers lower overhead but lacks the reliability and ordering guarantees of RC.
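The transport type is fixed when a queue pair is created. A minimal libibverbs sketch follows; the queue depths are illustrative, and the subsequent INIT -> RTR -> RTS state transitions and address exchange are omitted:

```cpp
#include <infiniband/verbs.h>

// The connection semantics are chosen at QP creation time:
// IBV_QPT_RC (reliable connection), IBV_QPT_UC (unreliable connection),
// or IBV_QPT_UD (unreliable datagram).
ibv_qp* create_rc_qp(ibv_pd* pd, ibv_cq* cq) {
    ibv_qp_init_attr attr{};
    attr.send_cq = cq;               // completions for posted sends
    attr.recv_cq = cq;               // completions for posted receives
    attr.qp_type = IBV_QPT_RC;       // swap for IBV_QPT_UD to get datagram semantics
    attr.cap.max_send_wr  = 128;     // queue depths are illustrative
    attr.cap.max_recv_wr  = 128;
    attr.cap.max_send_sge = 1;
    attr.cap.max_recv_sge = 1;
    return ibv_create_qp(pd, &attr); // still needs the INIT -> RTR -> RTS transitions
}
```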
Access Checks
To ensure memory protection and controlled access in RDMA environments, each registered memory region is associated with an access flag. This flag explicitly defines the permissible operations on that memory region, specifying whether it can be read from or written to remotely.
Similarly, each Queue Pair (QP) also carries access flags that define which remote operations the peer is allowed to perform on the host's registered memory when those operations arrive through that specific QP.
Collectively, the access flags associated with both the memory regions and the Queue Pairs provide a robust mechanism for implementing various access control policies, regulating how clients can interact with the host's registered memory.
Consensus
In summary, consensus algorithms address the fundamental challenge of achieving agreement on a single value among distributed processes. These algorithms must satisfy the following properties:
Termination (Liveness): Every non-faulty process eventually decides on a value.
Validity (Safety): If all non-faulty processes initially propose the same value, that value must be the decided value.
Agreement (Safety): All non-faulty processes agree on the same decided value.
Fault Tolerance: The algorithm must maintain its properties despite process failures.
One of the most important applications of consensus is the generation of distributed logs for replicated state machines (RSMs).
A distributed log is a sequence of ordered instructions that an RSM executes. Each log entry represents an instruction for the machines, ensuring that all nodes in the RSM execute these instructions in the same order.
For example, the following diagram shows an RSM with a single variable X stored in it. The instructions for the state machines can be blind-writes (X := Y) or read-writes (IF v(X) = Y THEN X := Z):
Note that different nodes of an RSM may be at different positions in the execution log but will eventually converge to the same final state.
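A small sketch of the idea, using a hypothetical instruction encoding that matches the example above (this is not code from the paper):

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Hypothetical instruction format: a blind write (X := Y) or a
// conditional read-write (IF X = Y THEN X := Z).
struct Instruction {
    enum Kind { BlindWrite, ReadWrite } kind;
    int64_t y;   // value to store (BlindWrite) or to compare against (ReadWrite)
    int64_t z;   // value to store when the comparison succeeds (ReadWrite only)
};

struct StateMachine {
    int64_t x = 0;       // the single replicated variable X
    size_t applied = 0;  // how far into the log this replica has executed

    // Deterministically apply log entries in order. Because every replica
    // applies the same entries in the same order, all replicas converge to
    // the same value of X, even if they are at different log positions now.
    void catch_up(const std::vector<Instruction>& log) {
        while (applied < log.size()) {
            const Instruction& in = log[applied++];
            if (in.kind == Instruction::BlindWrite) {
                x = in.y;
            } else if (x == in.y) {
                x = in.z;
            }
        }
    }
};
```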
Mu
Mu is a microsecond consensus system that makes use of RDMA at its core to achieve consensus.
Mu's architecture is divided into two planes:
1. Replication plane - This is where distributed logs are written.
2. Background plane - For management tasks such as leader election and permission changes.
All communication takes place over RDMA. One thing to note about RDMA is that the host-side CPU is not involved. Most communication happens as follows:
- The client uses RDMA to write its messages into the server's memory. The RC nature of the RDMA connection ensures that the messages are delivered to the server's memory.
- The server periodically polls that memory to discover new messages and act on them.
The same pattern applies to both the replication and background planes.
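Here is a hedged sketch of the receive side of this pattern. Mu's actual memory layout differs, but the shape is the same: remote writers fill slots in registered memory, and the local CPU discovers them by polling:

```cpp
#include <atomic>
#include <cstdint>
#include <cstddef>

// Hypothetical message slot inside the RDMA-registered region. The writer
// (a remote node, via RDMA write) fills `payload` first and flips `ready`
// last, so the host CPU never treats a half-written message as ready.
struct MessageSlot {
    std::atomic<uint8_t> ready{0};
    uint64_t payload;
};

// The host-side loop: since the NIC deposited the bytes without involving
// the CPU, the only way to notice new messages is to poll local memory.
template <typename Handler>
void poll_slots(MessageSlot* slots, size_t n, Handler handle) {
    for (;;) {
        for (size_t i = 0; i < n; ++i) {
            if (slots[i].ready.load(std::memory_order_acquire)) {
                handle(slots[i].payload);
                slots[i].ready.store(0, std::memory_order_release);
            }
        }
    }
}
```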
Roles
Each node in the system assumes the role of either leader or follower.
RDMA Memory Regions
The RDMA memory region is split into two parts - one for the replication plane and another for the background plane.
Permissions
All nodes have permission to write to each other's background plane. However, only the current leader is allowed to write to the replication plane.
The replication plane memory region consists of the following:
1. Proposal number - The next proposal number to use (present only on the leader node).
2. Logs - The actual distributed log entries, in sequence. To limit memory usage, the log is written into a circular buffer; old entries are erased once all nodes have caught up past a certain log position.
The background plane memory region consists of the following:
1. Leader election details - Details of the current leader. Additionally, the leader keeps track of the heartbeat sequence here.
2. Permission details
Replication Plane
Let's walk through how the replication plane works. The meat of the work is in Listing 2 of the paper.
Essentially, the authors have implemented the Paxos algorithm using RDMA. The implementation is simple to follow for anyone who is well-versed with Paxos:
- There is a proposer (aka the leader in this paper) that proposes a log entry for a particular log position (one that it believes is empty). The acceptors (aka the followers in this paper) reply with an ACK or with any value already present in that slot.
- The proposer then proposes a value for the slot - its own value if the slot was empty on all the acceptors, otherwise the value returned by the acceptors. If the proposer's original value is not accepted into a slot position, the proposer tries again for the next slot position.
The only catch in Mu's case is that the acceptors' (i.e., the followers') CPUs are not involved. Both of these steps are performed by the proposer (i.e., the leader) itself over RDMA. The followers' CPUs periodically fetch the committed logs and apply them to the state machine.
We will see in a while how a leader is elected. For now, assume we have a leader that has been granted write permission to the replication-plane memory region by a majority of followers, called its confirmed followers. Multiple leaders are possible - say, when one leader believes it has a majority of followers while some of those followers have already granted write permission to a new leader without the existing leader being aware of the change.
Each node maintains a Log struct consisting of the following fields:
1. minProposal - The Paxos minimum proposal number. The proposer can only send new proposals with a number that is larger than all minProposals.
2. FUO (first undecided offset) - The first slot position that is not yet set with any value. Note that all log entries up to the FUO are decided and can be applied directly.
3. slots[] - The entries <proposal number, value>, where the proposal number is the one under which the value was accepted, and the value is the actual log entry.
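Putting the fields together, here is a sketch of the structure in C++. The field names follow the description above; the types, slot count, and value encoding are illustrative, not the paper's exact layout:

```cpp
#include <cstdint>
#include <array>

// One slot of the replicated log.
struct Slot {
    uint64_t propNr;   // proposal number under which this value was accepted
    uint64_t value;    // the log entry itself (an opaque command in practice)
};

// A sketch of the per-node replication-plane state.
struct Log {
    uint64_t minProposal;          // proposals below this number are rejected
    uint64_t FUO;                  // first undecided offset: slots [0, FUO) are decided
    std::array<Slot, 4096> slots;  // circular buffer of log slots (size is illustrative)
};
```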
Let's walk through Listing 2, which is the main algorithm:
- Line 13-14 - As mentioned earlier, a proposer continuously executes the two phases of Paxos - PREPARE and ACCEPT - until its value is accepted into one of the log slots. Note that the next log entry needs to be written to the FUO position.
- Line 18-24 - The leader determines the proposal number it must use for the FUO slot (the highest FUO it knows of), then writes that proposal number to all confirmed followers. It also reads any existing value at that slot position on any node. This is like the Paxos PREPARE phase, where acceptors return an ACK or the existing value of the FUO slot.
- Line 25-28 - The most crucial lines: at the end of the PREPARE phase, if a value was returned, that same value must be used in the ACCEPT phase (alas!). The FUO is then incremented and the algorithm simply tries again for the next slot.
- Line 31-33 - If the slot is found to be empty on all confirmed followers, the leader's log entry is written to that position. If there is a race with another leader (multiple leaders are possible) - for example, if the slot was written by another leader - then the write aborts and the proposer retries for the next slot.
- Line 34-35 - Again very important: the proposer stops only when it has written its own proposed value to some slot.
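To tie the walkthrough together, here is a paraphrase of the leader's loop in C++. This is not the paper's Listing 2: the RDMA interactions are hidden behind two assumed helpers, `prepare` (write the proposal number to the confirmed followers and read back any existing value in the slot) and `accept` (write <proposal number, value> into the slot, failing if the leader lost permission or lost a race):

```cpp
#include <cstdint>
#include <functional>
#include <optional>

struct PrepareResult {
    bool ok;                           // false if a higher minProposal was observed
    std::optional<uint64_t> existing;  // value already accepted in this slot, if any
};

// Keep proposing until the leader's own value lands in some slot.
// fuo and propNr are the leader's local copies of the Log fields.
void propose(uint64_t& fuo, uint64_t& propNr, uint64_t myValue,
             const std::function<PrepareResult(uint64_t, uint64_t)>& prepare,
             const std::function<bool(uint64_t, uint64_t, uint64_t)>& accept) {
    bool ownValueCommitted = false;
    while (!ownValueCommitted) {
        PrepareResult pr = prepare(fuo, propNr);
        if (!pr.ok) { ++propNr; continue; }     // retry this slot with a higher number

        // If some follower already holds a value for this slot, that value must
        // be re-proposed; the leader's own value waits for the next slot.
        uint64_t value = pr.existing.value_or(myValue);

        if (!accept(fuo, propNr, value)) {
            ++propNr;                           // aborted write: another leader raced us
            continue;
        }
        ++fuo;                                  // slot decided, move on
        ownValueCommitted = !pr.existing.has_value();
    }
}
```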
Do multiple leaders harm consensus?
No, not at all. Paxos, by design, is safe; it sacrifices guaranteed termination in exchange (the FLP impossibility result). Let's explore the scenario:
Say there are two nodes that both believe they are the leader. This can happen when a majority of followers grant write permission to a new leader while the previous leader still believes itself to be the leader. In this scenario, the previous leader no longer holds write permission on a majority: any majority it contacts shares at least one node with the new leader's cohort, and that node has revoked the old leader's permission. As a result, the existing leader won't be able to write anything - neither proposal numbers nor log positions. It can still read and update itself.
Indeed, write permission on the replication-plane memory region is what acts as the shield, even though the CPU is not involved in the data path, because granting or revoking write permission does involve the host's CPU.
Catch-up
Both new leaders and existing followers may be behind in terms of logs. To catch up, they can read the FUO from all other nodes and blindly copy all logs up to the highest FUO into their own log (Listings 3 and 4).
They then set their FUO to the highest known FUO. Indeed, each aspiring leader has to do this before it can become the leader.
Why does this work? If the FUO has been incremented past a slot on even one replica, then the exact same command will eventually be reflected in that slot on every other replica. This is ensured by the monotonically increasing proposal numbers and the way Paxos works: all slots below the FUO on every replica always hold the exact same command.
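A sketch of the catch-up step under these assumptions. It paraphrases the idea of Listings 3 and 4 rather than reproducing them; the `RemoteView` type stands in for RDMA reads of each peer's replication-plane region:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical snapshot of a remote replica's replication plane, read over RDMA.
struct RemoteView {
    uint64_t fuo;                 // that replica's first undecided offset
    std::vector<uint64_t> slots;  // decided values in slots [0, fuo)
};

// Catch-up: find the highest FUO among the peers, copy every decided slot
// below it into the local log, then adopt that FUO. This is safe because
// slots below any replica's FUO hold identical, already-decided values.
void catch_up(std::vector<uint64_t>& localSlots, uint64_t& localFuo,
              const std::vector<RemoteView>& peers) {
    for (const RemoteView& p : peers) {
        if (p.fuo <= localFuo) continue;
        if (localSlots.size() < p.fuo) localSlots.resize(p.fuo);
        for (uint64_t i = localFuo; i < p.fuo; ++i) {
            localSlots[i] = p.slots[i];
        }
        localFuo = p.fuo;
    }
}
```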
Omitting Prepare Phase
This is a good optimization: after leader election, it is known that the leader has been granted write permission on its confirmed followers, so the next slot can be written directly with the leader's value - the minProposal value is already known. If the leader loses leadership, this is reflected by the fact that the write to a majority in the ACCEPT phase will fail.
Paper Review
This paper is well-written, effectively introducing all necessary background topics before delving into the design details. The evaluation presented is also comprehensive. There is no doubt that the problem addressed by the authors has broad applicability across numerous systems. Personally, I found this paper to be a more accessible read than Raft, and comparable in difficulty to Paxos. Readers with a solid understanding of RDMA should find the paper straightforward to follow. Given the adoption of these ideas within the industry, I would highly recommend reading this contribution.