Paper Insights #20 - Paxos Made Simple

We will examine Paxos, a cornerstone consensus protocol within distributed systems. While the title might imply a gentle introduction, Leslie Lamport's original paper provides a comprehensive, dense treatment of Paxos. This influential work is essential for anyone studying computer science.

Paper Link

Let's begin!

Basic Definitions

We begin with fundamental definitions regarding failures in distributed systems. 

Failure Types

Two primary failure modes are commonly observed in distributed systems:

  • Crash Failures: These occur when a process terminates unexpectedly, potentially due to hardware malfunctions, software errors, or power outages.
  • Byzantine Failures: These involve processes exhibiting arbitrary or malicious behavior, such as sending incorrect data or deviating from protocol specifications.

While handling Byzantine failures is crucial in internet-scale systems and decentralized protocols like blockchains, distributed systems in controlled environments, such as those within an organization's network, typically only need to tolerate crash faults. This simplification assumes non-malicious node behavior, which is reasonable when all nodes are under a single administrative domain.

Network Types

Furthermore, we distinguish between:

  • Asynchronous Network: These networks have no upper bound on packet delivery times.
  • Synchronous Network: These networks assume bounded packet delivery times.

Real-world networks are inherently asynchronous. Synchronous systems are primarily theoretical constructs. Partially synchronous networks, relevant for Byzantine fault tolerance, are outside the scope of this discussion.

Consensus

Consensus algorithms address the fundamental challenge of achieving agreement on a single value among distributed processes. These algorithms must satisfy the following properties:

  • Termination (Liveness): Every non-faulty process eventually decides on a value.
  • Integrity (Safety): If all non-faulty processes initially propose the same value, that value must be the decided value.
  • Agreement (Safety): All non-faulty processes agree on the same decided value.
  • Fault Tolerance: The algorithm must maintain its properties despite process failures.

Consensus algorithms are essential for distributed systems, serving as the foundation for coordination services like Apache ZooKeeper. Their role is indispensable in building reliable distributed applications.

FLP Impossibility

Having defined consensus, we now examine the FLP (Fischer-Lynch-Paterson) impossibility result, as presented in Impossibility of Distributed Consensus with One Faulty Process. This theorem establishes that -

In an asynchronous network, no consensus protocol can simultaneously guarantee safety, liveness, and fault tolerance.

The original paper provides a rigorous mathematical proof under the assumption of an asynchronous network where -

  • Messages may be arbitrarily delayed and reordered.
  • Messages are eventually delivered exactly once.

This is the best-case assumption for an asynchronous network; real networks are often less reliable. The FLP theorem demonstrates that even with a single process crash fault, consensus cannot be guaranteed. I would highly recommend reading the paper.

Given the prevalence of crash faults and the inherent asynchronicity of real-world networks, the FLP result implies that no consensus algorithm can guarantee both safety and termination in practice!

FLP Impossibility vs. CAP Theorem

Although both FLP Impossibility and CAP Theorem address limitations in distributed systems, they differ significantly. 

FLP states that in an asynchronous network with eventual message delivery, termination cannot be guaranteed if even one process may crash. CAP, conversely, addresses distributed data stores and posits that in a network with potential message loss, consistency and availability cannot be simultaneously guaranteed.

The key difference lies in their network assumptions. FLP assumes a more benign network than CAP: messages are eventually delivered, whereas CAP permits them to be lost entirely. An impossibility proven under more favorable assumptions is the stronger result, which is why FLP is the more powerful and more widely celebrated of the two.

Recommended Reading: A Brief Tour of FLP Impossibility for a deeper insight.

Practical Consensus Algorithms

Despite the theoretical limitations imposed by the FLP impossibility result, practical consensus algorithms have been developed that perform effectively in most real-world scenarios. These algorithms represent compromises that prioritize practicality.

While compromises are made, safety remains paramount. Liveness is usually what gets relaxed: termination is only probabilistic, but with enough randomness (e.g., randomized timeouts and backoff) it is all but assured in practice.


| Algorithm | Network | Safety | Liveness | Crash Faults | Byzantine Faults |
|---|---|---|---|---|---|
| Paxos | Asynchronous | Yes | No | Yes | No |
| Raft | Asynchronous | Yes | No | Yes | No |
| PBFT | Synchronous / Partially Synchronous | Yes | Yes | Yes | Yes |
| Dolev-Strong | Synchronous | Yes | Yes | Yes | Yes |
| Honeybadger | Asynchronous | Yes | No* | Yes | Yes |
| PoW | Asynchronous | No* | No* | Yes | Yes |

* The property does hold, but only probabilistically. In PoW, for example, there is a tiny chance that two blocks are created simultaneously, leading to a temporary fork in the blockchain. However, the protocol is designed so that the chain with the most accumulated work eventually becomes the canonical chain, making lasting disagreement highly improbable.

This article focuses on crash-fault tolerant algorithms, specifically Paxos and Raft. These algorithms are widely applicable to private network-based systems, which constitute a significant portion of industrial deployments.

Paxos

Following the introductory remarks, we will now delve into a comprehensive explanation of the Paxos consensus protocol. Paxos is fundamentally a consensus algorithm designed to function in an asynchronous network environment. It operates under the assumption that the system experiences only crash faults and does not exhibit Byzantine faults. The primary objective of Paxos is to enable agreement among nodes while guaranteeing safety, though not strict liveness.

Node Roles

Within the Paxos algorithm, each node is assigned one of three distinct roles:

  • Proposer: This node initiates proposals for values that the system should agree upon.
  • Acceptor: This node responds to proposals and votes on proposed values.
  • Learner: This node learns the agreed-upon value.

The Algorithm (Read Carefully)

The overarching goal of this algorithm is to ensure that all participating nodes agree on a single value, denoted as v. The algorithm proceeds in two primary phases:

Prepare Phase

  • A proposer selects a unique proposal number, N, and sends a PREPARE request to all acceptors. Each acceptor, upon receiving a prepare request, may choose to accept the proposal number N if, and only if, N is greater than all proposal numbers it has previously accepted.
  • If an acceptor accepts the proposal number N:
    • It promises not to accept any proposal number less than N in the future.
    • It sends a reply back to the proposer, which includes the highest-numbered proposal (along with its value), if any, that it has previously accepted.

Accept Phase

  • Once the proposer receives responses from a majority of acceptors, it associates a value V with the proposal number N and sends an ACCEPT request to all acceptors. There are two possibilities for the value V:
    • V is the value that the proposer initially intended to propose. This is the ideal scenario and occurs when no acceptor reports a previously accepted proposal, i.e., no other proposer has managed to complete an accept phase.
    • V is the value of the highest-numbered proposal reported by the acceptors in their prepare replies, i.e., a value that may already have been chosen by a majority.
  • Every acceptor that receives the ACCEPT request then associates the value V with N, provided it has not, in the meantime, accepted a proposal number higher than N.
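To make the two phases concrete, here is a minimal single-decree Paxos sketch in Python. It is an illustration under simplifying assumptions, not the paper's specification: acceptors are in-memory objects, networking, persistence, and crash recovery are omitted, and all class and function names (Acceptor, propose, etc.) are my own.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Acceptor:
    promised_n: int = -1                          # highest proposal number this acceptor has accepted
    accepted: Optional[Tuple[int, str]] = None    # (proposal number, value) of the last accepted proposal

    def prepare(self, n: int) -> Tuple[bool, Optional[Tuple[int, str]]]:
        """Prepare phase: promise not to accept proposals numbered below n."""
        if n > self.promised_n:
            self.promised_n = n
            return True, self.accepted            # reply with the highest previously accepted proposal, if any
        return False, None

    def accept(self, n: int, value: str) -> bool:
        """Accept phase: accept (n, value) unless a higher number has been promised."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted = (n, value)
            return True
        return False

def propose(acceptors: List[Acceptor], n: int, my_value: str) -> Optional[str]:
    """Run one round of single-decree Paxos; returns the chosen value, or None if preempted."""
    majority = len(acceptors) // 2 + 1

    # Prepare phase: collect promises plus any previously accepted proposals.
    promises = [reply for ok, reply in (a.prepare(n) for a in acceptors) if ok]
    if len(promises) < majority:
        return None                               # preempted; must retry with a higher n

    # Use the value of the highest-numbered accepted proposal reported, else our own value.
    reported = [p for p in promises if p is not None]
    value = max(reported)[1] if reported else my_value

    # Accept phase: the value is chosen once a majority of acceptors accept it.
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None

acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, n=1, my_value="X=42"))   # 'X=42'
print(propose(acceptors, n=2, my_value="X=99"))   # still 'X=42' — the first chosen value sticks
```

Note how the second call, despite proposing a different value, ends up re-proposing the already-chosen value: this is exactly the safety argument discussed next.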

Safety

The algorithm guarantees that the first chosen value (the first value accepted by a majority, under any proposal number) will be the only value ever chosen (under any higher proposal number as well).

Let's start with P2(c) and work backwards. P2(c) states that once a majority of acceptors have accepted a proposal number N and a corresponding value V, that value V is committed. This happens as follows (it is also intuitive from the algorithm):

  • The proposer, in the prepare phase, learns the highest proposal number that acceptors have already accepted. If the proposer's proposed number is less than the highest number known to the acceptors, it cannot proceed and has to try again.
  • The acceptors promise (see P1(a)) not to accept any proposal number less than N in the future.
  • As a result, once a majority of acceptors accept a proposal number N, it can be safely assigned a value V.

Using P2(c), we can now derive P2(b), which states that once a value V is chosen for a number N, every proposal with a number M > N will carry the same value V:

  • Say a second proposer with number M begins just after the proposer of N completes its accept phase. Because both rely on majorities, at least one acceptor is common to the two. That acceptor holds the value V for number N.
  • This forces the second proposer to use the same value as the first. 

This is exactly what is proven inductively on page 3, last paragraph of the paper.
P2(a) is implied by P2(b), and P2 is implied by P2(a), which represents the overall safety guarantee.

Optimization

To optimize the algorithm, acceptors can choose to remember only the highest proposal number they have accepted. This works because every proposal after the first successful acceptance will carry the same value. Additionally, once an acceptor has responded in the prepare phase, it has already sent back the previously known value, so it no longer needs to remember that value; the responsibility of remembering it shifts to the proposer.

It is important to note that the goal of consensus is to agree on a single value, not on a single proposal number. Different proposers may cause different proposal numbers to be accepted by different acceptors (since only majorities are required). However, in the end, all acceptors will have the same value (i.e., the first chosen value) assigned to their highest known proposal numbers.

Liveness

The liveness of the Paxos algorithm can be compromised in scenarios involving racing proposers.

For example, if acceptors accept a proposal number N but, before receiving a value for N, accept a proposal number greater than N, then the first proposer can no longer gather a majority and its accept phase fails.

In such situations, the first proposer must retry with an even higher proposal number. However, if two or more proposers continuously race in this manner, each proposing higher numbers but never completing the accept phase, the system may never terminate.

To mitigate this issue, Lamport suggests using a single proposer in the system. This proposer can be elected through a consensus process (which is itself a form of value agreement). After the proposer is elected, it becomes responsible for all future proposals.

However:

  • The initial election of the proposer (a.k.a. leader election) may itself never terminate.
  • Furthermore, if the single proposer fails, the system may stall until that proposer recovers.

To address these scenarios, timeouts and randomness can be introduced to facilitate progress. However, these mechanisms may result in multiple proposers proposing values simultaneously, which can again hurt liveness. It is crucial to emphasize that, even though timeouts are introduced to fix liveness in practice, the safety of the Paxos algorithm is never compromised. That's the beauty of the algorithm.
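As one illustration of how randomness helps, a proposer can back off for a random, growing interval before retrying with a higher number, so that two competing proposers are unlikely to keep preempting each other forever. This is a hypothetical retry loop layered on the propose sketch above, not something prescribed by the paper; in real systems, proposal numbers are also made globally unique, e.g., by pairing a counter with a proposer id.

```python
import random
import time

def propose_with_backoff(acceptors, my_value, max_attempts=10):
    """Retry with ever-higher proposal numbers and randomized exponential backoff."""
    n = random.randint(1, 100)                    # illustrative starting proposal number
    for attempt in range(max_attempts):
        chosen = propose(acceptors, n, my_value)  # single-decree Paxos round from the sketch above
        if chosen is not None:
            return chosen
        # Sleep for a random interval so competing proposers de-synchronize.
        time.sleep(random.uniform(0, 0.01 * (2 ** attempt)))
        n += random.randint(1, 100)               # retry with a strictly higher number
    return None                                   # still racing; give up and report failure
```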

Fault Tolerance

The Paxos algorithm is designed to tolerate crash faults. Acceptors must persist the highest proposal number they have accepted and the associated value if the accept phase completes. Proposers must remember their highest proposal number and any values they receive during the prepare phase.

The algorithm's reliance on a majority of acceptors ensures that the failure of some nodes does not affect liveness. Additionally, because any node can become a proposer, and Paxos handles multiple proposers safely, the failure of a proposer does not affect the algorithm.

In conclusion, Paxos is fault-tolerant to crash faults, provided that a majority of nodes remain operational. If a majority of nodes fail, no consensus algorithm can guarantee agreement anyway.

Replicated State Machines

Having established that Paxos can facilitate agreement on a single value among distributed nodes, we can extend its application to construct a replicated state machine (RSM).

An RSM involves multiple machines executing an identical sequence of instructions. This architecture provides enhanced fault tolerance compared to a standalone state machine (a single computer). The instructions for an RSM are stored in a distributed log.

Distributed Log

A distributed log is a sequence of ordered instructions that an RSM executes. Each log entry represents an instruction for the machines, ensuring that all nodes in the RSM execute these instructions in the same order.

For example, the following diagram shows an RSM with a single variable X stored in it. The instructions for the state machines can be blind-writes (X := Y) or read-writes (IF v(X) = Y THEN X := Z):



Note that different nodes of an RSM may be at different positions in the execution log but will eventually converge to the same final state.
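A minimal sketch of this idea, assuming a state machine holding a single integer variable X and the two instruction types from the example above (the tuple encoding is my own): every replica that replays the same log deterministically ends up in the same state.

```python
def apply_log(log):
    """Replay a distributed log deterministically; every replica reaches the same state."""
    x = 0
    for instr in log:
        if instr[0] == "write":                   # blind write: X := Y
            x = instr[1]
        elif instr[0] == "cas":                   # read-write: IF v(X) = Y THEN X := Z
            expected, new = instr[1], instr[2]
            if x == expected:
                x = new
    return x

log = [("write", 5), ("cas", 5, 7), ("write", 3), ("cas", 9, 1)]
print(apply_log(log))                             # 3 — every replica replaying this log gets X = 3
```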

Generating Distributed Logs Using Paxos

Consensus algorithms, such as Paxos, can be employed to create distributed logs:

  • The log is segmented into distinct entries.
  • Each entry is treated as a value to be proposed and accepted via the consensus algorithm.
  • Once an entry is proposed and accepted, it becomes the definitive value for that log position.

It is crucial to reiterate that Paxos guarantees that once a value is accepted for a log entry, it remains unchanged, regardless of subsequent proposals for that entry. For a distributed log:

  • When a proposer intends to append an entry to the log, it selects the next available position in the log and executes the Paxos algorithm to establish a value for that entry.
  • If the proposer fails to insert its value into the chosen position, possibly due to a competing proposer successfully establishing its value first, the proposer retries with the subsequent available slot. This process continues until the proposer successfully adds an entry. If all attempts fail, the proposer may communicate this failure to the client.

This mechanism underscores the importance of Paxos's immutability property: once a value is accepted, it remains unchanged, regardless of future proposals for the same log position.
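Putting the pieces together, appending to the log amounts to running single-decree Paxos on successive positions until the proposer's own entry is chosen. Below is a rough sketch under the same assumptions as before; acceptors_for_slot (one acceptor group per log position) and the reuse of the earlier propose routine are illustrative choices, not the paper's API.

```python
def append_to_log(log, acceptors_for_slot, entry, max_slots=100):
    """Run Paxos on successive log positions until `entry` is chosen for one of them."""
    slot = len(log)                               # next position this proposer believes is free
    n = 1
    for _ in range(max_slots):
        chosen = propose(acceptors_for_slot(slot), n, entry)
        if chosen == entry:
            return slot                           # our entry now owns this log position
        if chosen is not None:
            log.append(chosen)                    # a competitor won this slot; learn its value and move on
            slot += 1
        else:
            n += 1                                # preempted; retry the same slot with a higher number
    return None                                   # give up and report the failure to the client
```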

Optimization

Executing both phases of the Paxos algorithm (prepare and accept) in real-time can introduce latency. To mitigate this, proposers can proactively execute the prepare phase to reserve log positions and subsequently propose values for them. This optimization is particularly effective when a single proposer reserves several slots in advance, significantly enhancing performance.

Paxos v/s Raft

It's a point of personal amusement to me that Raft is often perceived as more understandable than Paxos. I've consistently found Paxos to be significantly clearer, more logical, and concise in its presentation.

Raft was authored by Ongaro and Ousterhout from Stanford University and presented at USENIX ATC '14. It is a very impactful paper: many systems, such as CockroachDB, MongoDB, and Kafka, now use Raft or Raft-based protocols.

The primary distinction I observe between Paxos and Raft lies in their handling of proposers. In Paxos, concurrent proposers are permissible. While a single leader can be elected via timeouts for performance optimization, multiple proposers do not compromise safety.

Conversely, Raft designates a single leader responsible for log replication. While this approach is often cited as a simplification, I personally find it harder to reason about across all potential scenarios.

Nevertheless, here's a concise overview of Raft's operation:

Leader Election

  • Raft elects a single designated leader to manage the system.
  • In the event of leader failure, the remaining servers initiate an election to select a new leader.
  • This election process employs timeouts and a voting mechanism to ensure the selection of a unique leader.
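A tiny sketch of the randomized election timeout that makes split votes unlikely: each server draws a fresh random timeout, so usually one follower times out first, becomes a candidate, and wins before the others even start. The field names and constants here are illustrative, not taken from the Raft paper.

```python
import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class Server:
    id: int
    state: str = "follower"
    current_term: int = 0
    voted_for: Optional[int] = None
    election_timeout: float = 0.0

def next_election_timeout(low=0.150, high=0.300):
    """Each server draws a fresh random timeout (seconds) so that one usually fires first."""
    return random.uniform(low, high)

def maybe_start_election(server: Server, elapsed_since_heartbeat: float) -> bool:
    """If no heartbeat from the leader arrived in time, become a candidate for a new term."""
    if elapsed_since_heartbeat >= server.election_timeout:
        server.state = "candidate"
        server.current_term += 1
        server.voted_for = server.id              # vote for itself, then ask peers for votes
        server.election_timeout = next_election_timeout()
        return True                               # caller would now fan out RequestVote RPCs
    return False

s = Server(id=1, election_timeout=next_election_timeout())
print(maybe_start_election(s, elapsed_since_heartbeat=0.5))   # True: the timeout has elapsed
```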

Log Replication

  • The leader receives client requests and appends them as entries to its log.
  • The leader then replicates these log entries to the follower servers.
  • Upon receiving acknowledgments from a majority of followers for a log entry, the leader commits that entry.
  • Committed entries are subsequently applied to the system's state.
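The commit rule in the last two bullets can be sketched as follows: the leader advances its commit index to the highest log index that a majority of servers (itself included) have replicated. This is a simplification that ignores Raft's additional restriction that a leader only commits entries from its current term; the function and parameter names are my own.

```python
def leader_commit_index(match_index, leader_last_index):
    """Highest log index replicated on a majority of the cluster (leader included)."""
    cluster_size = len(match_index) + 1                           # followers plus the leader itself
    majority = cluster_size // 2 + 1
    for n in range(leader_last_index, 0, -1):                     # try the highest index first
        replicated = 1 + sum(1 for m in match_index if m >= n)    # the leader always holds index n
        if replicated >= majority:
            return n                                              # entries up to n are committed
    return 0

# Five servers: the leader plus four followers replicated up to these indexes.
print(leader_commit_index(match_index=[7, 5, 3, 2], leader_last_index=7))   # 5
```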

Paper Review

This paper was one of the earliest papers that I read. At that time, I didn't know the difference between consensus and quorum (yes, I was too naive). It was incredibly hard for me to grasp all the concepts and reason about everything described in the paper.

It may require several readings to grasp all the contents of the paper, but once you do, you will master one of the most important concepts in computer science. This paper is a must-read for all distributed systems enthusiasts.
