
Paper Insights #22 - A New Presumed Commit Optimization for Two Phase Commit

Lampson and Lomet's 1993 paper, from the now-defunct DEC Cambridge Research Lab, remains a classic.

Paper Link

The paper's concepts are hard to grasp. My notes below are detailed, yet they may require multiple readings to fully comprehend the reasoning. Let's begin by reviewing fundamental concepts of SQL databases.

Serializability

Transaction serializability guarantees that, while transactions may execute concurrently for performance reasons, the final outcome is effectively equivalent to some sequential execution of those same transactions. The "effectively" part means that the system ensures a consistent, serializable result even if the underlying execution is parallelized.

Strict serializability builds upon serializability by adding a temporal dimension. It mandates that once a transaction commits, its effects are immediately visible to all clients (a.k.a. external consistency). This differs from linearizability, which focuses on single-object operations, as strict serializability applies to multi-object transactions (e.g., updates across multiple rows).

Serializability is a transaction isolation level, one of several including read uncommitted, read committed, and repeatable read. This discussion focuses exclusively on serializable transactions. These transactions are implemented using either:

  • Locks - Prevents concurrent execution of transactions that modify the same object. This only works for standalone non-distributed SQL databases.
  • Timestamp Ordering - Implements a global ordering of transactions using timestamps from a source of truth. This approach can offer serializability, and indeed strict serializability; however, such a source of truth is difficult to obtain for most systems.
There are other ways to improve the methods above. For example, versioned objects can be used for multi-version concurrency control so that locks are not required. Again, this only works for standalone SQL databases. We will revisit serializability at the end of this article.
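
To make the lock-based approach concrete, here is a minimal sketch (in Python, with hypothetical names) of strict two-phase locking on a single node: a transaction acquires a lock on every item it touches and releases all locks only at commit, which is sufficient for a serializable outcome. Deadlock handling and rollback are ignored here; the transaction-log sections below cover rollback.

import threading

class LockManager:
    """Hypothetical per-item lock table for a single-node database."""
    def __init__(self):
        self._locks = {}                 # item -> threading.Lock
        self._table_lock = threading.Lock()

    def lock(self, item):
        with self._table_lock:
            lk = self._locks.setdefault(item, threading.Lock())
        lk.acquire()                     # blocks if another transaction holds the item

    def unlock(self, item):
        self._locks[item].release()

class Transaction:
    """Strict two-phase locking: acquire locks as you go, release everything at commit."""
    def __init__(self, store, lock_mgr):
        self.store, self.lock_mgr, self.held = store, lock_mgr, set()

    def read(self, item):
        self._acquire(item)
        return self.store.get(item)

    def write(self, item, value):
        self._acquire(item)
        self.store[item] = value

    def commit(self):
        for item in self.held:           # locks are released only after all work is done
            self.lock_mgr.unlock(item)
        self.held.clear()

    def _acquire(self, item):
        if item not in self.held:
            self.lock_mgr.lock(item)
            self.held.add(item)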

Single Data Transactions

Single-data transactions are those that modify only a single data item (e.g., a single row). These are offered by NoSQL databases, where multi-data transactions are not supported for performance reasons. The item being modified in a transaction resides on a set of replicas, and all replicas are expected to apply the same modification as a result of the transaction.

These single-item transactions are also commonly implemented by distributed lock services like ZooKeeper, using a distributed log built on consensus algorithms. These implementations also ensure linearizability.

In this article, we will focus on multi-data transactions. For more details on how to implement single-data transactions, refer to ZooKeeper: Wait-free coordination for Internet-scale systems.

Multi-Data Transactions

Multi-data transactions are those that modify multiple data items (e.g., multiple rows). The term transaction commonly refers to a multi-data transaction, which is a core feature of SQL databases.

Consider a simple banking scenario with accounts held by A and B. A transfers money to B.

The client code will be something like:

txn = Transaction.start();
balA = txn.Read(A);
balB = txn.Read(B);
txn.Write(A, balA - $50);
txn.Write(B, balB + $50);
txn.Commit();

The transaction needs to ensure:

  • Atomicity - The transfer either completes fully or not at all. 
  • Consistency - The total balance across all accounts remains constant after any transfer.

Implementation in Non-Distributed SQL Databases

In a non-distributed database, all data resides on a single server. B+ trees, indexed by primary keys with leaf nodes pointing to row data, are a common data structure used in these systems.

Transactions are typically implemented by acquiring locks on the data items that are read/modified. The data items are then updated, and finally, the transaction commits. However, if a transaction fails before reaching the commit stage, a mechanism is required to revert any changes made. This is where transaction logs provide the necessary means to handle pre-commit failures.

Transaction Logs

Transaction logs are essential for database recovery and ensure data consistency. Two primary types exist:

  • UNDO Logs: These logs are written before data modifications. They capture the original state of the data, allowing for rollback operations in case of failures.
  • REDO Logs: These logs are written after data modifications. They record the changes made to the data, enabling the system to reapply these changes during recovery.

UNDO logs contain the following types of records:

  • [start, TID]: Indicates the beginning of a transaction with identifier TID.
  • [write, TID, X, old_value]: Records the modification of data item X by transaction TID. old_value stores the original value of X before the modification.
  • [commit, TID]: Signals the successful completion of transaction TID.
  • [abort, TID]: Indicates that transaction TID has been aborted.

If a transaction fails (i.e., lacks a commit or abort entry in the UNDO log), the rollback system:

  1. Iterates through the log entries for the failed transaction.
  2. For each write entry, restores the original data value (old_value) to the corresponding data item.
  3. Adds an abort entry to the UNDO log for the failed transaction, marking its termination.

This process ensures that the database remains in a consistent state even after unexpected failures. Failed transactions can then be retried by the client.
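
As a rough illustration of this rollback procedure, here is a minimal sketch in Python (the record format and function name are mine, mirroring the entries listed above):

# UNDO-log records, mirroring the entries above:
#   ("start", tid), ("write", tid, item, old_value), ("commit", tid), ("abort", tid)

def rollback_incomplete(log, data):
    """Roll back every transaction that has neither a commit nor an abort record."""
    finished = {rec[1] for rec in log if rec[0] in ("commit", "abort")}

    # Scan backwards, restoring the old value of each write by an unfinished transaction.
    for rec in reversed(log):
        if rec[0] == "write" and rec[1] not in finished:
            _, tid, item, old_value = rec
            data[item] = old_value

    # Mark each rolled-back transaction as aborted so that recovery is idempotent.
    for tid in [r[1] for r in log if r[0] == "start" and r[1] not in finished]:
        log.append(("abort", tid))

# Example: a crash left transaction 100024 without a commit/abort record.
log = [("start", 100024), ("write", 100024, "A", 100), ("write", 100024, "B", 150)]
data = {"A": 50, "B": 200}
rollback_incomplete(log, data)    # data is restored to {"A": 100, "B": 150}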

In our bank example above, assuming A had a balance of $100 and B had a balance of $150, the following would be the entries in the UNDO log for a committed transaction:

start 100024
write 100024 A $100
write 100024 B $150
commit 100024

It's crucial to understand that each entry in the UNDO log directly corresponds to a modification performed by the client within the scope of a single transaction.

Implementation in Distributed SQL Databases

Now, consider a system where data is distributed across multiple nodes. In our bank example, account balance A might reside on one server, while account balance B might reside on another. When a transaction involves modifying data on multiple nodes (like transferring funds between accounts A and B), it must be atomic.

The actions performed by the client within the scope of the transaction are sent to each node independently. The start and commit messages are sent to all the nodes. We can repeat the same process of having UNDO logs:

Node 1

start 100024
write 100024 A $100
commit 100024

Node 2

start 100024
write 100024 B $150
commit 100024

But, if the client crashes after committing a transaction to node 1 but before committing to node 2, we would be left in an inconsistent state.

Node 1

start 100024
write 100024 A $100
commit 100024

Node 2

start 100024
write 100024 B $150

Since node 1 has already committed the transaction, it will not roll back the changes made locally, while node 2 has no record of the commit.

This is exactly where we need Two-Phase Commit (2PC). We want all nodes to either commit or abort the entire transaction.

Consensus v/s 2PC

Consensus algorithms, such as Paxos, ensure that all replicas of a state machine execute the exact same sequence of actions or apply the same set of instructions. All replicas are equivalent. This guarantees consistent state across all replicas.

In contrast, 2PC focuses on ensuring that all participants in a distributed transaction take the same action atomically. This means all participants either commit or abort the transaction together. Each participant is entirely a different entity.

Note that, in consensus, there is a concept of a majority of replicas being caught up to the head. However, there is no concept of a majority decision in 2PC: all participants must commit or abort the transaction together.

Example: Hotel Booking

The example below is taken from Predrag's Blog. I would recommend reading How Paxos and Two-Phase Commit Differ.

Consider a hotel booking scenario. The booking process involves multiple steps:

  1. Deducting money from the user's bank account.
  2. Reserving the hotel room.
  3. Sending a confirmation to the user.

These steps must be executed atomically. If any step fails, the entire transaction should be rolled back.

In this example:

  • The participants include the booking website (coordinator and cohort), the bank (cohort), and the hotel (cohort).
  • Each participant can be implemented as a highly available replicated state machine (RSM) using algorithms like Paxos. This ensures that each participant itself can tolerate failures and remain operational.
  • 2PC ensures that all participants either successfully complete all steps or none of them.

Key Differences Summarized:

  • Consensus: Focuses on consistent state across replicas of a single state machine.
  • 2PC: Focuses on atomic actions across multiple participants in a distributed transaction.

The Two Phase Commit Protocol

The primary objective of 2PC is to ensure that all participants involved in a distributed transaction reach a unanimous consensus on whether to commit or abort the transaction.

Once all transaction instructions have been logged (written to transaction logs) and applied to the data structures on all nodes, the 2PC protocol determines whether the transaction should be committed or aborted.

The Protocol

The two-phase commit (2PC) protocol proceeds as follows:

Phase 1: Prepare

A designated node (often the client) acts as the coordinator. It broadcasts PREPARE messages to all participating nodes (cohorts). At this stage, each cohort can still decide whether to commit or abort. If locks are used, a cohort will typically vote to commit. However, under multi-version concurrency control, for example, a cohort might abort if the version numbers it read no longer match the latest version numbers after 2PC began.

If a cohort decides to commit, it sends a COMMIT-VOTE to the coordinator. Otherwise, it sends an ABORT-VOTE. Crucially, once a cohort sends a COMMIT-VOTE, it must commit the transaction in the next phase. To guarantee this, cohorts must persistently record their commitment. Common strategies include:

  • Writing the commitment to storage, such as adding a "pre-commit T" entry to the transaction log, indicating the cohort's commitment to transaction T.
  • Using a Paxos group of replicas to act as a single, highly available cohort. The pre-commit message is logged in the distributed log, ensuring all replicas in the group are aware of the commitment.

This paper focuses on individual nodes with local storage as cohort members, therefore we will only consider the first strategy (logging the commitment to local storage). The second strategy (using Paxos groups) is implemented by Spanner.

Recommended Read: Spanner: Google's Globally-Distributed Database

Phase 2: Commit/Abort

If any cohort voted to abort, the coordinator broadcasts an ABORT message to all cohorts. If all cohorts voted to commit, the coordinator broadcasts a COMMIT message. Upon receiving a COMMIT message, each cohort must commit the transaction. After this, all locks can be released.

Cohorts then acknowledge the commit/abort message with an ACK.

The prepare phase can be viewed as a pre-commit phase, adding an extra step for all cohort members to agree on the transaction's final outcome.
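
The following is a minimal, single-process sketch of the protocol described above (basic, PrN-style; class and function names are mine). It models only the vote collection and decision logic, not real networking or crash recovery:

class Cohort:
    """Hypothetical cohort: votes on PREPARE and applies the final decision."""
    def __init__(self, name, will_commit=True):
        self.name, self.will_commit, self.log = name, will_commit, []

    def prepare(self, tid):
        if not self.will_commit:
            return "ABORT-VOTE"
        self.log.append(("pre-commit", tid))   # forced: the cohort is now bound to commit
        return "COMMIT-VOTE"

    def decide(self, tid, decision):
        self.log.append((decision, tid))       # forced in the basic protocol
        return "ACK"

def two_phase_commit(coordinator_log, cohorts, tid):
    # Phase 1: PREPARE - collect a vote from every cohort.
    votes = [c.prepare(tid) for c in cohorts]

    # Phase 2: COMMIT only if all cohorts voted to commit; otherwise ABORT.
    decision = "COMMIT" if all(v == "COMMIT-VOTE" for v in votes) else "ABORT"
    coordinator_log.append((decision, tid))    # forced write of the final decision

    acks = [c.decide(tid, decision) for c in cohorts]
    if all(a == "ACK" for a in acks):
        coordinator_log.append(("end", tid))   # unforced end record
    return decision

# One cohort refusing to commit forces a global abort.
log = []
print(two_phase_commit(log, [Cohort("node1"), Cohort("node2", will_commit=False)], 100024))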

Under the Hood of 2PC

Let's examine the details of the two-phase commit (2PC) protocol, focusing on network communication and storage writes. Storage writes can be forced or unforced. Forced writes are expensive because they are not merely buffered by the operating system or the disk's write cache; in exchange, they are guaranteed to be durable. Unforced writes, on the other hand, may still be lost after a crash.
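
As a rough sketch of the difference, a forced append can be modeled as a write followed by a flush and an fsync, while an unforced append leaves the data in buffers (the file name and helpers below are illustrative):

import os

def unforced_append(f, record):
    # Buffered: may sit in user-space or OS buffers and be lost on a crash.
    f.write(record + "\n")

def forced_append(f, record):
    # Forced: flush user buffers, then ask the OS to push the data to stable storage.
    f.write(record + "\n")
    f.flush()
    os.fsync(f.fileno())   # expensive, but the record is durable once this returns

with open("txn.log", "a") as f:
    forced_append(f, "commit 100024")   # e.g., a decision that must survive a crash
    unforced_append(f, "end 100024")    # e.g., a record that is only an optimization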

Cohort Responsibilities

Before replying to a PREPARE message: A cohort must force-log its COMMIT-VOTE. This ensures that even if the cohort crashes, it remembers its vote.

Before replying to a COMMIT or ABORT message: A cohort must force-log the fact that it has COMMITTED or ABORTED. This guarantees that the cohort knows the final outcome, even after a crash. These logs are crucial. They prevent cohorts from repeatedly querying the coordinator about the transaction's status. If the coordinator received the cohort's reply, it's guaranteed never to receive another query about that transaction from that cohort.

If a cohort crashes before logging the COMMIT or ABORT message, upon recovery, it must query the coordinator for the final outcome. Since the cohort previously voted to commit, if the coordinator confirms a COMMIT, the cohort proceeds with the commit (it is the cohort's responsibility to honor its commitment). The coordinator, having not received an ACK reply, retains the transaction's final status until all cohorts have acknowledged it.

The cohort doesn't need to keep anything in memory. It just needs to force-log the COMMIT-VOTE and the COMMIT/ABORT outcome.

Coordinator Responsibilities

The coordinator, unlike a cohort, requires a combination of in-memory and persistent storage for its data.

Protocol Database

The coordinator maintains an in-memory database (a.k.a. a protocol database) with the following information for each transaction:

  • TID
  • stable: A boolean indicating whether the transaction's existence has been persistently logged.
  • state: The transaction's current state (initiated, preparing, aborted, or committed).
  • per-cohort information: For each cohort, its ID and vote.

The per-cohort information can be deleted after the coordinator receives the ACK for the second phase message. The entire transaction record can be removed from memory once acknowledgements from all cohorts are received.
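
A minimal sketch of such a protocol-database entry (field names are mine, mirroring the list above):

from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class CohortInfo:
    cohort_id: str
    vote: Optional[str] = None       # "COMMIT-VOTE", "ABORT-VOTE", or None until heard from

@dataclass
class TxnRecord:
    tid: int
    stable: bool = False             # has the transaction been persistently logged?
    state: str = "initiated"         # initiated | preparing | committed | aborted
    cohorts: Dict[str, CohortInfo] = field(default_factory=dict)

# The coordinator's in-memory protocol database, keyed by transaction ID.
protocol_db: Dict[int, TxnRecord] = {}

Per-cohort entries are dropped as second-phase ACKs arrive, and the whole record is removed once every cohort has acknowledged.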

Storage Logs

The coordinator's log only needs to force record the final transaction state (COMMIT or ABORT). It doesn't need to retain other details. If the coordinator crashes and a cohort inquires about a transaction, the coordinator replies with the logged final state.

If the coordinator crashes before the end of the PREPARE phase (i.e., before the final state is logged), it can safely assume an ABORT and respond accordingly to any cohort queries.

Finally, the coordinator should have an unforced "end" record for each transaction. This is a performance optimization to prevent the coordinator from restarting already-completed transactions; it doesn't affect correctness.

Summary

Coordinator:

  • Force-log the COMMIT or ABORT decision before sending the corresponding message.
  • Maintain an unforced transaction end record.

Cohort:

  • Force-log the COMMIT-VOTE before replying to the PREPARE message.
  • Force-log the COMMIT or ABORT action before replying to the COMMIT or ABORT message.

Applications of 2PC

2PC is a critical protocol for ensuring atomicity in distributed systems, meaning that all participating systems either complete a transaction or none do. While commonly associated with databases, 2PC has applications in various scenarios where atomic actions across multiple systems are essential. Here are some key examples:

  • Exactly-once message delivery: 2PC guarantees exactly-once delivery of messages, preventing duplicates or lost messages in distributed environments. The messages are delivered, processed, and acked in a single transaction.
  • Online booking systems: In systems like flight or hotel reservations, 2PC ensures that all parts of a booking process (e.g., seat reservation, payment processing, confirmation) are completed together or not at all.
  • E-commerce transactions: When you make a purchase online, 2PC ensures that the order, payment, and inventory updates are all synchronized.
  • Distributed caching: 2PC can be used to maintain consistency across distributed caches, ensuring that all cache nodes have the same data.
  • Workflow management: In complex workflows involving multiple systems, 2PC can coordinate the execution of tasks and ensure that all steps are completed successfully.

2PC Variants

The 2PC variant we've been discussing is known as Presumed Nothing (PrN). Other variants exist, including Presumed Abort (PrA), Presumed Commit (PrC), and New Presumed Commit (NPrC), which is the focus of this paper. It's important to understand that these variants are performance optimizations of the core 2PC protocol; they don't change the fundamental logic. However, because 2PC is used in so many distributed systems, even small performance gains can have a significant impact on overall system performance.

Optimization Targets

The goal of all these variants is to minimize the number of forced writes to disk and the number of messages exchanged. The primary targets for optimization are:

  • Coordinator: The forced log of the ABORT record.
  • Cohort: The forced log of the COMMIT or ABORT action and the reply sent after committing or aborting.

It's crucial to note that certain operations cannot be optimized:

  • The forced log of the COMMIT-VOTE is essential and cannot be avoided; cohorts must log their vote before sending it.
  • The reply to the PREPARE message is also necessary; cohorts must respond.
  • Finally, the forced log of the COMMIT decision is essential; the coordinator must log committed transactions.

Presumed Abort (PrA)

In the PrA variant, the coordinator takes a more optimistic approach with regard to aborted transactions.  Instead of force-logging an ABORT decision, the coordinator simply removes the transaction's entry from its in-memory database. The default assumption is that a transaction has been aborted unless proven otherwise. If the coordinator crashes and receives a query about a transaction that's no longer in its in-memory database and not in its COMMIT logs, it assumes the transaction was aborted and responds accordingly.

Furthermore, cohorts do not need to force-log the ABORT action or acknowledge an ABORT message. Since the coordinator doesn't maintain any persistent state for aborted transactions, there's no action required upon receiving an ACK. Consequently, cohorts don't even need to send an ABORT reply.
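
A sketch of how a PrA coordinator answers a cohort's status query (treating the protocol database and the COMMIT log as simple lookups, which is my simplification): anything not found anywhere is presumed aborted.

def pra_query_outcome(tid, protocol_db, committed_log):
    """PrA: absence of any information about a transaction means it aborted."""
    if tid in committed_log:     # forced COMMIT record survives crashes
        return "COMMIT"
    if tid in protocol_db:       # transaction still being resolved in memory
        return protocol_db[tid]  # current state, e.g., "preparing"
    return "ABORT"               # presumed abort: nothing is known about it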

Failure Recovery

Note:

  • The coordinator force-logs COMMIT decisions.
  • The cohorts force-log the COMMIT action and ACK COMMIT messages.

Coordinator Failure:

  • Before COMMIT is logged: The transaction is discarded. Any cohort querying the transaction's status is told to abort.
  • Before all ACKs are received: The transaction's status is COMMIT if and only if the COMMIT message was successfully logged before the crash; otherwise, the status is ABORT.

Cohort Failure:

  • Before receiving the decision message: The cohort queries the coordinator for the decision.
  • After aborting locally (since ABORT logs are not forced): The cohort queries the coordinator for the decision and will idempotently receive ABORT again.

Presumed Commit (PrC)

In the PrC variant, the coordinator optimistically assumes that transactions will commit. The default assumption is that a transaction has committed.

However, PrC requires an extra log write: the coordinator force-logs both the PREPARE and the COMMIT messages.

Furthermore, cohorts do not need to force-log the COMMIT action or acknowledge a COMMIT message.

Failure Recovery

Notes:

  • The coordinator force-logs PREPARE and COMMIT messages.
  • The cohort force-logs the ABORT action and ACKs ABORT messages.

Coordinator Failure

  • Before PREPARE is logged: The transaction is discarded.
  • Before COMMIT is logged: The coordinator requests votes again from the cohorts.
  • Before ABORT message is sent: The coordinator requests votes again.
  • Before ACKs on ABORT: The coordinator requests ACKs again.

Cohort Failure

  • Before receiving the decision message: The cohort queries the coordinator.
  • After committing locally (since COMMIT logs are not forced): The cohort simply queries the coordinator again and will idempotently receive COMMIT again.

Note: PrC does not mean transactions always commit. A transaction can still abort based on the cohorts' votes.
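
By symmetry, a sketch of how a PrC coordinator would answer the same status query (again a simplification with hypothetical lookups):

def prc_query_outcome(tid, prepared_log, committed_log):
    """PrC: absence of any information about a transaction means it committed."""
    if tid in committed_log:     # forced COMMIT record
        return "COMMIT"
    if tid in prepared_log:      # forced PREPARE record but no decision yet:
        return "IN-DOUBT"        # the coordinator re-collects votes for this transaction
    return "COMMIT"              # presumed commit: nothing was ever logged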

ACKs and Garbage Collection

Our focus here is on garbage collection (GC) of transaction data on the coordinator side, not on the cohorts.

Presumed Abort (PrA):

The final ACK for a COMMIT message triggers the removal of the COMMIT decision from persistent storage and the removal of the committed transaction from the protocol database.

ABORT decisions are never logged. Aborted transactions are removed from the protocol database as soon as any cohort sends an ABORT-VOTE.

Presumed Commit (PrC):

The final ACK for an ABORT message triggers the removal of the aborted transaction from the protocol database and the removal of the PREPARE message from persistent storage. The PREPARE message can also be removed after the COMMIT message is logged.

Upon receiving all ACKs for COMMIT, the COMMIT message can be removed, and the committed transaction can be removed from the protocol database.

A key difference is that in PrA, ACKs for ABORT are neither required for correctness nor for garbage collection. In PrC, ACKs for COMMIT are not required for correctness, but they are necessary for garbage collection.

Post-Recovery Behavior and GC

PrA: If an ACK for COMMIT is not received, the coordinator will still have the COMMIT log and will query the cohorts for the transaction's outcome. ACKs for ABORT are never required. After recovery, aborted transactions are simply forgotten.

PrC: If an ACK for ABORT is not received, the coordinator will still have the PREPARE message. It will query the cohorts for the decision. Upon receiving confirmation of the abort, it will resend the ABORT message and collect ACKs again. If an ACK for COMMIT is not received, the coordinator will have the COMMIT message and will resend the COMMIT message and collect ACKs.

In both cases (PrA and PrC), eventual garbage collection of entries in both the protocol database and persistent storage on the coordinator is guaranteed.

Summary

PrA and PrC optimize storage writes, as summarized in the table below (note again that cohorts always force-log their COMMIT-VOTE and reply to PREPARE, and the coordinator always force-logs the COMMIT decision):


Action                   PrN    PrA    PrC    NPrC

Coordinator
  PREPARE Log            No     No     Yes    No
  ABORT Log              Yes    No     No     No

Cohort
  COMMIT Log + ACK       Yes    Yes    No     No
  ABORT Log + ACK        Yes    No     Yes    Yes

At first glance, PrC might seem superior. If we assume that most transactions commit successfully, avoiding ACKs for commits appears advantageous. However, the assumption that most transactions commit is not always valid. Furthermore, PrC introduces an additional forced log write for the PREPARE message. The paper argues that PrA is often a better choice, and this summary highlights why.

New Presumed Commit (NPrC)

NPrC is a further optimization of PrC designed to reduce logging overhead, especially in the optimistic scenario where crashes are infrequent. It achieves this without affecting the correctness of the 2PC protocol.

Data Structures

  • tidl: A lower bound on the IDs of transactions whose outcome may not yet be recorded on storage. Transactions with IDs lower than tidl are presumed committed.
  • tidh: An upper bound on the transaction IDs allocated so far. Transactions with IDs higher than tidh are new.
  • REC: The set of recent transactions.
REC = {tid | tidl < tid < tidh}

  • COM: The subset of REC containing transactions that are known to be committed. Committed transactions have forced COMMIT logs, as in PrC.
  • IN: The set of transactions for which nothing has been written to storage. If a crash occurs, all transactions in IN are presumed aborted.
IN = REC - COM
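
A minimal sketch of this bookkeeping (the class name and range-based representation are mine):

class NPrCCoordinator:
    def __init__(self, tidl, tidh, committed):
        self.tidl = tidl             # IDs below this are presumed committed
        self.tidh = tidh             # IDs at or above this belong to new transactions
        self.com = set(committed)    # recent transactions with a forced COMMIT record

    def rec(self):
        """REC = {tid | tidl < tid < tidh}: the recent transactions."""
        return set(range(self.tidl + 1, self.tidh))

    def in_doubt(self):
        """IN = REC - COM: recent transactions with nothing on stable storage."""
        return self.rec() - self.com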

Logging tidh

Logged periodically (and can be forced, as it's infrequent) as new transactions with higher IDs arrive.

Logging tidl

Logged when the transaction with the lowest ID in REC is committed. The actual logged value might be higher than the current tidl because other transactions in the range might have already committed.

  • The cost of logging tidl can be amortized with the cost of logging the COMMIT for the transaction itself.
  • If the transaction is ABORTed, tidl can be logged unforced after all ACKs are received. Since all ACKs are received, it means no one will ever query about the transaction again, so tidl can safely be advanced.

Crash Recovery

tidh is set to a very high value to avoid overlap with previous transaction IDs. IN is rebuilt using the known values of tidl, tidh, and the COM set. After this, recovery proceeds as follows, depending on the point at which the coordinator crashed:

  • Before tidl is advanced: COM is used to determine if a record was actually committed.
  • After tidl is advanced: The transaction is presumed committed.
  • Before tidl is advanced past the ID of an aborted transaction: The transaction becomes part of IN upon recovery and is thus presumed as aborted.
  • After tidl is advanced past the ID of an aborted transaction: All ACKs for the abort would have been received, and no one should query about the transaction again.
The paper discusses optimizations for persisting IN and quickly determining IN after recovery to minimize the chance of incorrectly presuming a transaction to be aborted. However, these optimizations are less relevant for modern systems with ample RAM. (See sections 4.2.3 and 4.3 of the paper for details).
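
Putting this together, here is a sketch of how a recovered coordinator could answer a cohort's query about a transaction, reusing the structures from the sketch above:

def nprc_query_outcome(coord, tid):
    """After recovery: transactions in IN are presumed aborted, everything else committed."""
    if tid in coord.in_doubt():
        return "ABORT"     # in IN: no PREPARE or COMMIT record survived the crash
    return "COMMIT"        # below tidl or in COM: presumed (or known) committed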

Garbage Collection

Essentially, transactions between tidl and tidh are considered aborted unless they are within the COM set.

Recall that:
  • In PrA, abort decisions are not logged. If the coordinator fails before receiving votes, the transaction is considered aborted, and no garbage is generated.
  • In PrC, if the coordinator fails before receiving votes, it has logged PREPARE messages for each transaction, allowing it to re-collect votes after recovery. No garbage is generated.
However, in NPrC, after recovery, only the IN range (transactions to be aborted) is available. Crucially, no PREPARE or COMMIT messages exist for them. This absence of information creates permanent garbage. The coordinator cannot determine the involved cohorts, so it must retain this aborted transaction range to respond with ABORT to future queries for transactions within that range.

NPrC v/s PrC

After a crash, an NPrC coordinator might have more work to do than a PrC coordinator because it has less persistent information. However, crashes are assumed to be rare, so this extra work during recovery is generally not a significant performance bottleneck.

Read-Only Transactions

Serializable queries require read-only transactions (also known as snapshot transactions). These transactions acquire locks, but their outcome (commit or abort) is irrelevant to the cohort.

Cohorts respond to a PREPARE message with a READ-ONLY-VOTE. The coordinator, upon receiving only READ-ONLY-VOTEs from all cohorts, recognizes the transaction as read-only and omits sending COMMIT/ABORT messages.

In PrC, the coordinator doesn't need to log a COMMIT/ABORT message for a read-only transaction, but it must delete (unforced) the logged PREPARE message. NPrC offers further optimization: the PREPARE message is never logged, and the transaction is simply forgotten after all cohorts respond with READ-ONLY-VOTE. Crucially, no further communication about the transaction result occurs.
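
Continuing the earlier 2PC sketch, the coordinator-side handling might look like the following (only the all-read-only case described above is modeled):

def finish_after_votes(coordinator_log, cohorts, tid, votes):
    # All cohorts replied READ-ONLY-VOTE: nothing to commit or abort, nothing to
    # log, and no second-phase messages are sent - the transaction is simply forgotten.
    if all(v == "READ-ONLY-VOTE" for v in votes):
        return "READ-ONLY"

    decision = "COMMIT" if all(v != "ABORT-VOTE" for v in votes) else "ABORT"
    coordinator_log.append((decision, tid))    # forced, as in the basic protocol
    for cohort in cohorts:
        cohort.decide(tid, decision)
    return decision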

Serializability in 2PC

This would probably be the most important section of the article.

The basic 2PC protocol, in its various forms (PrN, PrA, PrC, and NPrC), can violate both serializability and strict serializability. Here's why:

Violation in Read-Only Transactions

The most straightforward violation occurs with read-only transactions. Participants in a read-only transaction release their locks after sending their READ-ONLY-VOTE. This can break serializability because subsequent transactions might modify the data before the read-only transaction's effective commit point.

Essentially, the read-only transaction might read a snapshot of the data that's no longer consistent with the actual position of the transaction in the order.

Violation in Read-Write Transactions

Serializability requires some total order of all transactions. If a transaction T1 comes before another T2 in this order, T2 must see all of T1's effects. Conversely, if T2 comes before T1, T2 must not see any of T1's effects. 2PC can violate this.

Consider the bank database with accounts A, B, and C on nodes 1, 2, and 3. A has $100, B has $150, and C has $0. Two transactions, T1 (transfer $50 from A to B) and T2 (transfer $200 from B to C), are submitted, with T2 following T1.

  1. Transaction T1 is initiated.
  2. Nodes 1 and 2 accept T1 and acknowledge the PREPARE message.
  3. Node 2 commits T1 and releases its locks before Node 1 commits.
  4. Transaction T2 is initiated.
  5. Nodes 2 and 3 accept T2. Critically, Node 2 now reflects the updated balance of B (due to the partial commit of T1).
  6. T2 commits on both Nodes 2 and 3.

The problem is that T1 is still uncommitted on Node 1 (and hence only partially committed) when T2 reads the updated balance of B on Node 2. This effectively gives T2 a read-uncommitted view: T2 sees the effects of T1 before T1 is fully committed, violating serializability.

In essence, 2PC, in its basic form, doesn't prevent transactions from observing intermediate, inconsistent states of other transactions, thus failing to guarantee serializability and therefore strict serializability.

Solutions

Two primary approaches address the serializability concern:

  1. Coordinator Timestamp: The coordinator assigns commit timestamps. This approach requires that the coordinator never change; otherwise, timestamps may go out of order!

  2. Commit Timestamp: Each COMMIT-VOTE is assigned a timestamp range. Locks are released when the upper bound of the COMMIT-VOTE timestamp range is reached. This eliminates the need for the coordinator to track anything. The cohort's COMMIT-VOTE specifies a permissible timestamp range, ensuring no conflicting transactions within that range. A new transaction receives a timestamp range higher than those of all existing transactions.

(2) is generally used, as it allows any client to be the coordinator. NPrC is also compatible with this approach. It is interesting to note that this can also provide strict serializability, since all transactions receive globally consistent, ordered timestamps. However, it requires near-perfect clock synchronization across all the cohorts; this seemingly impossible requirement is what Spanner achieves.
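
As a rough model of approach (2): each cohort's COMMIT-VOTE carries a permissible timestamp range, and the coordinator picks the commit timestamp from the intersection of all the ranges. The structure below is my illustration, not the paper's exact scheme.

def choose_commit_timestamp(vote_ranges):
    """Each COMMIT-VOTE carries (low, high); the commit timestamp must lie in every range."""
    low = max(lo for lo, _ in vote_ranges)
    high = min(hi for _, hi in vote_ranges)
    if low > high:
        return None        # the ranges do not intersect: the transaction must abort
    return low             # any value in [low, high] works; pick the earliest

# Example: three cohorts propose overlapping ranges; 12 is a valid commit timestamp.
print(choose_commit_timestamp([(10, 20), (12, 30), (8, 18)]))   # -> 12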

Bonus - Three-Phase Commits (3PC)

2PC has significant vulnerabilities. It's not resilient to coordinator failures, cohort failures, or network partitions. Any of these events can halt 2PC progress.

A key concern arises after the coordinator has already logged a COMMIT or ABORT decision. While the coordinator must communicate the outcome to the client as soon as possible, it faces a blocking problem. Before it can respond, it needs acknowledgements from all cohorts:

  • PrN and PrA: ACKs from all cohorts for the commit request.
  • PrC: ACKs from all cohorts for the abort request.

Indefinite Blocking

If a cohort crashes before the COMMIT phase, 2PC cannot make progress until that cohort restarts. The coordinator can't finalize the transaction because it's waiting for confirmation from the unavailable cohort. The protocol is blocked indefinitely.

Three-Phase Commit (3PC)

3PC addresses this blocking issue by introducing a PRE-COMMIT phase between PREPARE and COMMIT.

  • If the coordinator crashes before receiving PRE-COMMIT acks, the decision is always ABORT.
  • If the coordinator has all PRE-COMMIT acks, the decision is always COMMIT.

This crucial change means that once the coordinator logs a COMMIT decision, it can safely commit regardless of cohort availability. Any cohort that misses the COMMIT message will eventually receive it.

The coordinator is no longer dependent on cohorts after logging the COMMIT decision, so it can immediately inform the client. Before logging the COMMIT decision, the decision is always ABORT. Therefore, cohort crashes no longer block the coordinator from responding to the client.
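
A tiny sketch of the coordinator's decision rule under 3PC as described above (timeouts and cohort-side recovery are omitted):

def three_phase_decision(votes, precommit_acks, num_cohorts):
    """3PC: COMMIT may be decided only once every PRE-COMMIT has been acknowledged."""
    if any(v == "ABORT-VOTE" for v in votes):
        return "ABORT"
    if len(precommit_acks) < num_cohorts:
        return "ABORT"      # crash before all PRE-COMMIT acks: the decision is always abort
    return "COMMIT"         # all acks in hand: safe to commit and reply to the client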

Of course, 3PC adds extra round trips. It's often preferable to use 2PC with highly resilient cohort members (like Paxos groups) where possible, mitigating the risk of cohort failures.

Paper Review

I personally found this paper very difficult to read. It felt cramped at times, and I even suspected some errors. However, everything in the paper is correct. The challenge lies in understanding all aspects, even though the underlying concepts seem simple. Despite this, the paper serves as a valuable first step in understanding distributed SQL databases. I highly recommend it.
