Lampson and Lomet's 1993 paper, from the now-defunct DEC Cambridge Research Lab, remains a classic.
The paper's concepts are hard to grasp. My notes below are fairly detailed, yet they may still require multiple readings to fully follow the reasoning. Let's begin by reviewing fundamental concepts of SQL databases.
Serializability
Transaction serializability guarantees that, while transactions may execute concurrently for performance reasons, the final outcome is effectively equivalent to some sequential execution of those same transactions. The "effectively" part means that the system ensures a consistent, serializable result even if the underlying execution is parallelized.
Strict serializability builds upon serializability by adding a temporal dimension. It mandates that once a transaction commits, its effects are immediately visible to all clients (a.k.a. external consistency). This differs from linearizability, which focuses on single-object operations, as strict serializability applies to multi-object transactions (e.g., updates across multiple rows).
Serializability is a transaction isolation level, one of several including read uncommitted, read committed, and repeatable read. This discussion focuses exclusively on serializable transactions. These transactions are implemented using either:
- Locks - Prevent concurrent execution of transactions that modify the same object. This only works for standalone, non-distributed SQL databases.
- Timestamp Ordering - Imposes a global ordering of transactions based on timestamps from a single source of truth. This approach can offer serializability, and indeed strict serializability; however, such a source of truth is difficult to obtain for most systems.
Single Data Transactions
Single-data transactions are those that modify only a single data item (e.g., a single row). These are what NoSQL databases offer, since multi-data transactions are not supported for performance reasons. The item being modified in a transaction resides on a set of replicas, and all replicas are expected to apply the same modification as a result of the transaction.
These single-item transactions are also commonly implemented by distributed lock services like ZooKeeper, using a distributed log built on consensus algorithms. These implementations also ensure linearizability.
In this article, we will focus on multi-data transactions. For more details on how to implement single-data transactions, refer to ZooKeeper: Wait-free coordination for Internet-scale systems.
Multi-Data Transactions
Multi-data transactions are those that modify multiple data items (e.g., multiple rows). The term transaction commonly refers to a multi-item transaction, which is a core feature of SQL databases.
Consider a simple banking scenario with accounts held by A and B. A transfers money to B.
The client code will be something like:
txn = Transaction.start();
balA = txn.Read(A);
balB = txn.Read(B);
txn.Write(A, balA - $50);
txn.Write(B, balB + $50);
txn.Commit();
The transaction needs to ensure:
- Atomicity - The transfer either completes fully or not at all.
- Consistency - The total balance across all accounts remains constant after any transfer.
Implementation in Non-Distributed SQL Databases
In a non-distributed database, all data resides on a single server. B+ trees, indexed by primary keys with leaf nodes pointing to row data, are a common data structure used in these systems.
Transactions are typically implemented by acquiring locks on the data items that are read/modified. The data items are then updated, and finally, the transaction commits. However, if a transaction fails before reaching the commit stage, a mechanism is required to revert any changes made. This is where transaction logs provide the necessary means to handle pre-commit failures.
Transaction Logs
Transaction logs are essential for database recovery and ensure data consistency. Two primary types exist:
- UNDO Logs: These logs are written before data modifications. They capture the original state of the data, allowing for rollback operations in case of failures.
- REDO Logs: These logs are written after data modifications. They record the changes made to the data, enabling the system to reapply these changes during recovery.
UNDO logs contain the following types of records:
- [start, TID]: Indicates the beginning of a transaction with identifier TID.
- [write, TID, X, old_value]: Records the modification of data item X by transaction TID. old_value stores the original value of X before the modification.
- [commit, TID]: Signals the successful completion of transaction TID.
- [abort, TID]: Indicates that transaction TID has been aborted.
If a transaction fails (i.e., lacks a commit or abort entry in the UNDO log), the rollback system:
- Iterates through the log entries for the failed transaction.
- For each write entry, restores the original data value (old_value) to the corresponding data item.
- Adds an abort entry to the UNDO log for the failed transaction, marking its termination.
This process ensures that the database remains in a consistent state even after unexpected failures. Such failed transactions are then retried by the client.
In our bank example above, assuming A had a balance of $100 and B had a balance of $150, the following would be the entries in the UNDO logs for a committed transaction:
start 100024
write 100024 A $100
write 100024 B $150
commit 100024
It's crucial to understand that each entry in the UNDO log directly corresponds to a modification performed by the client within the scope of a single transaction.
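To make the rollback mechanics concrete, here is a minimal Python sketch of how a recovery routine might scan such an UNDO log and roll back any transaction that has neither a commit nor an abort entry. The record tuples and the in-memory store are hypothetical stand-ins used purely for illustration:
# Minimal UNDO-log rollback sketch; the log records and store are hypothetical.
def rollback_incomplete(log, store):
    # log: list of tuples such as ("start", tid), ("write", tid, key, old_value),
    # ("commit", tid), ("abort", tid). store: dict mapping data item -> value.
    finished = {tid for kind, tid, *_ in log if kind in ("commit", "abort")}
    started = {tid for kind, tid, *_ in log if kind == "start"}
    for tid in started - finished:
        # Walk the log backwards, restoring the old value of each item the
        # failed transaction wrote.
        for kind, rec_tid, *rest in reversed(log):
            if rec_tid == tid and kind == "write":
                key, old_value = rest
                store[key] = old_value
        # Add an abort entry so the rollback is not repeated.
        log.append(("abort", tid))

# Example: transaction 100024 wrote A and B but never committed.
log = [("start", 100024), ("write", 100024, "A", 100), ("write", 100024, "B", 150)]
store = {"A": 50, "B": 200}
rollback_incomplete(log, store)
assert store == {"A": 100, "B": 150}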
Implementation in Distributed SQL Databases
Now, consider a system where data is distributed across multiple nodes. In our bank example, account balance A might reside on one server, while account balance B might reside on another. When a transaction involves modifying data on multiple nodes (like transferring funds between accounts A and B), it must be atomic.
The actions performed by the client within the scope of the transaction are sent to each node independently. The start and commit messages are sent to all the nodes. We can repeat the same process of having UNDO logs:
Node 1
start 100024
write 100024 A $100
commit 100024
Node 2
start 100024
write 100024 B $150
commit 100024
But, if the client crashes after committing a transaction to node 1 but before committing to node 2, we would be left in an inconsistent state.
Node 1
start 100024
write 100024 A $100
commit 100024
Node 2
start 100024
write 100024 B $150
Since node 1 has already committed the transaction, it will not roll back the changes made locally.
This is exactly where we need Two-Phase Commits (2PC). We want all nodes to either commit or abort the entire transaction.
Consensus v/s 2PC
Consensus algorithms, such as Paxos, ensure that all replicas of a state machine execute the exact same sequence of actions or apply the same set of instructions. All replicas are equivalent. This guarantees consistent state across all replicas.
In contrast, 2PC focuses on ensuring that all participants in a distributed transaction take the same action atomically. This means all participants either commit or abort the transaction together. Each participant is entirely a different entity.
Note that consensus involves the concept of a majority of replicas being caught up to the head of the log. In 2PC, however, there is no concept of a majority decision: all participants must commit, or all must abort, the transaction.
Example: Hotel Booking
The example below is taken from Predrag's Blog. I would recommend reading How Paxos and Two-Phase Commit Differ.
Consider a hotel booking scenario. The booking process involves multiple steps:
- Deducting money from the user's bank account.
- Reserving the hotel room.
- Sending a confirmation to the user.
These steps must be executed atomically. If any step fails, the entire transaction should be rolled back.
In this example:
- The participants include the booking website (coordinator and cohort), the bank (cohort), and the hotel (cohort).
- Each participant can be implemented as a highly available replicated state machine (RSM) using algorithms like Paxos. This ensures that each participant itself can tolerate failures and remain operational.
- 2PC ensures that all participants either successfully complete all steps or none of them.
Key Differences Summarized:
- Consensus: Focuses on consistent state across replicas of a single state machine.
- 2PC: Focuses on atomic actions across multiple participants in a distributed transaction.
The Two-Phase Commit Protocol
The primary objective of 2PC is to ensure that all participants involved in a distributed transaction reach a unanimous decision on whether to commit or abort the transaction.
Once all transaction instructions have been logged (written to transaction logs) and applied to the data structures on all nodes, the 2PC protocol determines whether the transaction should be committed or aborted.
The Protocol
The two-phase commit (2PC) protocol proceeds as follows:
Phase 1: Prepare
A designated node (often the client) acts as the coordinator. It broadcasts PREPARE messages to all participating nodes (cohorts). At this stage, each cohort can still decide whether to commit or abort. If locks are used, a cohort will typically vote to commit. However, with multi-version concurrency control, for example, a cohort might abort if the version numbers it read no longer match the latest version numbers after 2PC began.
If a cohort decides to commit, it sends a COMMIT-VOTE to the coordinator. Otherwise, it sends an ABORT-VOTE. Crucially, once a cohort sends a COMMIT-VOTE, it must be able to commit the transaction in the next phase. To guarantee this, cohorts must persistently record their commitment. Common strategies include:
- Writing the commitment to storage, such as adding a "pre-commit T" entry to the transaction log, indicating the cohort's commitment to transaction T.
- Using a Paxos group of replicas to act as a single, highly available cohort. The pre-commit message is logged in the distributed log, ensuring all replicas in the group are aware of the commitment.
This paper focuses on individual nodes with local storage as cohort members, therefore we will only consider the first strategy (logging the commitment to local storage). The second strategy (using Paxos groups) is implemented by Spanner.
Recommended Read: Spanner: Google's Globally-Distributed Database
Phase 2: Commit/Abort
If any cohort voted to abort, the coordinator broadcasts an ABORT message to all cohorts. If all cohorts voted to commit, the coordinator broadcasts a COMMIT message. Upon receiving a COMMIT message, each cohort must commit the transaction. After this, all locks can be released.
Cohorts then acknowledge the commit/abort message with an ACK.
The prepare phase can be viewed as a pre-commit phase, adding an extra step for all cohort members to agree on the transaction's final outcome.
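To tie the two phases together, here is a minimal, single-threaded Python sketch of the coordinator's decision logic under the basic (Presumed Nothing) protocol. The Cohort interface and force_log helper are assumptions made for illustration; a real coordinator would also handle timeouts, retries, and crash recovery:
# Hypothetical 2PC coordinator sketch; names and interfaces are illustrative only.
def force_log(entry):
    # Stand-in for a durable (flushed) write to the coordinator's log.
    print("FORCE LOG:", entry)

def run_2pc(tid, cohorts):
    # Phase 1: ask every cohort to prepare and collect the votes.
    votes = [cohort.prepare(tid) for cohort in cohorts]
    # Commit only if every single cohort voted to commit.
    decision = "COMMIT" if all(v == "COMMIT-VOTE" for v in votes) else "ABORT"
    # The decision must be durable before it is announced to anyone.
    force_log((decision, tid))
    # Phase 2: broadcast the decision and collect ACKs.
    for cohort in cohorts:
        cohort.decide(tid, decision)   # the cohort force-logs the outcome, then ACKs
    return decision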
Under the Hood of 2PC
Let's examine the details of the two-phase commit (2PC) protocol, focusing on network communication and storage writes. Storage writes can be forced or unforced. Forced writes are expensive because they are not buffered by the operating system or the disk write cache; in return, they are guaranteed to be durable. Unforced writes, on the other hand, may still be lost.
Cohort Responsibilities
Before replying to a PREPARE message: A cohort must force-log its COMMIT-VOTE. This ensures that even if the cohort crashes, it remembers its vote.
Before replying to a COMMIT or ABORT message: A cohort must force-log the fact that it has COMMITTED or ABORTED. This guarantees that the cohort knows the final outcome, even after a crash. These logs are crucial. They prevent cohorts from repeatedly querying the coordinator about the transaction's status. If the coordinator received the cohort's reply, it's guaranteed never to receive another query about that transaction from that cohort.
If a cohort crashes before logging the COMMIT or ABORT outcome, upon recovery it must query the coordinator for the final outcome. Since the cohort previously voted to commit, if the coordinator confirms a COMMIT, the cohort proceeds with the commit (it's the cohort's responsibility to honor its commitment). The coordinator, having not received an ACK reply, retains the transaction's final status until all cohorts have acknowledged it.
The cohort doesn't need to keep anything in memory. It just needs to force-log the COMMIT-VOTE and the COMMIT/ABORT outcome, as sketched below.
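Here is the matching cohort-side sketch. The point is the ordering: the COMMIT-VOTE is force-logged before it is sent, and the outcome is force-logged before the ACK. The class and method names are assumptions for illustration:
# Hypothetical cohort sketch; force_log stands in for a durable write to local storage.
class Cohort:
    def __init__(self, force_log):
        self.force_log = force_log

    def prepare(self, tid):
        vote = "COMMIT-VOTE" if self.can_commit(tid) else "ABORT-VOTE"
        if vote == "COMMIT-VOTE":
            # Durably record the promise *before* replying, so a crash cannot
            # make the cohort forget a COMMIT-VOTE it has already sent.
            self.force_log(("pre-commit", tid))
        return vote

    def decide(self, tid, decision):
        # Durably record the outcome *before* acknowledging, so the cohort
        # never needs to ask the coordinator about this transaction again.
        self.force_log((decision, tid))
        self.apply(tid, decision)      # commit or roll back, then release locks
        return "ACK"

    def can_commit(self, tid):
        return True                    # placeholder for real validation logic

    def apply(self, tid, decision):
        pass                           # placeholder for applying or undoing writes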
Coordinator Responsibilities
Protocol Database
The coordinator maintains an in-memory database (a.k.a. a protocol database) with the following information for each transaction:
- TID
- stable: A boolean indicating whether the transaction's existence has been persistently logged.
- state: The transaction's current state (initiated, preparing, aborted, or committed).
- per-cohort information: For each cohort, its ID and vote.
The per-cohort information can be deleted after the coordinator receives the ACK for the second phase message. The entire transaction record can be removed from memory once acknowledgements from all cohorts are received.
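A minimal representation of this protocol database might look like the following; the field names mirror the list above, and the structure is an illustration rather than the paper's exact layout:
# Hypothetical in-memory protocol database entries.
from dataclasses import dataclass, field

@dataclass
class CohortEntry:
    cohort_id: str
    vote: str | None = None       # "COMMIT-VOTE", "ABORT-VOTE", or None if not yet received

@dataclass
class TxnRecord:
    tid: int
    stable: bool = False          # has the transaction's existence been force-logged?
    state: str = "initiated"      # "initiated", "preparing", "aborted", or "committed"
    cohorts: dict[str, CohortEntry] = field(default_factory=dict)

# Protocol database: tid -> TxnRecord; an entry is dropped once all ACKs arrive.
protocol_db: dict[int, TxnRecord] = {}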
Storage Logs
The coordinator's log only needs a forced record of the final transaction state (COMMIT or ABORT). It doesn't need to retain other details. If the coordinator crashes and a cohort inquires about a transaction, the coordinator replies with the logged final state.
If the coordinator crashes before the end of the PREPARE phase (i.e., before the final state is logged), it can safely assume an ABORT and respond accordingly to any cohort queries.
Finally, the coordinator should have an unforced "end" record for each transaction. This is a performance optimization to prevent the coordinator from restarting already-completed transactions; it doesn't affect correctness.
Summary
Coordinator:
- Force-log the COMMIT or ABORT decision before sending the corresponding message.
- Maintain an unforced transaction end record.
Cohort:
- Force-log the COMMIT-VOTE before replying to the PREPARE message.
- Force-log the COMMIT or ABORT action before replying to the COMMIT or ABORT message.
Applications of 2PC
2PC is a critical protocol for ensuring atomicity in distributed systems, meaning that all participating systems either complete a transaction or none do. While commonly associated with databases, 2PC has applications in various scenarios where atomic actions across multiple systems are essential. Here are some key examples:
- Exactly-once message delivery: 2PC guarantees exactly-once delivery of messages, preventing duplicates or lost messages in distributed environments. The messages are delivered, processed, and acked in a single transaction.
- Online booking systems: In systems like flight or hotel reservations, 2PC ensures that all parts of a booking process (e.g., seat reservation, payment processing, confirmation) are completed together or not at all.
- E-commerce transactions: When you make a purchase online, 2PC ensures that the order, payment, and inventory updates are all synchronized.
- Distributed caching: 2PC can be used to maintain consistency across distributed caches, ensuring that all cache nodes have the same data.
- Workflow management: In complex workflows involving multiple systems, 2PC can coordinate the execution of tasks and ensure that all steps are completed successfully.
2PC Variants
The 2PC variant we've been discussing is known as Presumed Nothing (PrN). Other variants exist, including Presumed Abort (PrA), Presumed Commit (PrC), and New Presumed Commit (NPrC), which is the focus of this paper. It's important to understand that these variants are performance optimizations of the core 2PC protocol; they don't change the fundamental logic. However, because 2PC is used in so many distributed systems, even small performance gains can have a significant impact on overall system performance.
Optimization Targets
The goal of all these variants is to minimize the number of forced writes to disk and the number of messages exchanged. The primary targets for optimization are:
- Coordinator: The forced log of the ABORT record.
- Cohort: The forced log of the COMMIT or ABORT action and the reply sent after committing or aborting.
It's crucial to note that certain operations cannot be optimized:
- The forced log of the COMMIT-VOTE is essential and cannot be avoided; cohorts must log their vote before sending it.
- Additionally, the reply to the PREPARE message is also necessary; cohorts must respond.
- Finally, the forced log of the COMMIT decision is essential; the coordinator must log committed transactions.
Presumed Abort (PrA)
In the PrA variant, the coordinator takes a more optimistic approach with regard to aborted transactions. Instead of force-logging an ABORT decision, the coordinator simply removes the transaction's entry from its in-memory database. The default assumption is that a transaction has been aborted unless proven otherwise. If the coordinator crashes and later receives a query about a transaction that is neither in its in-memory database nor in its COMMIT logs, it assumes the transaction was aborted and responds accordingly.
Furthermore, cohorts do not need to force-log the ABORT action or acknowledge an ABORT message. Since the coordinator doesn't maintain any persistent state for aborted transactions, there's no action required upon receiving an ACK. Consequently, cohorts don't even need to send an ABORT reply.
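A minimal sketch of this "presume abort" rule on a recovered coordinator, assuming commit_log is the set of transaction IDs whose COMMIT decision was force-logged before the crash (transactions still in the in-memory protocol database are ignored here for simplicity):
# Hypothetical PrA status-query handler on a recovered coordinator.
def pra_query(tid, commit_log):
    if tid in commit_log:
        return "COMMIT"    # the forced COMMIT record survived the crash
    # Nothing on stable storage about this transaction: presume it aborted.
    return "ABORT"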
Failure Recovery
Note:
- The coordinator force-logs COMMIT decisions.
- The cohorts force-log the COMMIT action and ACK COMMIT messages.
Coordinator Failure:
- Before COMMIT is logged: The transaction is discarded. Any cohort querying the transaction's status is told to abort.
- Before all ACKs are received: The transaction's status is COMMIT if and only if the COMMIT message was successfully logged before the crash; otherwise, the status is ABORT.
Cohort Failure:
- Before receiving the decision message: The cohort queries the coordinator for the decision.
- After aborting (since the ABORT action is not force-logged): The cohort queries the coordinator for the decision again and will idempotently receive ABORT.
Presumed Commit (PrC)
In the PrC variant, the coordinator optimistically assumes that transactions will commit. The default assumption is that a transaction has committed.
However, PrC needs to log the PREPARE message: the coordinator force-logs both the PREPARE and COMMIT records.
Furthermore, cohorts do not need to force-log the COMMIT action or acknowledge a COMMIT message.
Failure Recovery
Notes:
- The coordinator force-logs PREPARE and COMMIT messages.
- The cohorts force-log the ABORT action and ACK ABORT messages.
Coordinator Failure
- Before PREPARE is logged: The transaction is discarded.
- Before COMMIT is logged: The coordinator requests votes again from the cohorts.
- Before ABORT message is sent: The coordinator requests votes again.
- Before ACKs on ABORT: The coordinator requests ACKs again.
Cohort Failure
- Before receiving the decision message: The cohort queries the coordinator.
- After committing (since the COMMIT action is not force-logged): The cohort simply queries the coordinator again and will idempotently receive COMMIT.
Note: PrC is not commit-only. Transactions can still be aborted based on the cohorts' votes.
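For contrast with PrA, here is a sketch of how a recovered PrC coordinator might answer a cohort's status query, assuming prepare_log and commit_log are the sets of force-logged PREPARE and COMMIT records that survived the crash (resolicit_votes is a hypothetical placeholder):
# Hypothetical PrC status-query handler on a recovered coordinator.
def prc_query(tid, prepare_log, commit_log):
    if tid in commit_log:
        return "COMMIT"              # a forced COMMIT record survived the crash
    if tid in prepare_log:
        # Prepared but undecided when the coordinator crashed: re-collect votes,
        # as described in the coordinator-failure steps above.
        return resolicit_votes(tid)
    # No trace on stable storage: presume the transaction committed.
    return "COMMIT"

def resolicit_votes(tid):
    # Placeholder: re-run the PREPARE phase for tid and decide from the new votes.
    ...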
ACKs and Garbage Collection
Our focus here is on garbage collection (GC) of transaction data on the coordinator side, not on the cohorts.
Presumed Abort (PrA):
The final ACK for a COMMIT message triggers the removal of the COMMIT decision from persistent storage and the removal of the committed transaction from the protocol database.
ABORT decisions are never logged. Aborted transactions are removed from the protocol database as soon as any cohort sends an ABORT-VOTE.
Presumed Commit (PrC):
The final ACK for an ABORT message triggers the removal of the aborted transaction from the protocol database and the removal of the PREPARE message from persistent storage. The PREPARE message can also be removed after the COMMIT message is logged.
Upon receiving all ACKs for COMMIT, the COMMIT message can be removed, and the committed transaction can be removed from the protocol database.
A key difference is that in PrA, ACKs for ABORT are neither required for correctness nor for garbage collection. In PrC, ACKs for COMMIT are not required for correctness, but they are necessary for garbage collection.
Post-Recovery Behavior and GC
PrA: If an ACK for COMMIT is not received, the coordinator will still have the COMMIT log and will query the cohorts for the transaction's outcome. ACKs for ABORT are never required. After recovery, aborted transactions are simply forgotten.
PrC: If an ACK for ABORT is not received, the coordinator will still have the PREPARE message. It will query the cohorts for the decision. Upon receiving confirmation of the abort, it will resend the ABORT message and collect ACKs again. If an ACK for COMMIT is not received, the coordinator will have the COMMIT message and will resend the COMMIT message and collect ACKs.
In both cases (PrA and PrC), eventual garbage collection of entries in both the protocol database and persistent storage on the coordinator is guaranteed.
Summary
PrA and PrC optimize storage writes, summarized as follows (note again that cohorts always force-log their COMMIT-VOTE and their reply, and the coordinator always force-logs the COMMIT decision):
- PrN: The coordinator force-logs both COMMIT and ABORT decisions; cohorts force-log the COMMIT/ABORT action and ACK both outcomes.
- PrA: The coordinator never logs ABORT decisions; cohorts do not force-log the ABORT action and do not ACK ABORT messages.
- PrC: The coordinator additionally force-logs the PREPARE record; cohorts do not force-log the COMMIT action and do not ACK COMMIT messages.
At first glance, PrC might seem superior. If we assume that most transactions commit successfully, avoiding ACKs for commits appears advantageous. However, the assumption that most transactions commit is not always valid. Furthermore, PrC introduces an additional forced log write for the PREPARE message. The paper argues that PrA is often a better choice, and this summary highlights why.
New Presumed Commit (NPrC)
NPrC is a further optimization of PrC designed to reduce logging overhead, especially in the optimistic scenario where crashes are infrequent. It achieves this without affecting the correctness of the 2PC protocol.
Data Structures
- tidl: The lowest transaction ID that is not recorded. Transactions with IDs lower than tidl are presumed committed.
- tidh: The highest transaction ID that is not recorded. Transactions with IDs higher than tidh are new.
- REC: The set of recent transactions.
- COM: The subset of REC containing transactions that are known to be committed. Committed transactions have forced COMMIT logs, as in PrC.
- IN: The set of transactions for which nothing has been written to storage. If a crash occurs, all transactions in IN are presumed aborted.
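A rough Python sketch of this bookkeeping on the coordinator; the field names follow the list above and the structure is purely illustrative:
# Hypothetical NPrC coordinator bookkeeping.
from dataclasses import dataclass, field

@dataclass
class NPrCState:
    tidl: int = 0                               # lowest recent transaction ID not recorded
    tidh: int = 0                               # highest recent transaction ID not recorded
    rec: set[int] = field(default_factory=set)  # REC: the recent transactions
    com: set[int] = field(default_factory=set)  # COM: recent transactions with forced COMMIT logs

    def in_set(self):
        # IN: recent transactions with nothing on stable storage
        # (simplified here as REC minus COM).
        return self.rec - self.com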
Logging tidh
Logged periodically (and can be forced, as it's infrequent) as new transactions with higher IDs arrive.
Logging tidl
Logged when the transaction with the lowest ID in REC is committed. The actual logged value might be higher than the current tidl because other transactions in the range might have already committed.
- The cost of logging tidl can be amortized with the cost of logging the COMMIT for the transaction itself.
- If the transaction is ABORTed, tidl can be logged unforced after all ACKs are received. Since all ACKs are received, it means no one will ever query about the transaction again, so tidl can safely be advanced.
Crash Recovery
tidh is set to a very high value to avoid overlap with previous transaction IDs. IN is rebuilt using the known values of tidl, tidh, and the COM set. After this, recovery proceeds as follows, depending on the point at which the coordinator crashed (a small sketch follows this list):
- Before tidl is advanced: COM is used to determine whether the transaction actually committed.
- After tidl is advanced: The transaction is presumed committed.
- Before tidl is advanced past the ID of an aborted transaction: The transaction becomes part of IN upon recovery and is thus presumed as aborted.
- After tidl is advanced past the ID of an aborted transaction: All ACKs for the abort would have been received, and no one should query about the transaction again.
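Putting these rules together, here is a sketch of how a recovered NPrC coordinator might answer a cohort's status query, assuming the NPrCState sketch from above has been rebuilt from tidl, tidh, and COM:
# Hypothetical post-crash status query under NPrC.
def nprc_query(tid, state):
    if tid < state.tidl:
        return "COMMIT"    # tidl has advanced past it: presumed committed
                           # (aborted transactions below tidl are never queried again)
    if tid in state.com:
        return "COMMIT"    # a forced COMMIT record survived the crash
    return "ABORT"         # otherwise the transaction is in IN: presumed aborted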
Garbage Collection
- In PrA, abort decisions are not logged. If the coordinator fails before receiving votes, the transaction is considered aborted, and no garbage is generated.
- In PrC, if the coordinator fails before receiving votes, it has logged PREPARE messages for each transaction, allowing it to re-collect votes after recovery. No garbage is generated.
NPrC v/s PrC
After a crash, an NPrC coordinator might have more work to do than a PrC coordinator because it has less persistent information. However, crashes are assumed to be rare, so this extra work during recovery is generally not a significant performance bottleneck.
Read-Only Transactions
Serializable queries require read-only transactions (also known as snapshot transactions). These transactions acquire locks, but their outcome (commit or abort) is irrelevant to the cohort.
Cohorts respond to a PREPARE message with a READ-ONLY-VOTE. The coordinator, upon receiving only READ-ONLY-VOTEs from all cohorts, recognizes the transaction as read-only and omits sending COMMIT/ABORT messages.
In PrC, the coordinator doesn't need to log a COMMIT/ABORT message for a read-only transaction, but it must delete (unforced) the logged PREPARE message. NPrC offers further optimization: the PREPARE message is never logged, and the transaction is simply forgotten after all cohorts respond with READ-ONLY-VOTE. Crucially, no further communication about the transaction result occurs.
Serializability in 2PC
Violation in Read-Only Transactions
The most straightforward violation occurs with read-only transactions. Participants in a read-only transaction release their locks after sending their READ-ONLY-VOTE. This can break serializability because subsequent transactions might modify the data before the read-only transaction's effective commit point.
Essentially, the read-only transaction might read a snapshot of the data that's no longer consistent with the actual position of the transaction in the order.
Violation in Read-Write Transactions
Serializability requires some total order of all transactions. If a transaction T1 comes before another T2 in this order, T2 must see all of T1's effects. Conversely, if T2 comes before T1, T2 must not see any of T1's effects. 2PC can violate this.
Consider the bank database with accounts A, B, and C on nodes 1, 2, and 3. A has $100, B has $150, and C has $0. Two transactions, T1 (transfer $50 from A to B) and T2 (transfer $200 from B to C), are submitted, with T2 following T1.
- Transaction T1 is initiated.
- Nodes 1 and 2 accept T1 and acknowledge the PREPARE message.
- Node 2 commits T1 and releases its locks before Node 1 commits.
- Transaction T2 is initiated.
- Nodes 2 and 3 accept T2. Critically, Node 2 now reflects the updated balance of B (due to the partial commit of T1).
- T2 commits on both Nodes 2 and 3.
The problem is that T1 is still uncommitted on Node 1 (and hence only partially committed) when T2 reads the updated balance of B on Node 2. This amounts to read-uncommitted behavior, where T2 sees the effects of T1 before T1 has fully committed, violating serializability.
In essence, 2PC, in its basic form, doesn't prevent transactions from observing intermediate, inconsistent states of other transactions, thus failing to guarantee serializability and therefore strict serializability.
Solutions
Two primary approaches address the serializability concern:
- Coordinator Timestamp: The coordinator assigns commit timestamps. This approach requires that the coordinator never change; otherwise, timestamps may go out of order.
- Commit Timestamp: Each COMMIT-VOTE is assigned a timestamp range, and locks are released once the upper bound of that range is reached. This eliminates the need for the coordinator to track anything. The cohort's COMMIT-VOTE specifies a permissible timestamp range, ensuring no conflicting transactions fall within that range. A new transaction receives a timestamp range higher than those of all existing transactions (see the sketch after this list).
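Here is a tiny sketch of the second approach: each COMMIT-VOTE carries a permissible timestamp range, and the commit timestamp must fall in the intersection of all of them. It is purely illustrative; a real system would also have to abort or retry when the intersection is empty:
# Hypothetical commit-timestamp selection from per-cohort ranges.
def choose_commit_timestamp(ranges):
    # ranges: list of (low, high) pairs, one per COMMIT-VOTE.
    low = max(lo for lo, _ in ranges)
    high = min(hi for _, hi in ranges)
    if low > high:
        return None        # no common timestamp: the transaction must abort
    return low             # any value in [low, high] works; pick the earliest

# Example: three cohorts propose overlapping ranges.
print(choose_commit_timestamp([(10, 20), (12, 25), (8, 18)]))   # -> 12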
Bonus - Three-Phase Commits (3PC)
2PC has significant vulnerabilities. It's not resilient to coordinator failures, cohort failures, or network partitions. Any of these events can halt 2PC progress.
A key concern arises after the coordinator has already logged a COMMIT or ABORT decision. While the coordinator must communicate the outcome to the client as soon as possible, it faces a blocking problem. Before it can respond, it needs acknowledgements from all cohorts:
- PrN and PrA: ACKs from all cohorts for the commit request.
- PrC: ACKs from all cohorts for the abort request.
Indefinite Blocking
If a cohort crashes before the COMMIT phase, 2PC recovery is not possible until that cohort restarts. The coordinator can't finalize the transaction because it's waiting for confirmation from the unavailable cohort. The protocol is blocked indefinitely.
Three-Phase Commit (3PC)
3PC addresses this blocking issue by introducing a PRE-COMMIT phase between PREPARE and COMMIT.
- If the coordinator crashes before receiving PRE-COMMIT acks, the decision is always ABORT.
- If the coordinator has all PRE-COMMIT acks, the decision is always COMMIT.
This crucial change means that once the coordinator logs a COMMIT decision, it can safely commit regardless of cohort availability. Any cohort that misses the COMMIT message will eventually receive it.
The coordinator is no longer dependent on cohorts after logging the COMMIT decision, so it can immediately inform the client. Before logging the COMMIT decision, the decision is always ABORT. Therefore, cohort crashes no longer block the coordinator from responding to the client.
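A hedged sketch of the coordinator's side with the extra phase, reusing the style of the earlier 2PC sketch (the cohort methods and force_log helper are again illustrative assumptions):
# Hypothetical 3PC coordinator sketch.
def run_3pc(tid, cohorts, force_log):
    # Phase 1: PREPARE, exactly as in 2PC.
    if not all(c.prepare(tid) == "COMMIT-VOTE" for c in cohorts):
        force_log(("ABORT", tid))
        for c in cohorts:
            c.decide(tid, "ABORT")
        return "ABORT"

    # Phase 2: PRE-COMMIT. A crash before all acks arrive still means ABORT.
    if not all(c.pre_commit(tid) == "ACK" for c in cohorts):
        force_log(("ABORT", tid))
        for c in cohorts:
            c.decide(tid, "ABORT")
        return "ABORT"

    # Phase 3: all PRE-COMMIT acks are in, so the decision is COMMIT. It can be
    # logged and reported to the client without waiting on any cohort.
    force_log(("COMMIT", tid))
    for c in cohorts:
        c.decide(tid, "COMMIT")    # stragglers will eventually learn COMMIT
    return "COMMIT"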
Of course, 3PC adds extra round trips. It's often preferable to use 2PC with highly resilient cohort members (like Paxos groups) where possible, mitigating the risk of cohort failures.
Paper Review
I personally found this paper very difficult to read. It felt cramped at times, and I even suspected some errors. However, everything in the paper is correct. The challenge lies in understanding all aspects, even though the underlying concepts seem simple. Despite this, the paper serves as a valuable first step in understanding distributed SQL databases. I highly recommend it.