Lampson and Lomet's 1993 paper, from the now-defunct DEC Cambridge Research Lab, remains a classic.
The paper's concepts are hard to grasp. My notes below are fairly detailed, yet they may still require multiple readings to fully follow the reasoning. Let's begin by reviewing fundamental concepts of SQL databases.
Serializability
Transaction serializability guarantees that, while transactions may execute concurrently for performance reasons, the final outcome is effectively equivalent to some sequential execution of those same transactions. The "effectively" part means that the system ensures a consistent, serializable result even if the underlying execution is parallelized.
Strict serializability builds upon serializability by adding a temporal dimension. It mandates that once a transaction commits, its effects are immediately visible to all clients (a.k.a. external consistency). This differs from linearizability, which focuses on single-object operations, as strict serializability applies to multi-object transactions (e.g., updates across multiple rows).
Serializability is a transaction isolation level, one of several including read uncommitted, read committed, and repeatable read. This discussion focuses exclusively on serializable transactions. These transactions are implemented using either:
- Locks - Prevent concurrent execution of transactions that modify the same object. This only works for standalone, non-distributed SQL databases.
- Timestamp Ordering - Imposes a global ordering of transactions based on timestamps from a single source of truth. This approach can offer serializability, and indeed strict serializability; however, such a source of truth is difficult to obtain for most systems.
Single Data Transactions
Single-data transactions are those that modify only a single data item (e.g., a single row). These are what NoSQL databases offer, since multi-data transactions are not supported for performance reasons. The item being modified in a transaction resides on a set of replicas, and all replicas are expected to apply the same modification as a result of the transaction.
These single-item transactions are also commonly implemented by distributed lock services like ZooKeeper, using a distributed log built on consensus algorithms. These implementations also ensure linearizability.
In this article, we will focus on multi-data transactions. For more details on how to implement single-data transactions, refer to ZooKeeper: Wait-free coordination for Internet-scale systems.
Multi-Data Transactions
Multi-data transactions are those that modify multiple data items (e.g., multiple rows). The term transaction commonly refers to a multi-item transaction, which is a core feature of SQL databases.
Consider a simple banking scenario with accounts held by A and B. A transfers money to B.
The client code will be something like:
txn = Transaction.start();
balA = txn.Read(A);
balB = txn.Read(B);
txn.Write(A, balA - $50);
txn.Write(B, balB + $50);
txn.Commit();
The transaction needs to ensure:
- Atomicity - The transfer either completes fully or not at all.
- Consistency - The total balance across all accounts remains constant after any transfer.
Implementation in Non-Distributed SQL Databases
In a non-distributed database, all data resides on a single server. B+ trees, indexed by primary keys with leaf nodes pointing to row data, are a common data structure used in these systems.
Transactions are typically implemented by acquiring locks on the data items that are read/modified. The data items are then updated, and finally, the transaction commits. However, if a transaction fails before reaching the commit stage, a mechanism is required to revert any changes made. This is where transaction logs provide the necessary means to handle pre-commit failures.
Transaction Logs
Transaction logs are essential for database recovery and ensure data consistency. Two primary types exist:
- UNDO Logs: These logs are written before data modifications. They capture the original state of the data, allowing for rollback operations in case of failures.
- REDO Logs: These logs are written after data modifications. They record the changes made to the data, enabling the system to reapply these changes during recovery.
UNDO logs contain the following types of records:
- [start, TID]: Indicates the beginning of a transaction with identifier TID.
- [write, TID, X, old_value]: Records the modification of data item X by transaction TID. old_value stores the original value of X before the modification.
- [commit, TID]: Signals the successful completion of transaction TID.
- [abort, TID]: Indicates that transaction TID has been aborted.
If a transaction fails (i.e., lacks a commit or abort entry in the UNDO log), the rollback system:
- Iterates through the log entries for the failed transaction.
- For each write entry, restores the original data value (old_value) to the corresponding data item.
- Adds an abort entry to the UNDO log for the failed transaction, marking its termination.
This process ensures that the database remains in a consistent state even after unexpected failures. Such failed transactions are then retried by the client.
In our bank example above, assuming A had a balance of $100 and B had a balance of $150, the following would be the entries in the UNDO logs for a committed transaction:
start 100024
write 100024 A $100
write 100024 B $150
commit 100024
It's crucial to understand that each entry in the UNDO log directly corresponds to a modification performed by the client within the scope of a single transaction.
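To make the rollback mechanics concrete, here is a minimal Python sketch of how a recovery routine might scan such an UNDO log and roll back any transaction that has neither a commit nor an abort entry. The record tuples and the in-memory store are hypothetical stand-ins used purely for illustration:
# Minimal UNDO-log rollback sketch; the log records and store are hypothetical.
def rollback_incomplete(log, store):
    # log: list of tuples such as ("start", tid), ("write", tid, key, old_value),
    # ("commit", tid), ("abort", tid). store: dict mapping data item -> value.
    finished = {tid for kind, tid, *_ in log if kind in ("commit", "abort")}
    started = {tid for kind, tid, *_ in log if kind == "start"}
    for tid in started - finished:
        # Walk the log backwards, restoring the old value of each item the
        # failed transaction wrote.
        for kind, rec_tid, *rest in reversed(log):
            if rec_tid == tid and kind == "write":
                key, old_value = rest
                store[key] = old_value
        # Add an abort entry so the rollback is not repeated.
        log.append(("abort", tid))

# Example: transaction 100024 wrote A and B but never committed.
log = [("start", 100024), ("write", 100024, "A", 100), ("write", 100024, "B", 150)]
store = {"A": 50, "B": 200}
rollback_incomplete(log, store)
assert store == {"A": 100, "B": 150}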
Implementation in Distributed SQL Databases
Now, consider a system where data is distributed across multiple nodes. In our bank example, account balance A might reside on one server, while account balance B might reside on another. When a transaction involves modifying data on multiple nodes (like transferring funds between accounts A and B), it must be atomic.
The actions performed by the client within the scope of the transaction are sent to each node independently. The start and commit messages are sent to all the nodes. We can repeat the same process of having UNDO logs:
Node 1
start 100024
write 100024 A $100
commit 100024
Node 2
start 100024
write 100024 B $150
commit 100024
But, if the client crashes after committing a transaction to node 1 but before committing to node 2, we would be left in an inconsistent state.
Node 1
start 100024
write 100024 A $100
commit 100024
Node 2
start 100024
write 100024 B $150
Since node 1 has already committed the transaction, it will not roll back the changes made locally.
This is exactly where we need Two-Phase Commits (2PC). We want all nodes to either commit or abort the entire transaction.
Consensus v/s 2PC
Consensus algorithms, such as Paxos, ensure that all replicas of a state machine execute the exact same sequence of actions or apply the same set of instructions. All replicas are equivalent. This guarantees consistent state across all replicas.
In contrast, 2PC focuses on ensuring that all participants in a distributed transaction take the same action atomically. This means all participants either commit or abort the transaction together. Each participant is entirely a different entity.
Note that consensus involves the concept of a majority of replicas being caught up to the head of the log. In 2PC, however, there is no concept of a majority decision: all participants must commit, or all must abort, the transaction.
Example: Hotel Booking
The example below is taken from Predrag's Blog. I would recommend reading How Paxos and Two-Phase Commit Differ.
Consider a hotel booking scenario. The booking process involves multiple steps:
- Deducting money from the user's bank account.
- Reserving the hotel room.
- Sending a confirmation to the user.
These steps must be executed atomically. If any step fails, the entire transaction should be rolled back.
In this example:
- The participants include the booking website (coordinator and cohort), the bank (cohort), and the hotel (cohort).
- Each participant can be implemented as a highly available replicated state machine (RSM) using algorithms like Paxos. This ensures that each participant itself can tolerate failures and remain operational.
- 2PC ensures that all participants either successfully complete all steps or none of them.
Key Differences Summarized:
- Consensus: Focuses on consistent state across replicas of a single state machine.
- 2PC: Focuses on atomic actions across multiple participants in a distributed transaction.
The Two-Phase Commit Protocol
The primary objective of 2PC is to ensure that all participants involved in a distributed transaction reach a unanimous decision on whether to commit or abort the transaction.
Once all transaction instructions have been logged (written to transaction logs) and applied to the data structures on all nodes, the 2PC protocol determines whether the transaction should be committed or aborted.
The Protocol
The two-phase commit (2PC) protocol proceeds as follows:
Phase 1: Prepare
A designated node (often the client) acts as the coordinator. It broadcasts PREPARE messages to all participating nodes (cohorts). At this stage, each cohort can still decide whether to commit or abort. If locks are used, a cohort will typically vote to commit. However, with multi-version concurrency control, for example, a cohort might abort if the version numbers it read no longer match the latest version numbers after 2PC began.
If a cohort decides to commit, it sends a COMMIT-VOTE to the coordinator. Otherwise, it sends an ABORT-VOTE. Crucially, once a cohort sends a COMMIT-VOTE, it must be able to commit the transaction in the next phase. To guarantee this, cohorts must persistently record their commitment. Common strategies include:
- Writing the commitment to storage, such as adding a "pre-commit T" entry to the transaction log, indicating the cohort's commitment to transaction T.
- Using a Paxos group of replicas to act as a single, highly available cohort. The pre-commit message is logged in the distributed log, ensuring all replicas in the group are aware of the commitment.
This paper focuses on individual nodes with local storage as cohort members, therefore we will only consider the first strategy (logging the commitment to local storage). The second strategy (using Paxos groups) is implemented by Spanner.
Recommended Read: Spanner: Google's Globally-Distributed Database
Phase 2: Commit/Abort
If any cohort voted to abort, the coordinator broadcasts an ABORT message to all cohorts. If all cohorts voted to commit, the coordinator broadcasts a COMMIT message. Upon receiving a COMMIT message, each cohort must commit the transaction. After this, all locks can be released.
Cohorts then acknowledge the commit/abort message with an ACK.
The prepare phase can be viewed as a pre-commit phase, adding an extra step for all cohort members to agree on the transaction's final outcome.
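To tie the two phases together, here is a minimal, single-threaded Python sketch of the coordinator's decision logic under the basic (Presumed Nothing) protocol. The Cohort interface and force_log helper are assumptions made for illustration; a real coordinator would also handle timeouts, retries, and crash recovery:
# Hypothetical 2PC coordinator sketch; names and interfaces are illustrative only.
def force_log(entry):
    # Stand-in for a durable (flushed) write to the coordinator's log.
    print("FORCE LOG:", entry)

def run_2pc(tid, cohorts):
    # Phase 1: ask every cohort to prepare and collect the votes.
    votes = [cohort.prepare(tid) for cohort in cohorts]
    # Commit only if every single cohort voted to commit.
    decision = "COMMIT" if all(v == "COMMIT-VOTE" for v in votes) else "ABORT"
    # The decision must be durable before it is announced to anyone.
    force_log((decision, tid))
    # Phase 2: broadcast the decision and collect ACKs.
    for cohort in cohorts:
        cohort.decide(tid, decision)   # the cohort force-logs the outcome, then ACKs
    return decision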
Under the Hood of 2PC
Let's examine the details of the two-phase commit (2PC) protocol, focusing on network communication and storage writes. Storage writes can be forced or unforced. Forced writes are expensive because they are not buffered by the operating system or the disk write cache; in return, they are guaranteed to be durable. Unforced writes, on the other hand, may still be lost.
Cohort Responsibilities
Before replying to a PREPARE message: A cohort must force-log its COMMIT-VOTE. This ensures that even if the cohort crashes, it remembers its vote.
Before replying to a COMMIT or ABORT message: A cohort must force-log the fact that it has COMMITTED or ABORTED. This guarantees that the cohort knows the final outcome, even after a crash. These logs are crucial. They prevent cohorts from repeatedly querying the coordinator about the transaction's status. If the coordinator received the cohort's reply, it's guaranteed never to receive another query about that transaction from that cohort.
If a cohort crashes before logging the COMMIT or ABORT outcome, upon recovery it must query the coordinator for the final outcome. Since the cohort previously voted to commit, if the coordinator confirms a COMMIT, the cohort proceeds with the commit (it's the cohort's responsibility to honor its commitment). The coordinator, having not received an ACK reply, retains the transaction's final status until all cohorts have acknowledged it.
The cohort doesn't need to keep anything in memory. It just needs to force-log the COMMIT-VOTE and the COMMIT/ABORT outcome, as sketched below.
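Here is the matching cohort-side sketch. The point is the ordering: the COMMIT-VOTE is force-logged before it is sent, and the outcome is force-logged before the ACK. The class and method names are assumptions for illustration:
# Hypothetical cohort sketch; force_log stands in for a durable write to local storage.
class Cohort:
    def __init__(self, force_log):
        self.force_log = force_log

    def prepare(self, tid):
        vote = "COMMIT-VOTE" if self.can_commit(tid) else "ABORT-VOTE"
        if vote == "COMMIT-VOTE":
            # Durably record the promise *before* replying, so a crash cannot
            # make the cohort forget a COMMIT-VOTE it has already sent.
            self.force_log(("pre-commit", tid))
        return vote

    def decide(self, tid, decision):
        # Durably record the outcome *before* acknowledging, so the cohort
        # never needs to ask the coordinator about this transaction again.
        self.force_log((decision, tid))
        self.apply(tid, decision)      # commit or roll back, then release locks
        return "ACK"

    def can_commit(self, tid):
        return True                    # placeholder for real validation logic

    def apply(self, tid, decision):
        pass                           # placeholder for applying or undoing writes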
Coordinator Responsibilities
Protocol Database
The coordinator maintains an in-memory database (a.k.a. a protocol database) with the following information for each transaction:
- TID
- stable: A boolean indicating whether the transaction's existence has been persistently logged.
- state: The transaction's current state (initiated, preparing, aborted, or committed).
- per-cohort information: For each cohort, its ID and vote.
The per-cohort information can be deleted after the coordinator receives the ACK for the second phase message. The entire transaction record can be removed from memory once acknowledgements from all cohorts are received.
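A minimal representation of this protocol database might look like the following; the field names mirror the list above, and the structure is an illustration rather than the paper's exact layout:
# Hypothetical in-memory protocol database entries.
from dataclasses import dataclass, field

@dataclass
class CohortEntry:
    cohort_id: str
    vote: str | None = None       # "COMMIT-VOTE", "ABORT-VOTE", or None if not yet received

@dataclass
class TxnRecord:
    tid: int
    stable: bool = False          # has the transaction's existence been force-logged?
    state: str = "initiated"      # "initiated", "preparing", "aborted", or "committed"
    cohorts: dict[str, CohortEntry] = field(default_factory=dict)

# Protocol database: tid -> TxnRecord; an entry is dropped once all ACKs arrive.
protocol_db: dict[int, TxnRecord] = {}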
Storage Logs
The coordinator's log only needs a forced record of the final transaction state (COMMIT or ABORT). It doesn't need to retain other details. If the coordinator crashes and a cohort inquires about a transaction, the coordinator replies with the logged final state.
If the coordinator crashes before the end of the PREPARE phase (i.e., before the final state is logged), it can safely assume an ABORT and respond accordingly to any cohort queries.
Finally, the coordinator should have an unforced "end" record for each transaction. This is a performance optimization to prevent the coordinator from restarting already-completed transactions; it doesn't affect correctness.
Summary
Coordinator:
- Force-log the COMMIT or ABORT decision before sending the corresponding message.
- Maintain an unforced transaction end record.
Cohort:
- Force-log the COMMIT-VOTE before replying to the PREPARE message.
- Force-log the COMMIT or ABORT action before replying to the COMMIT or ABORT message.
Applications of 2PC
2PC is a critical protocol for ensuring atomicity in distributed systems, meaning that all participating systems either complete a transaction or none do. While commonly associated with databases, 2PC has applications in various scenarios where atomic actions across multiple systems are essential. Here are some key examples:
- Exactly-once message delivery: 2PC guarantees exactly-once delivery of messages, preventing duplicates or lost messages in distributed environments. The messages are delivered, processed, and acked in a single transaction.
- Online booking systems: In systems like flight or hotel reservations, 2PC ensures that all parts of a booking process (e.g., seat reservation, payment processing, confirmation) are completed together or not at all.
- E-commerce transactions: When you make a purchase online, 2PC ensures that the order, payment, and inventory updates are all synchronized.
- Distributed caching: 2PC can be used to maintain consistency across distributed caches, ensuring that all cache nodes have the same data.
- Workflow management: In complex workflows involving multiple systems, 2PC can coordinate the execution of tasks and ensure that all steps are completed successfully.
2PC Variants
The 2PC variant we've been discussing is known as Presumed Nothing (PrN). Other variants exist, including Presumed Abort (PrA), Presumed Commit (PrC), and New Presumed Commit (NPrC), which is the focus of this paper. It's important to understand that these variants are performance optimizations of the core 2PC protocol; they don't change the fundamental logic. However, because 2PC is used in so many distributed systems, even small performance gains can have a significant impact on overall system performance.
Optimization Targets
The goal of all these variants is to minimize the number of forced writes to disk and the number of messages exchanged. The primary targets for optimization are:
- Coordinator: The forced log of the ABORT record.
- Cohort: The forced log of the COMMIT or ABORT action and the reply sent after committing or aborting.
It's crucial to note that certain operations cannot be optimized:
- The forced log of the COMMIT-VOTE is essential and cannot be avoided; cohorts must log their vote before sending it.
- Additionally, the reply to the PREPARE message is also necessary; cohorts must respond.
- Finally, the forced log of the COMMIT decision is essential; the coordinator must log committed transactions.
Presumed Abort (PrA)
In the PrA variant, the coordinator takes a more optimistic approach with regard to aborted transactions. Instead of force-logging an ABORT decision, the coordinator simply removes the transaction's entry from its in-memory database. The default assumption is that a transaction has been aborted unless proven otherwise. If the coordinator crashes and later receives a query about a transaction that is neither in its in-memory database nor in its COMMIT logs, it assumes the transaction was aborted and responds accordingly.
Furthermore, cohorts do not need to force-log the ABORT action or acknowledge an ABORT message. Since the coordinator doesn't maintain any persistent state for aborted transactions, there's no action required upon receiving an ACK. Consequently, cohorts don't even need to send an ABORT reply.
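A minimal sketch of this "presume abort" rule on a recovered coordinator, assuming commit_log is the set of transaction IDs whose COMMIT decision was force-logged before the crash (transactions still in the in-memory protocol database are ignored here for simplicity):
# Hypothetical PrA status-query handler on a recovered coordinator.
def pra_query(tid, commit_log):
    if tid in commit_log:
        return "COMMIT"    # the forced COMMIT record survived the crash
    # Nothing on stable storage about this transaction: presume it aborted.
    return "ABORT"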
Failure Recovery
Note:
- The coordinator force-logs COMMIT decisions.
- The cohorts force-log the COMMIT action and ACK COMMIT messages.
Coordinator Failure:
- Before COMMIT is logged: The transaction is discarded. Any cohort querying the transaction's status is told to abort.
- Before all ACKs are received: The transaction's status is COMMIT if and only if the COMMIT message was successfully logged before the crash; otherwise, the status is ABORT.
Cohort Failure:
- Before receiving the decision message: The cohort queries the coordinator for the decision.
- After aborting (since the ABORT action is not force-logged): The cohort queries the coordinator for the decision again and will idempotently receive ABORT.
Presumed Commit (PrC)
In the PrC variant, the coordinator optimistically assumes that transactions will commit. The default assumption is that a transaction has committed.
However, PrC needs to log the PREPARE message: the coordinator force-logs both the PREPARE and COMMIT records.
Furthermore, cohorts do not need to force-log the COMMIT action or acknowledge a COMMIT message.
Failure Recovery
Notes:
- The coordinator force-logs PREPARE and COMMIT messages.
- The cohorts force-log the ABORT action and ACK ABORT messages.
Coordinator Failure
- Before PREPARE is logged: The transaction is discarded.
- Before COMMIT is logged: The coordinator requests votes again from the cohorts.
- Before ABORT message is sent: The coordinator requests votes again.
- Before ACKs on ABORT: The coordinator requests ACKs again.
Cohort Failure
- Before receiving the decision message: The cohort queries the coordinator.
- After committing (since the COMMIT action is not force-logged): The cohort simply queries the coordinator again and will idempotently receive COMMIT.
Note: PrC is not commit-only. Transactions can still be aborted based on the cohorts' votes.
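For contrast with PrA, here is a sketch of how a recovered PrC coordinator might answer a cohort's status query, assuming prepare_log and commit_log are the sets of force-logged PREPARE and COMMIT records that survived the crash (resolicit_votes is a hypothetical placeholder):
# Hypothetical PrC status-query handler on a recovered coordinator.
def prc_query(tid, prepare_log, commit_log):
    if tid in commit_log:
        return "COMMIT"              # a forced COMMIT record survived the crash
    if tid in prepare_log:
        # Prepared but undecided when the coordinator crashed: re-collect votes,
        # as described in the coordinator-failure steps above.
        return resolicit_votes(tid)
    # No trace on stable storage: presume the transaction committed.
    return "COMMIT"

def resolicit_votes(tid):
    # Placeholder: re-run the PREPARE phase for tid and decide from the new votes.
    ...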
ACKs and Garbage Collection
Our focus here is on garbage collection (GC) of transaction data on the coordinator side, not on the cohorts.
Presumed Abort (PrA):
The final ACK for a COMMIT message triggers the removal of the COMMIT decision from persistent storage and the removal of the committed transaction from the protocol database.
ABORT decisions are never logged. Aborted transactions are removed from the protocol database as soon as any cohort sends an ABORT-VOTE.
Presumed Commit (PrC):
The final ACK for an ABORT message triggers the removal of the aborted transaction from the protocol database and the removal of the PREPARE message from persistent storage. The PREPARE message can also be removed after the COMMIT message is logged.
Upon receiving all ACKs for COMMIT, the COMMIT message can be removed, and the committed transaction can be removed from the protocol database.
A key difference is that in PrA, ACKs for ABORT are neither required for correctness nor for garbage collection. In PrC, ACKs for COMMIT are not required for correctness, but they are necessary for garbage collection.
Post-Recovery Behavior and GC
PrA: If an ACK for COMMIT is not received, the coordinator will still have the COMMIT log and will query the cohorts for the transaction's outcome. ACKs for ABORT are never required. After recovery, aborted transactions are simply forgotten.
PrC: If an ACK for ABORT is not received, the coordinator will still have the PREPARE message. It will query the cohorts for the decision. Upon receiving confirmation of the abort, it will resend the ABORT message and collect ACKs again. If an ACK for COMMIT is not received, the coordinator will have the COMMIT message and will resend the COMMIT message and collect ACKs.
In both cases (PrA and PrC), eventual garbage collection of entries in both the protocol database and persistent storage on the coordinator is guaranteed.
Summary
PrA and PrC optimize storage writes, summarized as follows (note again that cohorts always force-log their COMMIT-VOTE and their reply, and the coordinator always force-logs the COMMIT decision):
- PrN: The coordinator force-logs both COMMIT and ABORT decisions; cohorts force-log the COMMIT/ABORT action and ACK both outcomes.
- PrA: The coordinator never logs ABORT decisions; cohorts do not force-log the ABORT action and do not ACK ABORT messages.
- PrC: The coordinator additionally force-logs the PREPARE record; cohorts do not force-log the COMMIT action and do not ACK COMMIT messages.
At first glance, PrC might seem superior. If we assume that most transactions commit successfully, avoiding ACKs for commits appears advantageous. However, the assumption that most transactions commit is not always valid. Furthermore, PrC introduces an additional forced log write for the PREPARE message. The paper argues that PrA is often a better choice, and this summary highlights why.
New Presumed Commit (NPrC)
NPrC is a further optimization of PrC designed to reduce logging overhead, especially in the optimistic scenario where crashes are infrequent. It achieves this without affecting the correctness of the 2PC protocol.
Data Structures
- tidl: The lowest transaction ID that is not recorded. Transactions with IDs lower than tidl are presumed committed.
- tidh: The highest transaction ID that is not recorded. Transactions with IDs higher than tidh are new.
- REC: The set of recent transactions.
- COM: The subset of REC containing transactions that are known to be committed. Committed transactions have forced COMMIT logs, as in PrC.
- IN: The set of transactions for which nothing has been written to storage. If a crash occurs, all transactions in IN are presumed aborted.
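A rough Python sketch of this bookkeeping on the coordinator; the field names follow the list above and the structure is purely illustrative:
# Hypothetical NPrC coordinator bookkeeping.
from dataclasses import dataclass, field

@dataclass
class NPrCState:
    tidl: int = 0                               # lowest recent transaction ID not recorded
    tidh: int = 0                               # highest recent transaction ID not recorded
    rec: set[int] = field(default_factory=set)  # REC: the recent transactions
    com: set[int] = field(default_factory=set)  # COM: recent transactions with forced COMMIT logs

    def in_set(self):
        # IN: recent transactions with nothing on stable storage
        # (simplified here as REC minus COM).
        return self.rec - self.com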
Logging tidh
Logged periodically (and can be forced, as it's infrequent) as new transactions with higher IDs arrive.
Logging tidl
Logged when the transaction with the lowest ID in REC is committed. The actual logged value might be higher than the current tidl because other transactions in the range might have already committed.
- The cost of logging tidl can be amortized with the cost of logging the COMMIT for the transaction itself.
- If the transaction is ABORTed, tidl can be logged unforced after all ACKs are received. Since all ACKs are received, it means no one will ever query about the transaction again, so tidl can safely be advanced.
Crash Recovery
tidh is set to a very high value to avoid overlap with previous transaction IDs. IN is rebuilt using the known values of tidl, tidh, and the COM set. After this, recovery proceeds as follows, depending on the point at which the coordinator crashed (a small sketch follows this list):
- Before tidl is advanced: COM is used to determine whether the transaction actually committed.
- After tidl is advanced: The transaction is presumed committed.
- Before tidl is advanced past the ID of an aborted transaction: The transaction becomes part of IN upon recovery and is thus presumed as aborted.
- After tidl is advanced past the ID of an aborted transaction: All ACKs for the abort would have been received, and no one should query about the transaction again.
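Putting these rules together, here is a sketch of how a recovered NPrC coordinator might answer a cohort's status query, assuming the NPrCState sketch from above has been rebuilt from tidl, tidh, and COM:
# Hypothetical post-crash status query under NPrC.
def nprc_query(tid, state):
    if tid < state.tidl:
        return "COMMIT"    # tidl has advanced past it: presumed committed
                           # (aborted transactions below tidl are never queried again)
    if tid in state.com:
        return "COMMIT"    # a forced COMMIT record survived the crash
    return "ABORT"         # otherwise the transaction is in IN: presumed aborted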
Garbage Collection
- In PrA, abort decisions are not logged. If the coordinator fails before receiving votes, the transaction is considered aborted, and no garbage is generated.
- In PrC, if the coordinator fails before receiving votes, it has logged PREPARE messages for each transaction, allowing it to re-collect votes after recovery. No garbage is generated.
NPrC v/s PrC
After a crash, an NPrC coordinator might have more work to do than a PrC coordinator because it has less persistent information. However, crashes are assumed to be rare, so this extra work during recovery is generally not a significant performance bottleneck.
Read-Only Transactions
Serializable queries require read-only transactions (also known as snapshot transactions). These transactions acquire locks, but their outcome (commit or abort) is irrelevant to the cohort.
Cohorts respond to a PREPARE message with a READ-ONLY-VOTE. The coordinator, upon receiving only READ-ONLY-VOTEs from all cohorts, recognizes the transaction as read-only and omits sending COMMIT/ABORT messages.
In PrC, the coordinator doesn't need to log a COMMIT/ABORT message for a read-only transaction, but it must delete (unforced) the logged PREPARE message. NPrC offers further optimization: the PREPARE message is never logged, and the transaction is simply forgotten after all cohorts respond with READ-ONLY-VOTE. Crucially, no further communication about the transaction result occurs.
Serializability in 2PC
Violation in Read-Only Transactions
The most straightforward violation occurs with read-only transactions. Participants in a read-only transaction release their locks after sending their READ-ONLY-VOTE. This can break serializability because subsequent transactions might modify the data before the read-only transaction's effective commit point.
Essentially, the read-only transaction might read a snapshot of the data that's no longer consistent with the actual position of the transaction in the order.
Violation in Read-Write Transactions
Serializability requires some total order of all transactions. If a transaction T1 comes before another T2 in this order, T2 must see all of T1's effects. Conversely, if T2 comes before T1, T2 must not see any of T1's effects. 2PC can violate this.
Consider the bank database with accounts A, B, and C on nodes 1, 2, and 3. A has $100, B has $150, and C has $0. Two transactions, T1 (transfer $50 from A to B) and T2 (transfer $200 from B to C), are submitted, with T2 following T1.
- Transaction T1 is initiated.
- Nodes 1 and 2 accept T1 and acknowledge the PREPARE message.
- Node 2 commits T1 and releases its locks before Node 1 commits.
- Transaction T2 is initiated.
- Nodes 2 and 3 accept T2. Critically, Node 2 now reflects the updated balance of B (due to the partial commit of T1).
- T2 commits on both Nodes 2 and 3.
The problem is that T1 is still uncommitted on Node 1 (and hence only partially committed) when T2 reads the updated balance of B on Node 2. This amounts to read-uncommitted behavior, where T2 sees the effects of T1 before T1 has fully committed, violating serializability.
In essence, 2PC, in its basic form, doesn't prevent transactions from observing intermediate, inconsistent states of other transactions, thus failing to guarantee serializability and therefore strict serializability.
Solutions
Two primary approaches address the serializability concern:
- Coordinator Timestamp: The coordinator assigns commit timestamps. This approach requires that the coordinator never change; otherwise, timestamps may go out of order.
- Commit Timestamp: Each COMMIT-VOTE is assigned a timestamp range, and locks are released once the upper bound of that range is reached. This eliminates the need for the coordinator to track anything. The cohort's COMMIT-VOTE specifies a permissible timestamp range, ensuring no conflicting transactions fall within that range. A new transaction receives a timestamp range higher than those of all existing transactions (see the sketch after this list).
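Here is a tiny sketch of the second approach: each COMMIT-VOTE carries a permissible timestamp range, and the commit timestamp must fall in the intersection of all of them. It is purely illustrative; a real system would also have to abort or retry when the intersection is empty:
# Hypothetical commit-timestamp selection from per-cohort ranges.
def choose_commit_timestamp(ranges):
    # ranges: list of (low, high) pairs, one per COMMIT-VOTE.
    low = max(lo for lo, _ in ranges)
    high = min(hi for _, hi in ranges)
    if low > high:
        return None        # no common timestamp: the transaction must abort
    return low             # any value in [low, high] works; pick the earliest

# Example: three cohorts propose overlapping ranges.
print(choose_commit_timestamp([(10, 20), (12, 25), (8, 18)]))   # -> 12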
Bonus - Three-Phase Commits (3PC)
2PC has significant vulnerabilities. It's not resilient to coordinator failures, cohort failures, or network partitions. Any of these events can halt 2PC progress.
A key concern arises after the coordinator has already logged a COMMIT or ABORT decision. While the coordinator must communicate the outcome to the client as soon as possible, it faces a blocking problem. Before it can respond, it needs acknowledgements from all cohorts:
- PrN and PrA: ACKs from all cohorts for the commit request.
- PrC: ACKs from all cohorts for the abort request.
Indefinite Blocking
If a cohort crashes before the COMMIT phase, 2PC recovery is not possible until that cohort restarts. The coordinator can't finalize the transaction because it's waiting for confirmation from the unavailable cohort. The protocol is blocked indefinitely.
Three-Phase Commit (3PC)
3PC addresses this blocking issue by introducing a PRE-COMMIT phase between PREPARE and COMMIT.
- If the coordinator crashes before receiving PRE-COMMIT acks, the decision is always ABORT.
- If the coordinator has all PRE-COMMIT acks, the decision is always COMMIT.
This crucial change means that once the coordinator logs a COMMIT decision, it can safely commit regardless of cohort availability. Any cohort that misses the COMMIT message will eventually receive it.
The coordinator is no longer dependent on cohorts after logging the COMMIT decision, so it can immediately inform the client. Before logging the COMMIT decision, the decision is always ABORT. Therefore, cohort crashes no longer block the coordinator from responding to the client.
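A hedged sketch of the coordinator's side with the extra phase, reusing the style of the earlier 2PC sketch (the cohort methods and force_log helper are again illustrative assumptions):
# Hypothetical 3PC coordinator sketch.
def run_3pc(tid, cohorts, force_log):
    # Phase 1: PREPARE, exactly as in 2PC.
    if not all(c.prepare(tid) == "COMMIT-VOTE" for c in cohorts):
        force_log(("ABORT", tid))
        for c in cohorts:
            c.decide(tid, "ABORT")
        return "ABORT"

    # Phase 2: PRE-COMMIT. A crash before all acks arrive still means ABORT.
    if not all(c.pre_commit(tid) == "ACK" for c in cohorts):
        force_log(("ABORT", tid))
        for c in cohorts:
            c.decide(tid, "ABORT")
        return "ABORT"

    # Phase 3: all PRE-COMMIT acks are in, so the decision is COMMIT. It can be
    # logged and reported to the client without waiting on any cohort.
    force_log(("COMMIT", tid))
    for c in cohorts:
        c.decide(tid, "COMMIT")    # stragglers will eventually learn COMMIT
    return "COMMIT"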
Of course, 3PC adds extra round trips. It's often preferable to use 2PC with highly resilient cohort members (like Paxos groups) where possible, mitigating the risk of cohort failures.
Paper Review
I personally found this paper very difficult to read. It felt cramped at times, and I even suspected some errors. However, everything in the paper is correct. The challenge lies in understanding all aspects, even though the underlying concepts seem simple. Despite this, the paper serves as a valuable first step in understanding distributed SQL databases. I highly recommend it.