
Paper Insights #24 - Spanner: Google's Globally-Distributed Database

This landmark paper, presented at OSDI '12, has become one of Google's most significant contributions to distributed computing. It didn't solve the long-standing problem of making 2PC scale in distributed systems; rather, it introduced TrueTime, which revolutionized the assumptions distributed systems can make. Authored by J.C. Corbett et al., with contributions from pioneers like Jeff Dean and Sanjay Ghemawat, this paper effectively concluded my exploration of distributed SQL databases. It represents the leading edge of the field.

Paper Link

I would highly recommend reading the following before jumping into this article:

1. Practical Uses of Synchronized Clocks in Distributed Systems where I introduced why clock synchronization is necessary but not sufficient for external consistency.

2. A New Presumed Commit Optimization for Two Phase Commit, where I introduced two-phase commit (2PC) and how atomic commitment is achieved in a distributed system.

3. Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases where I introduced Amazon Aurora that builds a (not quite) distributed SQL database without using 2PC.

Let's begin with some basic concepts.

Databases

Databases generally fall into two categories:

  • SQL Databases: These adhere to ACID properties (Atomicity, Consistency, Isolation, Durability) to ensure reliable transactions.
  • NoSQL Databases: These often follow BASE principles (Basically Available, Soft state, Eventually consistent), prioritizing availability and scalability over strict consistency.

SQL Databases

SQL databases guarantee ACID properties, ensuring reliable data transactions:

  • Atomicity: Transactions are treated as a single, indivisible unit of work. Either all changes within a transaction are applied, or none are.
  • Consistency: A transaction maintains the integrity of the database by ensuring that all defined rules and constraints (like keys, values, and other constraints) are preserved.
  • Isolation: Transactions are executed independently of each other. One transaction's operations won't interfere with another's.
  • Durability: Once a transaction is committed, the changes are permanent and will survive even system failures.

Examples of popular SQL databases include MySQL, Microsoft SQL Server, Google Spanner (a distributed database), and Amazon Aurora (also a distributed database).

NoSQL Databases

NoSQL databases often prioritize availability and scalability over strict consistency, adhering to what's sometimes described as BASE principles:

  • Basically Available: NoSQL databases often favor availability over strong consistency. This means the system remains operational even if some data is temporarily inconsistent. Conflicts arising from concurrent operations are typically resolved later.
  • Soft State: The database's state isn't guaranteed to be consistent at any given moment. It's a "soft" state, meaning its accuracy is probabilistic.
  • Eventually Consistent: The database strives to achieve consistency eventually. Given enough time, all updates will propagate throughout the system, and all nodes will see the same data.

I have covered some NoSQL databases in earlier posts, such as Dynamo and Cassandra.

In this article, our focus will be on SQL databases only.

The CAP Theorem

Disclaimer: I strictly believe the FLP impossibility result is a more fundamental result than the commonly stated CAP theorem (see this blog post for more information). However, for clarity, here's a breakdown of the CAP theorem:

A distributed data store can only guarantee two out of the following three properties: Consistency, Availability, and Partition Tolerance. It's a trade-off, and you can't have all three simultaneously.

This leads to choices like:

  • Strong Consistency vs. Eventual Consistency: How strictly do you need data to be consistent across all nodes?
  • Availability: How important is it that the system remains operational and responsive, even during failures and partitions?
  • Partition Tolerance: How well does the system handle network partitions? This also involves considering the types of faults that cause these partitions.

ACID databases are categorized as CP (Consistent and Partition-tolerant). This means that in the event of a network partition, some shards of the database might become unavailable to maintain consistency. BASE databases, on the other hand, are AP (Available and Partition-tolerant). They prioritize availability during network partitions, even if it means temporarily sacrificing consistency.

Beyond CAP, other related theorems exist, such as:

  • PACELC: This theorem expands on CAP, stating that if there's a partition, you must choose between Availability (A) and Consistency (C), else (when there's no partition), you still have a trade-off between Latency (L) and Consistency (C).
  • Harvest-Yield: This is an older theorem that's less commonly discussed today.

Availability Numbers

System availability is often expressed using "nines". For instance:

  • Three 9s: 99.9% availability
  • Four 9s: 99.99% availability
  • Five 9s: 99.999% availability

This percentage represents the portion of queries expected to be answered without a system failure (though the response might not be correct or consistent).

Typically, availability increases with the geographic scope of the system. Local replication might offer lower availability, while regional and global systems aim for higher availability. This is because larger, geographically distributed systems have more resources and are less susceptible to correlated failures that could affect a smaller, localized system. For example, a system might provide 99.9% availability within a single cluster, 99.99% within an availability zone, and 99.999% globally.

SQL Databases

As previously mentioned, SQL databases must guarantee ACID properties. Atomicity and Consistency are crucial, and these are often achieved through transaction isolation.

Consider a bank database and a transfer transaction that moves funds from account A to account B. This transaction must exhibit the following properties:

  • Atomicity: The entire transfer must either complete successfully or fail entirely. The database (and any external systems) should never observe an intermediate state where the money has left account A but not yet arrived in account B.
  • Consistency: The total balance across all accounts must remain constant after any transaction. 

Transaction isolation is a key characteristic of SQL databases. It's also a concept that's often misunderstood, so we'll delve into it in more detail.

Transaction Isolation

Transaction isolation determines how and when the changes made by one transaction become visible to other concurrent transactions. Essentially, it defines the level of segregation between transactions.

The simplest way to ensure isolation is to execute transactions serially (one after another). However, this approach is often too slow for practical applications, so databases allow concurrent transactions.

Concurrency introduces the possibility of one transaction seeing the partial, non-atomic effects of another. Returning to our bank transfer example, say A has an initial balance of $100, B has an initial balance of $150, and $50 is transferred from account A to account B. The steps might look like this:

start 100014            # Step 1
write 100014 A $50      # Step 2
write 100014 B $200     # Step 3
commit 100014           # Step 4

These steps are also recorded in the transaction log. If another transaction runs concurrently, it could, after step 2, see that A's balance has been reduced, but B's balance hasn't yet been updated.

Several techniques are used to prevent this:
  • Locks: The first transaction could acquire an exclusive lock on account A before writing it and hold the lock until it commits. This prevents other transactions from reading A's value until the first transaction commits. Locks are managed using a two-phase locking protocol (not to be confused with the two-phase commit protocol).
  • Version Numbers: Data items A and B could carry version numbers. Every write creates a new version, and new versions become visible only when the transaction commits. The transaction can commit successfully only if the versions of A and B it read haven't been changed by another transaction in the meantime (a sketch of this check follows this list).
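
To make the version-number idea concrete, here is a minimal Python sketch of an optimistic commit check. The store layout and the transfer function are hypothetical and purely illustrative; in a real database the final check-and-install would be performed atomically at commit time.

# Each data item maps to a (value, version) pair; names and layout are hypothetical.
store = {"A": (100, 1), "B": (150, 1)}

def transfer(amount, src, dst):
    # Read both items and remember the versions that were observed.
    (src_val, src_ver), (dst_val, dst_ver) = store[src], store[dst]
    new_values = {src: src_val - amount, dst: dst_val + amount}

    # Commit: succeed only if no other transaction bumped the versions meanwhile.
    # (A real engine performs this check-and-install atomically.)
    if store[src][1] != src_ver or store[dst][1] != dst_ver:
        return False  # abort; the caller may retry
    for item, value in new_values.items():
        store[item] = (value, store[item][1] + 1)  # install new versions
    return True

print(transfer(50, "A", "B"))  # True; store is now {"A": (50, 2), "B": (200, 2)}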

Levels of Isolation

SQL databases offer different levels of transaction isolation, each with varying degrees of consistency and performance:

  • Serializable: This is the highest isolation level, guaranteeing that the effect of concurrent transactions is equivalent to some serial execution order (also known as multi-data sequential consistency). This makes reasoning about transactions straightforward for users.

  • Repeatable Reads: Similar to Serializable, but allows other transactions to insert new data items. This means that while a transaction sees a consistent snapshot of the existing data it reads, it might still observe newly inserted data items (phantom reads).

  • Read Committed: Allows a transaction to read changes made by other committed transactions while it's in progress. While this might seem reasonable, it can lead to inconsistencies. Imagine two transactions: T1 transferring $50 from A to B, and another T2 reading the balances of A and B. A serializable read would return either {A: $100, B: $150} or {A: $50, B: $200}, depending on the order of execution. However, a Read Committed read could return {A: $100, B: $200}. This can occur if the T2 reads A's balance ($100), then T1 commits, and then T2 reads B's balance ($200).

  • Read Uncommitted (Dirty Reads): The lowest isolation level, allowing a transaction to read uncommitted values from other transactions. This can lead to reading data modified by a transaction that is later rolled back, resulting in inconsistencies.

The mechanisms used by databases to achieve serializability at high concurrency are complex, involving concepts like serializable schedules, conflict serializability, view serializability, and recoverability. For the sake of brevity, I'll omit those details here.

Reads

In the context of database transactions, a "data item" refers to the smallest unit of data that can be locked and is guaranteed to be read and written atomically. Typically, a data item can be a row in a table.

SQL databases support various types of reads:

  • Single Data Item Read: Retrieves only one data item.
  • Multiple Data Item Read: Retrieves multiple data items.

A database read is considered consistent if the retrieved values reflect the state of the data items at some point in time.

For instance, if accounts A and B initially have balances of $100 and $150, respectively, and a transaction transfers $50 from A to B, a consistent read should return either {A: $100, B: $150} (before the transfer) or {A: $50, B: $200} (after the transfer).

Consistently Reading a Single Data Item

Reading a single data item consistently is straightforward. Regardless of when the query executes, it will retrieve a consistent value for that single item. This is supported by all isolation levels excluding read uncommitted.

Consistently Reading Multiple Data Items

Consistent reads across multiple data items can be achieved using read-only transactions. These transactions often acquire locks on all involved data items before reading, and release the locks afterward. This is only guaranteed by the serializable isolation level.

Distributed SQL Databases

Distributed SQL databases are those where data is sharded across multiple nodes. Transaction execution within these systems is similar to how it works in standalone databases:

  1. Locking: The necessary data items are locked on each node involved in the transaction.
  2. Commit: The transaction is then committed. However, the commit process is more complex in a distributed setting, requiring a two-phase commit (2PC) protocol.

In our bank database example, if data items A and B are on node 1 and node 2, then the transfer transaction would be recorded as follows:

Node 1

start 100014
write 100014 A $50
commit 100014

Node 2

start 100014
write 100014 B $200
commit 100014

Serializability v/s Strict Serializability

This is a crucial section of this article.

Serializability ensures that the effects of all transactions are equivalent to some sequential execution order (also known as sequential consistency). Strict serializability (or external consistency) builds upon this by considering the temporal order of transactions.

A helpful analogy (attributed to Dr. D. Mazieres) clarifies the distinction:

Imagine you successfully commit a transaction transferring money from your account to a friend's account. You then immediately call your friend to check their balance. If they don't see the transferred funds, the database violates strict serializability (external consistency).

However, the database could still maintain internal consistency (serializability). This means the final database state reflects some valid sequential order of transactions, even if it doesn't align with the real-time order of events. For example, if your friend later tries to withdraw $200, the success or failure of that withdrawal will depend on whether the database ordered their transaction after or before your transfer transaction, even if, in reality, your friend's request came after your transfer.

Another way to explain strict serializability is that the database's transaction ordering must respect the temporal semantics observed by external users. Formally, if transaction T2 is submitted by a client only after transaction T1 commits, then the database must order T2 after T1.

Strict Serializability in Standalone Databases

In serializable mode, SQL databases can guarantee serializability because all data resides within the database's boundaries. It's easy to implement locking mechanisms to ensure a transaction completes before the next one begins.

Even strict serializability is straightforward to achieve. Since all operations occur within the same process, the effect of a committed transaction is immediately visible to the next transaction.

Strict Serializability in Distributed Key-Value Systems

Data stores that operate on single data items (such as key-value stores, distributed file systems, and lock services like ZooKeeper) primarily rely on a distributed log to coordinate changes to the state of those items.

Recommended Read: ZooKeeper: Wait-free coordination for Internet-scale systems

With single data items, strict serializability effectively becomes linearizability. The distributed log itself serves as the single source of truth for the ordering of transactions. Once a transaction is committed to the immutable distributed log, its effects become immediately visible to subsequent transactions.

All writes are linearizable because they must go through the leader. Even during leader changes, the new leader must catch up on all existing committed transactions in the distributed log before accepting new transactions. This ensures that the new leader's writes respect the prior ordering.

Linearizable reads are achieved by executing a read-only transaction at the leader. The leader guarantees that a transaction is acknowledged only after it's committed to the distributed log. Therefore, all read-only transactions through the leader see the latest state of the data store (because they are causally related to and ordered after previous transactions, as they go through the same leader).

It's important to note that relaxed reads (i.e., reads from a non-leader replica) might not be linearizable.

Strict Serializability in Aurora

Amazon Aurora achieves strict serializability because its transaction processing occurs within a single, non-distributed instance. This single instance handles all transactions and provides the latest committed value at any given point in time. This architecture is similar to a standalone database, making serializability, and even strict serializability, easy to implement.

However, as I've argued before, this architecture also means that Amazon Aurora isn't a truly distributed SQL database. All transactions must pass through a single primary instance, which can easily become a performance bottleneck.

Strict Serializability in Distributed SQL Database with 2PC

In a true distributed SQL database, data items are distributed across multiple participants, and these participants must coordinate commits using a protocol like two-phase commit (2PC). Each participant might itself be a Paxos group, composed of multiple nodes maintaining a distributed log for fault tolerance.

However, 2PC alone does not guarantee serializability. I've illustrated this with an example in a previous post, which I'll reproduce here. Consider the bank database with accounts A, B, and C on nodes 1, 2, and 3. A has $100, B has $150, and C has $0. Two transactions, T1 (transfer $50 from A to B) and T2 (transfer $200 from B to C), are submitted, with T2 following T1.

  1. Transaction T1 is initiated.
  2. Nodes 1 and 2 accept T1 and acknowledge the PREPARE message.
  3. Node 2 commits T1 and releases its locks before Node 1 commits.
  4. Transaction T2 is initiated.
  5. Nodes 2 and 3 accept T2. Critically, Node 2 now reflects the updated balance of B (due to the partial commit of T1).
  6. T2 commits on both Nodes 2 and 3.

The problem is that T1 is still uncommitted on Node 1 (and hence only partially committed) when T2 reads the updated balance of B on Node 2. This creates a read uncommitted isolation level, where T2 sees the effects of T1 before T1 is fully committed, violating serializability.

Solution?

The solution to achieving both serializability and strict serializability in a distributed SQL database involves establishing a globally consistent and externally consistent ordering of transactions. This is accomplished by timestamping every transaction upon entry into the system. When a transaction is received by any cohort, it's guaranteed to receive a timestamp greater than all previous transaction timestamps. Furthermore, all transactions will be ordered according to these timestamps.

However, generating such timestamps isn't trivial. It demands tight clock synchronization across all participating nodes. This is where Google's TrueTime comes into play.

TrueTime

Google's TrueTime revolutionized the generation of globally ordered timestamps, representing a significant breakthrough in distributed systems and fundamentally altering common assumptions. This innovation is what made the Spanner paper a landmark contribution to the field.

As Lamport highlighted in another seminal paper, achieving perfect global clock synchronization across all machines is impossible. Clocks can drift (running fast or slow), become completely out of sync, or be nearly perfect but still insufficient for ensuring system correctness. Consequently, most distributed system algorithms avoid relying on clock synchronization for correctness. I say "most" because some algorithms do use clock synchronization to simplify assumptions; without such simplifications, certain problems (like consensus) would be impossible to solve (see the FLP impossibility).

While clocks can drift, the error bound can be limited. When we use a Now() API to read the time, it might return a time t that is at most ε away from the true global time; the actual time lies within the interval [t - ε, t + ε]. TrueTime leverages this concept, but on a scale that makes the error bound ε extremely small at a global level.

API

TrueTime offers a simple and intuitive API (a hallmark of well-designed systems):
  • TT.now(): Returns a TTInterval object, representing a time interval: [earliest, latest].
  • TT.after(t): Returns true if time t has definitely passed.
  • TT.before(t): Returns true if time t has definitely not yet arrived.
The absolute time assigned to an event e, denoted t_abs(e), falls within the interval returned by TT.now() when invoked at the time of the event (e_now):

tt.earliest <= t_abs(e_now) <= tt.latest

The interval itself is bounded by 2 * ε.


TT.after(..) and TT.before(..) are essentially wrappers around TT.now().
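
To make the API concrete, here is a small Python sketch of how the three calls relate to one another. The fixed EPS error bound and the function names are assumptions for illustration; this mirrors the semantics described above, not Google's implementation.

import time
from dataclasses import dataclass

EPS = 0.004  # assumed clock error bound in seconds (hypothetical value)

@dataclass
class TTInterval:
    earliest: float
    latest: float

def tt_now() -> TTInterval:
    # The true absolute time is guaranteed to lie within [earliest, latest].
    t = time.time()
    return TTInterval(t - EPS, t + EPS)

def tt_after(t: float) -> bool:
    # True only if t has definitely passed.
    return t < tt_now().earliest

def tt_before(t: float) -> bool:
    # True only if t has definitely not yet arrived.
    return t > tt_now().latest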

Architecture

Google has invested heavily in achieving and maintaining the tight error bounds required by TrueTime. 

Recommended Read: Practical Uses of Synchronized Clocks in Distributed Systems, where I discuss how NTP works.

TrueTime relies on GPS and atomic clocks as its time references. While GPS provides highly accurate time, it's susceptible to hardware failures and radio interference. Atomic clocks, while very stable, can also drift due to frequency errors. These potential sources of error contribute to the overall error bound.

Each data center has one or more time servers (or "time masters"). These masters fall into two categories:

  • GPS Time Masters: Equipped with GPS receivers, these nodes receive time information directly from satellites.
  • Armageddon Masters: Equipped with local atomic clocks, these nodes serve as a backup to GPS time masters in case satellite connections are unavailable.

Within each data center:

  • There's at least one time master.
  • Each machine runs a time slave daemon (the terminology comes from the original paper which is a bit old).

The masters synchronize their time references with each other. Armageddon masters advertise a slowly increasing time uncertainty, while GPS masters advertise a much smaller uncertainty (close to ±40ns).

The time slave daemons poll multiple masters to mitigate vulnerabilities. Each master provides a time interval, and these intervals must be merged to determine the largest common interval.

Marzullo's algorithm is used for this interval intersection. It's a classic algorithm for finding the largest intersection of multiple intervals, which is exactly what's needed here. Any intervals that don't intersect with the largest common interval are discarded.
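
Below is a simplified Python sketch of the interval-intersection idea, a stripped-down variant of Marzullo's algorithm that finds the region covered by the largest number of the masters' intervals. The real daemon logic is more involved (it also rejects "liars" and applies drift corrections), so treat this as illustrative only.

def max_overlap_region(intervals):
    """Return the (start, end) region covered by the most intervals.

    `intervals` is a list of (earliest, latest) pairs reported by time masters.
    """
    events = []
    for lo, hi in intervals:
        events.append((lo, +1))   # interval opens
        events.append((hi, -1))   # interval closes
    events.sort()  # at equal points, closings (-1) are processed first

    # Pass 1: the maximum number of simultaneously overlapping intervals.
    count = best = 0
    for _, delta in events:
        count += delta
        best = max(best, count)

    # Pass 2: the first region where that maximum overlap is reached.
    count, start = 0, None
    for point, delta in events:
        count += delta
        if count == best and start is None:
            start = point
        elif start is not None and count < best:
            return (start, point)
    return None

print(max_overlap_region([(1.0, 5.0), (2.0, 6.0), (10.0, 11.0)]))  # (2.0, 5.0)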

Calculating ε

Each daemon advertises a slowly increasing clock uncertainty, derived from a conservative estimate of worst-case local clock drift.

The daemon polls time masters only every 30 seconds, assuming a maximum drift rate of 200μs per second. This polling interval contributes approximately 6ms to the uncertainty. An additional 1ms is added to account for communication delays to the time masters. Thus, ε is typically between 1 and 7ms.
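
The arithmetic behind these numbers is straightforward; the small sketch below just plugs in the figures quoted above.

# Figures quoted above: worst-case drift, polling interval, and network allowance.
drift_rate = 200e-6      # 200 microseconds of drift per second of wall time
poll_interval = 30       # seconds between polls of the time masters
comm_delay = 1e-3        # ~1 ms allowance for communication with the masters

# Epsilon follows a sawtooth: smallest right after a poll, largest just before the next.
eps_min = comm_delay                               # ~1 ms
eps_max = comm_delay + drift_rate * poll_interval  # ~7 ms
print(eps_min, eps_max)  # 0.001 0.007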

With that understanding of TrueTime, let's now turn our attention to Google Spanner.

Spanner

Spanner was developed to replace Google's Megastore and the manually sharded MySQL deployment behind F1 (Google's advertising backend); neither offered both scalability and the characteristics of a true distributed SQL database. Spanner was designed to offer:

  • Geographic replication for global availability.
  • Massive scalability, handling tens of terabytes of data.
  • External consistency.

When Spanner was created, no other distributed database offered external consistency (although the theoretical underpinnings were understood). Google assembled a team of scientists and engineers to make this groundbreaking system a reality.

The Architecture

Spanner is a distributed SQL database that stores data in a distributed B+ tree structure. However, its architecture has some nuances.

A database resides within a universe. Within a database, tables exist. Spanner implements the following mapping within each table:

(key: string, timestamp: int64) -> string

Each key has an associated timestamp, representing the transaction time when it was created or last modified. This timestamp is crucial for Spanner's multi-version concurrency control mechanism, which we'll discuss later.
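
The mapping is easy to picture as a multi-versioned map. The Python sketch below (with hypothetical keys and timestamps) shows how a read at a given timestamp picks the latest version at or before that timestamp, which is the basis for the lock-free snapshot reads discussed later.

# Each key maps to a list of (commit_timestamp, value) versions.
versions = {
    "accounts/A": [(100, "$100"), (250, "$50")],
    "accounts/B": [(100, "$150"), (250, "$200")],
}

def read_at(key, ts):
    """Return the value of `key` as of timestamp `ts` (latest version <= ts)."""
    candidates = [(t, v) for t, v in versions[key] if t <= ts]
    return max(candidates)[1] if candidates else None

print(read_at("accounts/A", 200))  # "$100" -- snapshot before the transfer
print(read_at("accounts/A", 300))  # "$50"  -- snapshot after the transfer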

This structure might resemble a key-value store, but Spanner is not simply a key-value store. It fully supports SQL semantics. Those familiar with Google Bigtable might find Spanner's organization somewhat similar. It's plausible that Spanner was built upon Bigtable technology with the long-term goal of eventually replacing it.

The keyspace is partitioned into directories, each with its own replication and placement policy. For instance, a geographic region could be incorporated into the key, and the keyspace could be divided into directories, each corresponding to the keys for a specific region. These directories could then be placed in the appropriate regions. Importantly, a directory contains a contiguous range of keys from the overall keyspace.

The keyspace is subdivided into tablets. A tablet is the fundamental unit of storage in Spanner. A tablet can contain one or more directories. Spanner co-locates frequently accessed directories within the same tablet for performance optimization.

Interleaved Tables

Tables within a Spanner database can be interleaved, meaning one table can be designated as a child of another. The top-level table in this hierarchy is the directory table. Child tables share a key prefix with their parent table.

Here is an example of a Customers directory table with an interleaved Orders child table, written in a schema language similar to the paper's:

CREATE TABLE Customers {
  CustomerID INT64 NOT NULL,
  Name STRING,
  ...
} PRIMARY KEY (CustomerID), DIRECTORY;

CREATE TABLE Orders {
  CustomerID INT64 NOT NULL,  # Shared key prefix with Customers
  OrderID INT64 NOT NULL,
  OrderDate DATE,
  ...
} PRIMARY KEY (CustomerID, OrderID),
  INTERLEAVE IN PARENT Customers ON DELETE CASCADE;

Table interleaving is beneficial because it allows related rows across different tables to be stored within the same directory, improving data locality and query performance.

Span Servers

Servers in Spanner are grouped into zones, which are similar to physically isolated data centers. Each zone contains hundreds or thousands of spanservers, and each spanserver can host between 100 and 1000 tablets.

A zone master is responsible for assigning tablets to spanservers. The paper has some other details on the universe master and the placement driver that I would skip for brevity.

Paxos Groups and Transaction Manager

A tablet isn't stored on a single machine. It's replicated according to the directory's replication policy. The set of machines holding a particular tablet forms a Paxos state machine, responsible for that tablet's data. The set of replicas within a Paxos state machine is called a Paxos group.

Directories within a tablet can be moved between Paxos groups using the movedir background task. movedir moves the bulk of the data in the background and then uses a transaction to atomically move the small remaining amount and update the metadata of the two groups, so the move doesn't block ongoing operations.

Lock Table

The Paxos group's leader stores the lock table, which is used during transaction processing. The lock table maps key ranges to lock states. Implicitly, locks are held on a range of keys. The other Paxos group members participate in consensus and log transaction writes and commits.

Transactions within Paxos Group

When a transaction stays within a single Paxos group (i.e., all read/modified data items reside within that group), 2PC is unnecessary. The Paxos group's distributed log and the leader's lock table are sufficient to manage the transaction because all replicas within a Paxos group have the same data.

Locks are acquired at the transaction's start, modifications are logged to Paxos, and then a COMMIT log entry is added.
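
A minimal Python sketch of that single-group flow is shown below; the lock table and Paxos log are stand-ins (a dict and a list), so this is illustrative rather than a faithful implementation.

def execute_single_group_txn(lock_table, paxos_log, txn_id, writes):
    # 1. Acquire locks for the affected keys from the leader's lock table.
    for key in writes:
        lock_table[key] = txn_id

    # 2. Replicate the writes through the group's Paxos log.
    for key, value in writes.items():
        paxos_log.append(("WRITE", txn_id, key, value))

    # 3. Replicate the commit record; the transaction is durable once a
    #    majority of replicas have accepted this entry.
    paxos_log.append(("COMMIT", txn_id))

    # 4. Release the locks.
    for key in writes:
        del lock_table[key]

locks, log = {}, []
execute_single_group_txn(locks, log, txn_id=100014, writes={"A": "$50", "B": "$200"})
print(log)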

Transactions Across Paxos Groups

When a transaction spans multiple Paxos groups, 2PC is required. These Paxos state machines provide highly available cohorts for 2PC. (See my previous article on the differences between Paxos and 2PC.) For 2PC, spanservers include a component called the transaction manager.

The leader of each Paxos group participating in 2PC is called a participant leader, while the other replicas within that group are participant slaves. One of the Paxos groups is chosen as the coordinator: its leader becomes the coordinator leader, and the other replicas of that group become the coordinator slaves. (Apologies for the dated terminology; it comes from the original paper.)

Paxos Read-Only Transactions

Within a Paxos group, read-only transactions go through the leader to preserve serializability: the leader orders the read after all previously committed transactions. Without further machinery, the leader must also confirm with a majority of replicas that it is still the leader before answering the read (e.g., to guard against a view change). This quorum overhead can be avoided by serving reads at the leader alone, but doing so safely requires strong clock synchronization. Spanner uses TrueTime to make this possible.

To understand more about how clock synchronization helps, refer to my previous post on clock synchronization, where I explained how Harp relies on reads-from-primary, which require clock synchronization for external consistency.

Witness Nodes

Witness replicas are a special type of replica. They play a crucial role in ensuring high availability and fault tolerance. Unlike read-write and read-only replicas, witness replicas do not maintain a full copy of the data, nor do they serve read requests.

Witness replicas do participate in voting for write commits. They contribute to the quorum required for successful writes, ensuring data consistency and durability. Witness replicas also participate in leader election, but they are not eligible to become the leader replica.

Transaction Processing

As mentioned earlier, implementing 2PC requires globally consistent, monotonically increasing timestamps for transactions. This is where TrueTime becomes essential. Let's examine how Spanner assigns timestamps to transactions.

Spanner supports two main transaction types:

  • Read-write (R-W) transactions: These are the standard transactions that use locks (pessimistic concurrency control).
  • Read-only (R-O) transactions: Because transactions have globally consistent timestamps and the data store is versioned, read-only transactions can run lock-free. We'll explore how this works shortly.

Spanner also offers another type of read called a snapshot read with bounded staleness.

Paxos Leader Leases

Spanner optimizes read-only transactions by having them typically execute only at the Paxos group leader. This is straightforward if the leader remains constant, but a leader change during a transaction could result in inconsistent values. Still, single-node reads at the leader significantly improve performance and are therefore desirable.

Spanner achieves this optimization using leader leases (similar to Harp, but with a provable accuracy bound from TrueTime). The leader obtains a lease from the replicas, guaranteeing its leadership for the lease duration. These leases depend on accurate clock synchronization, which Spanner provides via TrueTime.

The lease mechanism is detailed in Appendix A of the paper. Here's a simplified explanation:

v_leader(i, r): The earliest time at which a vote (lease) request was sent by potential leader i to replica r.
t_end(i, r): The time until which replica r grants a lease to i, computed as TT.now().latest + 10s at the replica. Replica r won't grant another vote until TT.after(t_end(i, r)) is true.

The new leader's lease (after receiving a majority of votes) is valid for the interval [TT.now().latest, min_r(v_leader(i, r)) + 10s]. This ensures the leader's assumed lease expiration is always earlier than the actual expiration at any voter, preventing concurrent leader elections (the single-vote property). With guaranteed single leadership (leader disjointness), read-only transactions can safely execute at the leader only.

If a transaction with timestamp T is accepted, T must fall within the leader's lease interval. Similarly, the leader relinquishes its lease only after s_max has passed, where s_max is the maximum timestamp of any transaction processed by that leader.
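
Reusing the hypothetical tt_now() helper from the TrueTime sketch earlier, the lease bookkeeping on both sides can be sketched as follows (names and structures are assumptions):

LEASE_LENGTH = 10.0  # seconds

def grant_vote(vote_expirations, candidate_id):
    # Replica side: record until when this vote (lease) is valid; no other
    # vote may be granted until TT.after(t_end) is true.
    t_end = tt_now().latest + LEASE_LENGTH
    vote_expirations[candidate_id] = t_end
    return t_end

def lease_interval(vote_send_times):
    # Candidate side, after a majority of votes. vote_send_times holds the
    # earliest time the vote request was sent to each voter, i.e. v_leader(i, r).
    lease_start = tt_now().latest
    lease_end = min(vote_send_times) + LEASE_LENGTH
    # lease_end is never later than the true expiration at any voter, so two
    # leaders' leases can never overlap (leader disjointness).
    return lease_start, lease_end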

Assigning Timestamps

Consistent timestamp assignment during 2PC is crucial for Spanner's external consistency. Transactions arrive at a coordinator, which is also a Paxos leader. Because there's only one Paxos leader at any given time (leader disjointness), and that leader's timestamp assignments are constrained to its lease interval, the leader can assign timestamps that are greater than all previously assigned transaction timestamps.

Spanner's external consistency guarantee is: If transaction T2 starts after transaction T1 commits, then T2's commit timestamp must be greater than T1's. This ensures that T2 is ordered after T1, maintaining external consistency. For example, if you execute T1 and then tell a friend, they can subsequently execute T2 and observe the effect of T1.

Formally:

Let e_i^start and e_i^commit be the start and commit events of transaction T_i, and let s_i be its assigned timestamp. Then:

t_abs(e_1^commit) < t_abs(e_2^start)  =>  s_1 < s_2

Spanner uses TrueTime to assign timestamps: each commit timestamp is set to (at least) TT.now().latest at the time of assignment. However, this alone isn't sufficient.

Commit Wait: After assigning a timestamp s_i, all 2PC participants wait until TT.after(s_i) is true. This guarantees that any participant invoking TT.now().latest for a subsequent transaction will receive a timestamp greater than s_i. Therefore, all newer transactions will have higher timestamps.
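
Commit wait itself is just a loop on TT.after. A tiny sketch, reusing the hypothetical tt_after() helper from the earlier TrueTime sketch:

import time

def commit_wait(commit_ts):
    # Block until commit_ts has definitely passed on every clock in the
    # system, i.e., until TT.after(commit_ts) holds. Only then is the commit
    # acknowledged and its effects made visible.
    while not tt_after(commit_ts):
        time.sleep(0.0001)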

This proof is detailed in Section 4.1.2 of the paper. It relies on the assumption that clients only send a second transaction after observing the commit of the first.

This timestamping mechanism applies to both read-only and read-write transactions.

Executing a R-W Transaction

Let's walk through an example of a read-write transaction in Spanner involving three data items, A, B, and C, residing in different Paxos groups. The transaction is as follows:

txn = new Transaction()
a = txn.Read(A)
b = txn.Read(B)
txn.Write(C, a + b)
txn.Commit()




  1. txn.Read(A): The Paxos group leader for A returns the value of A and read-locks A to prevent concurrent modifications.
  2. txn.Read(B): The Paxos group leader for B read-locks B and returns its value.
  3. txn.Write(C, a + b): The client buffers the write to C locally. Spanner doesn't support read-your-writes (RYW) within a transaction, so writes are simply buffered at the client until commit.
  4. txn.Commit(): The client initiates the commit. Any of the Paxos groups (A, B, or C) can act as the coordinator. Let's assume the Paxos group for C becomes the coordinator. The Write(C) and Commit() requests are sent to C's Paxos group.
  5. Prepare Phase: C's Paxos leader locks C, logs the new value, and sends a PREPARE message to the leaders of A and B's Paxos groups (these are the cohorts in the 2PC).
  6. Ack: The leaders for A and B acquire the required locks, log a PREPARE record through their Paxos groups, and reply with their prepare timestamps (TT.now().latest).
  7. Timestamp Assignment: The leader at C determines the maximum of all received prepare timestamps (including its own TT.now().latest) and assigns this as the transaction's commit timestamp. This timestamp is guaranteed to be greater than any timestamp assigned to a previous transaction affecting the same data items. (A sketch of the coordinator-side logic follows these steps.)
  8. Commit Wait: The leader at C waits until TT.after(s) is true, where s is the assigned commit timestamp.
  9. Commit Phase: C sends the COMMIT message to A and B, releasing all locks. A and B also release their locks.
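
The coordinator-side logic from steps 5-9 can be condensed into the following Python sketch, reusing the hypothetical tt_now() and commit_wait() helpers from earlier; the participant objects and their prepare/commit methods are assumptions, and Paxos replication and error handling are omitted.

def coordinate_commit(coordinator, participants, txn):
    # Prepare phase: each participant leader acquires locks, logs a PREPARE
    # record through its Paxos group, and replies with a prepare timestamp.
    prepare_ts = [p.prepare(txn) for p in participants]  # each ~ TT.now().latest

    # Commit timestamp: no smaller than every prepare timestamp, the
    # coordinator's own TT.now().latest, and anything it assigned previously.
    s = max(prepare_ts + [tt_now().latest, coordinator.last_assigned_ts])
    coordinator.last_assigned_ts = s

    # Commit wait: don't expose the commit until s has definitely passed.
    commit_wait(s)

    # Commit phase: log the outcome, then tell every participant to apply its
    # buffered writes at timestamp s and release its locks.
    coordinator.log_commit(txn, s)
    for p in participants:
        p.commit(txn, s)
    return s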

FAQs:

Q: What happens if two orthogonal transactions are submitted by two different clients at the same time?

A: They might receive the same timestamp (even down to nanosecond precision). Ties may be broken arbitrarily. However, this doesn't affect external consistency, as neither transaction can have committed before the other began.

Q: What if the leader of any Paxos group goes down during 2PC after PREPARE reply?

A: The PREPARE record has already been logged through Paxos, so the transaction's state survives the leader change and the COMMIT can still be logged. In general, when a new leader comes up, it temporarily blocks new transactions until all in-flight transactions have been resolved according to their timestamp order. This avoids having to log LOCK records for reads to Paxos. As noted in the paper, this approach relies on long-lived Paxos leaders for efficiency.

Q: What if, between lock(A) succeeding and lock(B) succeeding, B is modified by another transaction?

A: Nothing. The transaction that modified B will be ordered before the current transaction. Neither serializability nor strict serializability is impacted.

Deadlock Prevention

Because lock acquisition during individual transactions is unsynchronized, deadlocks are possible.

Spanner implements transactions as sessions between clients and the leaders of the involved Paxos groups. Two deadlock prevention schemes are considered:

  • Wound-Wait: On a lock conflict, an older transaction "wounds" (aborts) a younger transaction holding the lock, while a younger requester simply waits for an older holder. Since commit timestamps haven't been assigned yet at this stage, age is determined by transaction start time (which may be provided by the client).

  • Wait-Die: On a lock conflict, an older requester waits for the younger lock holder to complete, while a younger requester aborts ("dies") and retries later. This avoids preemption but can lead to more aborts and rollbacks.

Spanner uses the wound-wait algorithm to prevent deadlocks.
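
A toy Python sketch of the wound-wait decision on a lock conflict (ages here are just transaction start times; names are hypothetical):

def on_lock_conflict(requester_start_ts, holder_start_ts):
    """Wound-wait: an older transaction never waits behind a younger one."""
    if requester_start_ts < holder_start_ts:
        return "wound"  # abort the younger lock holder; the requester proceeds
    return "wait"       # the younger requester waits for the older holder

print(on_lock_conflict(10, 20))  # "wound"
print(on_lock_conflict(20, 10))  # "wait"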

Executing a R-O transaction

Read-only transactions in Spanner are highly optimized. Because all transactions receive timestamps and Spanner stores versioned data, read-only transactions don't require locks. Once a timestamp is assigned to a read-only transaction, it can execute at any sufficiently up-to-date Paxos group replica. The read values will reflect a consistent snapshot at the assigned timestamp. This makes read-only transactions exceptionally fast and non-failing.

To determine which replicas can handle a read-only transaction with timestamp t, each replica tracks a "safe time" (t_safe), representing the highest timestamp up to which the replica's data is consistent. A replica can serve a read if t <= t_safe.

t_safe is calculated as the minimum of two values:
  • t_safe^Paxos: The highest timestamp of a transaction committed within the Paxos group.
  • t_safe^TM: This accounts for pending transactions. If there are no pending transactions, t_safe^TM is infinity, meaning the replica is safe to read from as long as t_safe^Paxos is greater than t. If there are pending transactions, t_safe^TM is calculated as:
t_safe^TM = min_i(s_i^prepare) - 1

Where s_i^prepare is the timestamp of the PREPARE message for pending transaction i. Since a pending transaction will be assigned a commit timestamp that is at least the maximum of all cohort prepare timestamps, we take the minimum prepare timestamp across all pending transactions and set t_safe^TM just below that value.
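
In code, the safe-time rule could be sketched like this (the inputs are hypothetical stand-ins for per-replica state):

def t_safe(highest_applied_commit_ts, pending_prepare_timestamps):
    # t_safe^Paxos: highest timestamp applied through the Paxos state machine.
    t_paxos = highest_applied_commit_ts
    # t_safe^TM: one less than the earliest prepare timestamp among pending
    # 2PC transactions; infinity when nothing is pending.
    t_tm = (min(pending_prepare_timestamps) - 1
            if pending_prepare_timestamps else float("inf"))
    return min(t_paxos, t_tm)

def can_serve_read(read_ts, highest_applied_commit_ts, pending_prepare_timestamps):
    # A replica may serve a snapshot read at read_ts only if read_ts <= t_safe.
    return read_ts <= t_safe(highest_applied_commit_ts, pending_prepare_timestamps)

print(t_safe(300, []))     # 300 (no pending 2PC transactions)
print(t_safe(300, [250]))  # 249 (a prepared-but-uncommitted txn holds t_safe back)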

For reads within a single Paxos group, the read-only transaction timestamp could simply be set to the timestamp of the last committed read-write transaction at that group (LastTS()). For reads spanning multiple groups, the timestamp could be negotiated with all involved groups based on their latest committed timestamps. However, Spanner avoids these approaches by simply setting s_read to TT.now().latest. While this simplifies the process, it might introduce some delay, as the read may have to wait for all transactions up to s_read to commit.

Section 4.2.4 of the paper details further refinements, focusing on efficiently determining the latest timestamp. The core idea is fine-grained tracking of PREPARE messages and the key ranges affected by transactions. This allows the system to exclude non-conflicting transactions when determining the timestamp for read-only transactions.

Executing Reads at a given timestamp

Spanner's versioned data and timestamped reads enable a powerful feature: snapshot reads. These allow reading the database as it existed at a specific point in time, even in the past. Critically, snapshot reads still provide a consistent view of all data items read at that point in time. This is equivalent to snapshot isolation. SQL queries often leverage snapshot reads (at a given timestamp) to retrieve consistent data without the overhead of locking.

Schema Changes

Spanner's globally ordered timestamps simplify schema changes. A schema change is treated like any other operation and is assigned a timestamp. Each Spanner server then applies the schema change independently. Reads and writes against the new schema are only allowed after the schema change has been applied locally and TT.after(s) is true (where s is the schema change timestamp).

Importantly, schema changes are not implemented as a massive distributed transaction across all Spanner servers. Such a large transaction would be impractical to complete and could potentially lock the entire database.

Evaluation

  • In experiments with a single replica per Paxos group, commit wait time was approximately 5ms, and Paxos latency was about 9ms. As the number of replicas increased, latency remained relatively constant, but with a lower standard deviation.
  • Killing Paxos leaders resulted in a 10-second wait for a new leader to be elected. The Spanner paper emphasizes the importance of Paxos leader stability for higher availability. Shorter lease times could help in this scenario. In the event of leader failure, restarting the leader is generally preferred.
  • For TrueTime, the error bound ε is roughly less than 4ms (p99) and 10ms (p999).

F1, Google's advertising backend, was previously built on manually sharded, master-slave replicated MySQL and was migrated onto Spanner. F1 maintains a history of all changes made to its data, and this change history is itself written back into Spanner.

The paper provides further internal details on how directories are also fragmented.

Spanner v/s CAP Theorem

Let's revisit the CAP theorem and examine how Spanner fits within its constraints. The information in this section is derived from Eric Brewer's follow-up paper on Spanner, TrueTime, and the CAP theorem.

Spanner, with its highly available Paxos groups for transaction management, might appear to violate the CAP theorem. However, it doesn't. Spanner prioritizes consistency. Its high availability is achieved at a significant maintenance cost borne by Google. Technically, it's still classified as a CP system. Spanner offers five 9s of availability, a very attractive proposition.

Network Partitions

Network partitions, which can lead to availability loss, account for a relatively small percentage (8%) of Spanner's outages. Common network issues include:

  • Individual data centers becoming disconnected.
  • Misconfigured bandwidth.
  • Network hardware failures.
  • One-way traffic failures.

Google's extensive experience in network engineering allows them to build a highly reliable network. Spanner's network leverages Google-owned infrastructure, including the software-defined B4 network. Google's control over packet flow and routing enhances its resilience against network partitions.

Handling Crashes

  • 2PC itself isn't crash-tolerant. If a participant fails, committing or aborting a transaction becomes challenging. This is an inherent limitation of 2PC. Spanner mitigates this somewhat by using Paxos groups for participants, ensuring high availability.
  • Furthermore, during Paxos leader changes, a Spanner group essentially freezes until the lease expires. The paper suggests restarting the failed leader as a better option than waiting for lease expiration.

Handling Network Partitions

2PC is not partition-tolerant; it cannot complete during a partition. 

Even during partitions, Spanner transactions can proceed if the leader and a majority of a Paxos group remain on the same side of the partition.

Reads are generally more resilient. They can be performed at any replica that's caught up to the transaction's commit timestamp. Snapshot reads offer even greater robustness. However, if none of the replicas are sufficiently up-to-date (e.g., due to being on the non-majority side of a partition), reads will eventually fail.

TrueTime v/s CAP Theorem

Spanner's functionality heavily relies on TrueTime's accuracy and availability:

  • Transaction timestamps: TrueTime provides the timestamps assigned to transactions.
  • Paxos leader leases: Paxos group leader leases depend on TrueTime.
  • Snapshot isolation: TrueTime's ability to generate consistent, ordered timestamps is fundamental to Spanner's versioning mechanism, which makes snapshot isolation possible. Without TrueTime, identifying data objects and ensuring a consistent snapshot (without partially applied updates) would be extremely difficult.
  • Schema changes: TrueTime also facilitates schema changes.

Beyond Spanner, TrueTime has broader applications:

  • Cross-system snapshots: TrueTime enables consistent snapshots across multiple independent systems without requiring 2PC. These systems can agree on a timestamp using TrueTime and then independently dump their state at that time.
  • Workflow timestamps: TrueTime timestamps can be used as tokens passed through workflows (chains of events). Every system in the workflow can agree on the same timestamp.

TrueTime itself is not immune to the effects of network partitions. The underlying time sources (GPS receivers and atomic clocks) can drift apart, and as this drift increases, transactions may experience longer wait times before committing. Eventually, transactions may time out, impacting availability.

Paper Review

This paper is a landmark contribution to distributed systems. Its density reflects the numerous groundbreaking ideas it introduced to the database community. Personally, I found it took considerable time to fully grasp all the nuances. It's essential reading for anyone passionate about distributed systems.
