
Paper Insights #23 - Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases

This article explores Amazon Aurora, a distributed SQL database that avoids two-phase commit (2PC). While I initially intended to jump from a discussion of 2PC directly to Google Spanner, another globally distributed SQL database, I felt it was important to cover other notable distributed SQL designs first. Published in 2017 and presented at SIGMOD by Alexandre Verbitski et al., the Aurora paper presents a clever approach that enables distributed SQL without 2PC, a concept that may initially seem counterintuitive.

Recommended Read: A New Presumed Commit Optimization for Two Phase Commit where I introduce SQL database concepts and provide an overview of how SQL databases work.

Paper Link

Let us start by revisiting some concepts.

Cloud Native

The term "cloud-native" gained prominence in the late 2000s as cloud computing adoption accelerated. (On a side-note, while cloud was initially hailed as the future, the rising costs are leading organizations to re-evaluate the role of on-premises infrastructure).

Cloud-native refers to a methodology for building and running applications that fully leverages the distributed computing and scalability capabilities of cloud environments. It focuses on how applications are designed and deployed, not simply where they reside.

Aurora is a cloud-native database service, meaning it's constructed entirely from cloud components. Let's briefly cover the relevant AWS cloud services:

Clusters & Availability Zones

A cluster is a group of interconnected computers, typically using high-bandwidth fabric networks. These clusters are often hosted within data centers.

Availability Zones (AZs) are distinct physical locations within a specific geographic region. Think of them as individual data centers, potentially composed of multiple machine clusters. AZs are themselves contained within a larger geographic region. As of February 2025, AWS Cloud spans 114 Availability Zones within 36 geographic regions. For example, the us-east-1 region includes AZs like use1-az1, use1-az2, and so on (complete list).

It's important to recognize that an AZ, or even an entire region, can become unavailable due to various factors:

  • Natural disasters
  • Network outages
  • Planned maintenance or repairs

Therefore, robust services must be replicated across multiple AZs and regions. Given its status as a global SQL database, it's logical to assume that Aurora employs similar replication strategies.

Correlated Failures

Google's paper, Availability in Globally Distributed Storage Systems, is an excellent resource for understanding data center failures, and I highly recommend it.  I'll briefly touch on some of its key points relevant to this discussion.

Correlated failures are multi-point failures that share a common root cause. 

Examples

  • Operating System Bugs: An OS bug can impact numerous machines within a cluster or even an entire availability zone, depending on how quickly it's identified and addressed. This highlights the importance of staged rollouts for OS upgrades: first a small set of machines, then a canary group, and finally, a broader deployment.

  • Hard Drive Failures: A manufacturing defect in a batch of hard drives can lead to simultaneous failures across availability zones where those drives were deployed. Mitigation strategies include staggered hard drive replacements and using drives from different batches or manufacturers within a data center.

  • Network Switch Failures: Servers are often organized into racks, with multiple machines connected within each rack. Communication between machines in different racks requires traversing a network switch. A failure of this switch can impact all machines within the affected rack.  Therefore, when replicating data within a cluster, it's essential to ensure that each replica resides on a different rack.  Ideally, no two replicas of the same data should be on machines within the same rack.

  • Natural Disasters (Fire, etc.): These can incapacitate an entire availability zone or even a region.

  • Power Outages: These can also take down an availability zone.

  • Undersea Cable Cuts: In a worst-case scenario, an undersea cable cut can render an entire region unavailable.

Mean Time To Failure (MTTF)

MTTF measures the average time a product or system works before experiencing a failure.

MTTF = Total operational time / Number of failures

Mean Time To Repair (MTTR)

MTTR is a metric used to measure how long it takes to fix a broken system or device. It's a key indicator of a system's maintainability and repair efficiency.

MTTR = Total repair time / Number of incidents

Double Fault

System design must account for the possibility of correlated failures.  While even robust systems like Aurora are designed to withstand a single correlated failure (such as an AZ outage), managing multiple, simultaneous correlated failures (double faults) is significantly more challenging. 

The key to minimizing the risk of double faults is ensuring that MTTR << MTTF. This large disparity between MTTR and MTTF reduces the probability of a second failure occurring before the first is resolved.
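To make that intuition concrete, here is a small back-of-the-envelope sketch in Python (my own illustration, assuming roughly independent failures and an exponential failure model, neither of which comes from the paper): the chance of a second replica failing during a repair window shrinks roughly in proportion to MTTR / MTTF.

def double_fault_risk(mttf_hours: float, mttr_hours: float, replicas: int) -> float:
    """Rough probability that another replica fails while one repair is in progress."""
    # Each remaining replica fails during the MTTR window with probability
    # ~ MTTR / MTTF (assuming independent, exponentially distributed failures).
    p_single = mttr_hours / mttf_hours
    return 1 - (1 - p_single) ** (replicas - 1)

# Example: MTTF of ~1 year per node, repairs finish within 1 hour, 6 replicas.
print(double_fault_risk(mttf_hours=8760, mttr_hours=1, replicas=6))  # ~0.00057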

Replication

Replication involves creating multiple copies of the same data, typically key-value pairs where the key identifies the value. The value itself is replicated across multiple machines. Maintaining consistency across these replicas is crucial, and several algorithms address this.

Primary-Secondary Replication

One replica is designated the primary, and all writes are directed to it. The primary then propagates these writes to the secondary replicas, often asynchronously. Relaxed reads at the secondaries can improve performance. This can be extended to multi-primary setups.
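Here is a minimal, self-contained sketch of primary-secondary replication with asynchronous propagation. The class and method names are mine, purely for illustration.

class Secondary:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value            # may lag the primary (relaxed reads)

class Primary:
    def __init__(self, secondaries):
        self.data = {}
        self.secondaries = secondaries
        self.backlog = []                 # writes not yet propagated

    def write(self, key, value):
        self.data[key] = value            # all writes land on the primary first
        self.backlog.append((key, value))

    def propagate(self):
        # Typically runs in the background; secondaries catch up asynchronously.
        for key, value in self.backlog:
            for secondary in self.secondaries:
                secondary.apply(key, value)
        self.backlog.clear()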

Leaderless Replication (Quorum)

With N nodes storing a replicated data item, each item carries a version number that increments with each write. For a majority quorum, at least N / 2 + 1 nodes must participate in every write and every read.

  • Writes: The client sends writes to all N replicas. An acknowledgment is sent when N / 2 + 1 replicas have committed the write.
  • Reads: The client queries N / 2 + 1 replicas. At least one of these will have the latest value, ensuring consistency.
Generally, if there are N replicas, W is the write quorum, and R is the read quorum, then W + R > N is required for consistency. Distributed file systems like Harp also use quorum replication, where the data item is a file or file sub-stripe. Quorum replication is linearizable: a consistent read from R replicas always observes the latest committed value.

It's important to note that quorum-based voting is used for replication only; it does not solve the consensus problem.
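Here is a minimal in-memory sketch of leaderless quorum replication with version numbers (W + R > N). The structure is illustrative and not tied to any particular system's API.

from dataclasses import dataclass

@dataclass
class Replica:
    version: int = 0
    value: object = None

class QuorumStore:
    def __init__(self, n=5):
        self.replicas = [Replica() for _ in range(n)]
        self.w = n // 2 + 1    # write quorum
        self.r = n // 2 + 1    # read quorum, so W + R > N holds

    def write(self, value, version):
        acks = 0
        for replica in self.replicas:       # the client sends writes to all N replicas
            if version > replica.version:   # a replica only accepts newer versions
                replica.version, replica.value = version, value
                acks += 1
        return acks >= self.w               # acknowledged once W replicas committed

    def read(self):
        # Query any R replicas; at least one overlaps with the latest write quorum.
        sampled = self.replicas[:self.r]
        latest = max(sampled, key=lambda replica: replica.version)
        return latest.version, latest.value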

Chain Replication

Replicas are arranged in a chain. All writes are sent to the head, and the tail sends the final acknowledgment. Each node commits the write and forwards it to the next node in the chain. Each node has seen at least as many updates as its successor and no more than its predecessor.

Each node maintains a list of updates it has forwarded downstream but that haven't yet been acknowledged at the tail. When a node is connected to a new successor (for example, after a failure), it re-forwards these pending updates. The list is garbage collected as the tail periodically acknowledges updates.

  • If the head fails, the next node in the chain becomes the head. Pending writes are dropped because they were not yet committed.
  • If a middle node fails, the two adjacent nodes connect with each other. Pending writes are propagated again from the point of failure.
  • If the tail fails, its predecessor becomes the tail.

Consistent reads must go to the tail, as that's where the committed value resides.

Advantages: Lower network usage compared to quorum-based approaches, as clients don't need to send data to W members. 

Disadvantages: Higher latency due to the sequential nature of writes.

Delta is one example of an implementation of chain replication.
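Here is a minimal sketch of chain replication's write path: the head receives the write, each node applies it and forwards it downstream, and the tail's acknowledgment flows back up. This is an illustration of the idea, not a production protocol.

class ChainNode:
    def __init__(self, name):
        self.name = name
        self.log = []          # updates applied at this node
        self.pending = []      # forwarded downstream but not yet acked by the tail
        self.next = None       # successor in the chain (None for the tail)

    def write(self, update):
        self.log.append(update)
        if self.next is None:              # tail: commit point, ack flows back
            return f"ack({update}) from tail {self.name}"
        self.pending.append(update)        # tracked until the tail acknowledges
        ack = self.next.write(update)      # forward down the chain
        self.pending.remove(update)        # the tail's ack garbage-collects it
        return ack

# head -> middle -> tail; consistent reads would be served by the tail.
head, middle, tail = ChainNode("head"), ChainNode("middle"), ChainNode("tail")
head.next, middle.next = middle, tail
print(head.write("x = 1"))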

Relational (SQL) Databases

SQL databases guarantee Atomicity, Consistency, Isolation, and Durability (ACID) properties for transactions:
  • Atomicity: A transaction either completes entirely or not at all. No partial updates are allowed.
  • Consistency: Transactions maintain database integrity by adhering to all defined constraints. The database remains in a valid state before and after each transaction.
  • Isolation: Transactions appear to execute independently, as if they were the only ones running. Different isolation levels exist, with serializable isolation ensuring transactions appear to occur in a specific serial order.
  • Durability: Once a transaction is committed, the changes are permanent and survive even system failures.
Consider a bank transfer from account A to account B:
  • Atomicity ensures both the debit from A and the credit to B occur, or neither does.
  • Consistency ensures the total balance across all accounts remains constant.
A simple approach might be to lock the entire database for the duration of a transaction; however, this is highly inefficient. Modern databases support concurrent transactions, meaning multiple transactions can execute at the same time, possibly touching the same data. This requires concurrency control mechanisms like:
  • Locks: Data items (rows, cells) are locked in shared (read) or exclusive (write) mode to prevent conflicting modifications.
  • Version Numbers: Data items are versioned, and modifications create new versions, allowing reads to occur on older versions while writes happen on newer ones.
To ensure durability despite potential server failures, databases use transaction logs (a.k.a. write-ahead logs). These logs record all changes before they are applied to the database itself. Two types exist: UNDO logs (covered in my 2PC article) and REDO logs (used by Aurora, among others).

REDO Logs

REDO logs contain records like:
  • [start, TID]: Transaction TID has begun.
  • [write, TID, X, new_value]: Transaction TID wrote new_value to data item X.
  • [commit, TID]: Transaction TID committed successfully.
  • [abort, TID]: Transaction TID was aborted.
During a transaction, locks are held, and REDO log entries are made. For the bank transfer example, here are the REDO log entries, assuming initial balances of $100 in A's account and $150 in B's, and a transfer of $50 from A to B:

start 100346
write 100346 A $50
write 100346 B $200
commit 100346

Once the commit record is logged, the transaction is considered committed. The database state can be reconstructed solely from the REDO log by applying committed transactions in the order of their commit records. The log effectively becomes the source of truth.

Interesting note: Locks are only necessary before a transaction commits (before the commit entry is in the REDO log). After a transaction commits, its modifications can be replayed from the REDO log in commit order to rebuild the database state.
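As a small illustration of that note, here is a sketch that rebuilds the database state from a REDO log by applying only committed transactions in commit order (the record format mirrors the bank-transfer example above; this is my own toy code, not Aurora's).

redo_log = [
    ("start",  100346),
    ("write",  100346, "A", 50),
    ("write",  100346, "B", 200),
    ("commit", 100346),
]

def replay(log):
    writes, state, committed = {}, {}, []
    for record in log:
        kind, tid = record[0], record[1]
        if kind == "write":
            _, _, item, new_value = record
            writes.setdefault(tid, []).append((item, new_value))
        elif kind == "commit":
            committed.append(tid)          # remember commit order
        # "start" and "abort" records need no replay action here
    for tid in committed:                  # apply committed transactions only
        for item, new_value in writes.get(tid, []):
            state[item] = new_value
    return state

print(replay(redo_log))   # {'A': 50, 'B': 200}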

Popular SQL databases

  • MySQL: A widely used, open-source relational database known for its ease of use and reliability, making it a popular choice for web applications.
  • PostgreSQL: Another powerful open-source relational database, known for its extensibility and adherence to SQL standards. It's often preferred for applications requiring complex data management and analysis.
  • Microsoft SQL Server: A robust relational database developed by Microsoft, commonly used in enterprise environments for high-volume transaction processing and business intelligence.
  • SQLite: A lightweight, serverless relational database that's embedded within applications. It's often used for mobile apps and small-scale projects.

Distributed Relational (SQL) Databases

A distributed SQL database stores data across multiple servers, potentially geographically dispersed. This partitioning also applies to transaction logs. Building such a system requires sophisticated algorithms, such as two-phase commit (2PC).

Distributing a database is a notoriously difficult problem in computer science. Google's Spanner tackled this by distributing the entire database engine, relying heavily on 2PC. While 2PC can be slow and limit throughput, Spanner's use of TrueTime enabled it to achieve reasonable performance, making it a unique example of a true distributed database.

Aurora, however, took a different approach. It's a distributed RDBMS that achieves distribution without using 2PC. The next section will explore how Aurora accomplishes this.

Aurora

Aurora avoids distributing the database engine itself. Instead, it relies on replicating transaction logs. Since the transaction log effectively is the database (as it can be used to rebuild the database state), distributing the log achieves the desired distribution.

Aurora uses a three-server setup for each database: a primary instance that handles all transactions, and two secondary replicas that maintain copies of the database. The secondaries receive asynchronous updates from the primary after a transaction commits.

Each instance is a fully functional, independent SQL database server (e.g., MySQL).

Therefore, Aurora's approach differs from traditional distributed SQL databases. The SQL processing (transaction execution) occurs on a single node (the primary). What makes Aurora "globally available" is the distributed replication of the transaction logs.

Let's delve deeper, but first, let's address some common questions:

FAQ

  • Doesn't distributed SQL require 2PC? Yes, true distributed SQL databases do. However, Aurora's SQL engine itself is not distributed. Only the transaction logs are.
  • Is consensus required for transaction log replication? No. Consensus is used to build distributed logs, where each entry represents a value agreed upon by multiple participants. In Aurora, the transaction log entries are determined by the single primary SQL instance. The underlying system's job is simply to replicate these pre-determined entries. Simple quorum replication is sufficient for this purpose.

Naive Approach to Transaction Log Replication

A key goal in Aurora is replicating the transaction logs generated by the SQL engine. A naive approach, as described in the paper, involves the following steps:

  1. Write logs to EBS.
  2. EBS mirrors the logs (using software mirroring).
  3. Forward the modifications to the replica SQL engine instance.
  4. Write logs to the replica instance's EBS.
  5. Replica instance's EBS mirrors the logs.

The paper also mentions additional logs, such as binary logs, stored in S3. Crucially, steps 1, 3, and 5 are sequential. A transaction cannot commit without the initial write to EBS (step 1). Forwarding to the replica (step 3) cannot begin before the commit, and thus before step 1 completes. Similarly, the replica's EBS mirroring (step 5) cannot start before the logs are written to the replica's EBS (step 4, which itself depends on step 3).

These sequential dependencies introduce latency and excessive write operations.

Aurora's Architecture

Aurora employs a service-oriented architecture (SOA), separating compute and storage layers. This contrasts with a shared-nothing architecture. The storage layer handles transaction log storage, materialization, and replication, while the compute layer hosts the SQL engine instances.

Compute

The compute layer consists of three instances across different availability zones. The primary SQL engine resides in one zone, and REDO logs are forwarded to the other two replicas in a chain-like fashion; unlike classic chain replication, however, the acknowledgment is sent by the head (the primary instance itself), and the replication is asynchronous.

Storage

The storage layer maintains six replicas of each log entry, with two replicas per availability zone. Replication is quorum-based, with a write quorum (W) of 4 and read quorum (R) of 3. The primary compute instance is responsible for replicating log entries, and an entry is considered committed once four out of the six replicas acknowledge receipt.

For performance reasons, all input/output operations are handled in batches. The storage layer's internal workings are more intricate, and further details can be found in section 3.3 of the paper.

Data Partitioning

The database volume is divided into 10GB segments, which hold the actual data. Each segment is replicated six ways in the storage layer, and the six replicas of a segment form a Protection Group (PG).

Each segment stores multiple pages of the database. A page is the smallest unit of data that the database reads from or writes to storage at one time; it is a contiguous block of storage. The compute instances cache database pages in a buffer cache. On a miss, the page is read from the storage layer.

For each page, the primary compute instance maintains the following metadata:

  • The highest log entry number that modified the page. Since the primary commits log entries, it knows which entry last updated a given page.
  • The storage nodes that have a complete, materialized version of the page. As the primary receives acknowledgments from the storage replicas after writes, it tracks which storage nodes have all updates for each page (see the sketch below).
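Here is a hypothetical sketch of that per-page bookkeeping. The field and method names are mine, chosen for illustration, and do not reflect Aurora's internal data structures.

from dataclasses import dataclass, field

@dataclass
class PageMetadata:
    highest_lsn: int = 0                              # last log entry that modified the page
    complete_nodes: set = field(default_factory=set)  # storage nodes with a fully materialized page

class PrimaryPageTracker:
    def __init__(self):
        self.pages = {}                               # page_id -> PageMetadata

    def on_write(self, page_id, lsn):
        meta = self.pages.setdefault(page_id, PageMetadata())
        meta.highest_lsn = max(meta.highest_lsn, lsn)
        meta.complete_nodes.clear()                   # nodes must re-acknowledge the new version

    def on_storage_ack(self, page_id, node_id, acked_lsn):
        meta = self.pages[page_id]
        if acked_lsn >= meta.highest_lsn:             # this node now has every update for the page
            meta.complete_nodes.add(node_id)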

Transaction Logs

The transaction log is the complete database.

Log Structure

Log records in Aurora are assigned a Log Sequence Number (LSN). Because the SQL engine is a single instance, log entries are pre-determined, eliminating the need for 2PC or consensus during replication. Log entries are simply replicated using quorum replication. Replicas that miss an entry receive it later through a gossip protocol.

Two important LSN values are tracked:

  • Volume Complete LSN (VCL): The storage service guarantees log availability up to VCL. Log entries with LSNs higher than the VCL are not guaranteed to be available and can be discarded.
  • Volume Durable LSN (VDL): The highest LSN for which the SQL engine guarantees database consistency. The VDL must be less than or equal to the VCL.

The VCL and VDL are independent because the compute and storage layers operate independently. The VCL is a storage layer concept related to log availability guarantees. The VDL is a compute layer concept, built upon the VCL, that guarantees database consistency up to that point.

When a new log entry is added, an LSN is generated using the current VDL and an upper bound called the LSN Allocation Limit (LAL). The LAL limits how far LSNs can advance ahead of database consistency points, i.e., VDL.

While data is segmented, Aurora maintains a single, globally ordered transaction log. Each log entry includes a back pointer to the previous log entry relevant to a specific segment. These back pointers enable the system to efficiently identify all log entries associated with a given segment.

A key concept related to segments and logs is the Segment Complete LSN (SCL). The SCL is the point up to which a particular replica of a segment has received all log records. It is used by the gossip protocol to exchange missing log records.
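The following sketch ties these quantities together: the VDL never exceeds the VCL, and new LSNs may not run more than LAL ahead of the current VDL (the paper mentions a limit of 10 million). The code is illustrative only and not Aurora's implementation.

LAL = 10_000_000   # LSN Allocation Limit

class LogState:
    def __init__(self):
        self.next_lsn = 1
        self.vcl = 0   # storage layer: log guaranteed available up to here
        self.vdl = 0   # compute layer: database consistent up to here (vdl <= vcl)

    def allocate_lsn(self):
        # New LSNs may not advance more than LAL beyond the current VDL.
        if self.next_lsn > self.vdl + LAL:
            raise RuntimeError("back-pressure: storage has fallen too far behind")
        lsn, self.next_lsn = self.next_lsn, self.next_lsn + 1
        return lsn

    def advance(self, storage_complete_lsn, highest_consistent_lsn):
        self.vcl = storage_complete_lsn
        self.vdl = min(highest_consistent_lsn, self.vcl)   # VDL never exceeds VCL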

Log Materialization

The storage layer not only stores transaction logs but also independently materializes them into database pages in the background. These materialized pages might not always be the most up-to-date version; newer log records can be applied to them to reflect the latest state.

Transaction Processing

All transactions are executed by the primary instance. Since the primary cannot hold all database pages in memory, cache misses can occur. When a cache miss happens, the primary requests the complete page from a storage node that has it.

Critically, remember that there's only one primary SQL instance. All transactions flow through it. This means the primary always knows the highest sequence number that modified any given page.

The transaction flow is as follows:

  1. The primary computes state updates (writes).
  2. The primary writes the logs to the storage layer.
  3. The primary waits for a write quorum from the storage nodes.
  4. Storage nodes acknowledge as soon as the log is committed.
  5. The primary sends REDO logs to the replica SQL instances.

The VDL advances as logs are written and acknowledged. Once the VDL advances past the commit log entry for a transaction, the primary sends an acknowledgment back to the client, confirming the transaction's completion.
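Here is a simplified sketch of that commit path: storage-node acknowledgments advance the VDL over the contiguous prefix of quorum-acknowledged log records, and a transaction is acknowledged to the client once the VDL passes its commit record's LSN. Again, this is an illustration, not Aurora's code.

WRITE_QUORUM = 4   # out of six storage replicas

class CommitTracker:
    def __init__(self):
        self.acks = {}             # lsn -> number of storage-node acks
        self.vdl = 0
        self.pending_commits = {}  # commit record's lsn -> transaction id

    def on_storage_ack(self, lsn):
        self.acks[lsn] = self.acks.get(lsn, 0) + 1
        # Advance the VDL over the contiguous prefix of quorum-acked records.
        while self.acks.get(self.vdl + 1, 0) >= WRITE_QUORUM:
            self.vdl += 1
        # Transactions whose commit LSN is now <= VDL can be acked to clients.
        done = [tid for lsn_c, tid in self.pending_commits.items() if lsn_c <= self.vdl]
        for lsn_c in [l for l in self.pending_commits if l <= self.vdl]:
            del self.pending_commits[lsn_c]
        return done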

Reads

Reads in Aurora are typically served from the buffer cache. A storage fetch is only necessary when the buffer cache lacks the requested page. Pages in the buffer cache are guaranteed to be the latest version, ensured by evicting a page only if its LSN is greater than or equal to the current VDL.

Aurora offers different read consistency levels:

  • Strong Reads (Serialized/Transactional Reads): These reads are routed to the primary instance to guarantee the latest data, as the primary is the only instance that is fully up-to-date. Note that the SQL engine must also operate in serializable isolation mode to support these reads.

  • Relaxed Reads: These reads can return slightly stale data. They can be served from any replica instance, which typically lag behind the primary by around 20ms. Replicas asynchronously consume log records with LSNs less than or equal to their respective VDLs.

  • Point-in-Time Reads: These reads retrieve data as it existed at a specific LSN (the read point). This ensures the data is at least as current as the read point. The database serves these reads from storage nodes that are complete up to the read point (i.e., whose SCL meets or exceeds it).

Finally, each segment has a per-PG Minimum Read Point LSN (PGMRPL). This is the lowest read-point LSN among reads currently in progress for that protection group. The database maintains this value, which acts as a lower watermark: log records older than the PGMRPL are no longer needed, allowing the storage layer to materialize pages up to the PGMRPL and garbage collect older log records.
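A small sketch of this read-side bookkeeping, under my own simplifying assumptions: a point-in-time read goes to a storage node whose SCL covers the read point, and the PGMRPL is simply the minimum read point still in flight for a protection group.

def choose_storage_node(segment_scls: dict, read_point: int):
    """segment_scls: node_id -> Segment Complete LSN for one protection group."""
    candidates = [node for node, scl in segment_scls.items() if scl >= read_point]
    return candidates[0] if candidates else None   # any complete node will do

def pg_min_read_point(in_flight_read_points: list) -> int:
    """Lower watermark: log records older than this can be materialized into
    pages and garbage collected by the storage layer."""
    return min(in_flight_read_points) if in_flight_read_points else 0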

Recovery

The storage layer handles log application, meaning the SQL engine doesn't need to replay logs after a crash. Upon recovery, the primary simply needs to recompute the VDL from the storage layer and resume logging new transactions from that point. The paper reports a recovery time of approximately 10 seconds.
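To make the recovery step concrete, here is a heavily simplified sketch of recomputing a durable point from the storage layer. This is my own approximation of the idea (treat an LSN as durable for a protection group once a write quorum of its replicas reports having it, and take the minimum across groups); the paper's actual procedure is more involved.

WRITE_QUORUM = 4

def pg_durable_lsn(replica_highest_lsns: list) -> int:
    """Highest LSN held by at least a write quorum of this PG's replicas."""
    ordered = sorted(replica_highest_lsns, reverse=True)
    return ordered[WRITE_QUORUM - 1]        # the 4th-highest value is quorum-durable

def recompute_durable_point(protection_groups: dict) -> int:
    """protection_groups: pg_id -> highest LSN reported by each of its replicas."""
    return min(pg_durable_lsn(lsns) for lsns in protection_groups.values())

print(recompute_durable_point({
    "pg-1": [120, 119, 119, 118, 110, 0],    # durable up to 118
    "pg-2": [125, 125, 124, 124, 124, 123],  # durable up to 124
}))  # -> 118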

Administration

Aurora has high fault tolerance:

  • Writes to storage remain available even if an entire AZ goes down.
  • Reads are possible with the loss of a single AZ. Furthermore, Aurora can withstand an additional failure affecting a subset of machines in another AZ, such as a switch failure. The exact setup is not discussed in the paper, so it is difficult to say precisely how much further failure can be tolerated.

Aurora's design simplifies administration in several ways:

  • Software Upgrades: The fast recovery times enable zero-downtime patching. Because all state is persisted in the storage layer's logs, the compute layer can simply wait for active transactions to complete before restarting with the updated software (e.g., a new binary).

  • Simplified Storage Management: There's no need to shut down SQL machines to update storage or adjust disk usage.

Evaluation

  • Aurora's read and write throughput scales linearly with instance size, which is expected given the increased processing capacity of larger instances. 
  • Aurora demonstrates significantly higher throughput than MySQL, likely due to its sophisticated storage layer and batching mechanisms (though the specific reasons require further investigation).  A comparison of latency under the same test conditions could have been valuable.
  • Replica lag in Aurora is influenced by the write rate on the primary.  In one test, a write rate of 10,000 writes/sec resulted in a replica lag of only 5.38ms.  This is substantially lower than naive MySQL's replication lag, which reached 300s.

Paper Review

This paper is quite readable. The concepts are straightforward, especially once the design's central insight is understood: the transaction log is the database. Furthermore, the paper demonstrates that 2PC isn't strictly required for building a globally distributed database when a single server instance processes transactions. Global availability is achieved through transaction log replication.
