
Paper Insights #18 - Practical Uses of Synchronized Clocks in Distributed Systems

This influential paper was authored by Barbara Liskov, a renowned computer scientist who pioneered the field of distributed systems.

Paper Link

The paper provides a valuable overview of several groundbreaking systems:

  • At-most-once delivery (SCMP): This system ensures that a message is delivered at most once, preventing duplicate messages.
  • Authenticator Systems (Kerberos): This system focuses on secure authentication and authorization within distributed environments.
  • Cache consistency (Echo): This system addresses the challenges of maintaining data consistency across distributed caches.
  • Distributed Databases (Thor): This system explores the design and implementation of distributed databases.
  • Replicated File System (Harp): This system investigates the principles of replicating files across multiple servers for improved availability and performance.

While many of these concepts may seem outdated in the context of modern computing, studying them provides crucial insights into the foundational principles of distributed systems. Understanding how these early systems addressed challenges like message delivery, security, and data consistency helps us appreciate the advancements made in contemporary distributed computing technologies.

Synchronized Clocks

The existence of perfectly synchronized clocks across all computers would have dramatically simplified the challenges of distributed systems.

With perfect clock synchronization:

  • All events occurring within the distributed system could be ordered based on their timestamps.
  • All participating systems (assuming they are not exhibiting Byzantine failures) would agree on the same order of events.

This total ordering would have elegantly solved fundamental problems like atomic broadcast and consensus. Both safety (ensuring that certain undesirable states are never reached) and liveness (guaranteeing that the system makes progress) could have been achieved in a synchronous environment.

Unfortunately, perfect clock synchronization is unattainable in practice.

Network Time Protocol

The closest approximation we have is Network Time Protocol (NTP), which synchronizes clocks across geographically distributed systems over the network. While NTP achieves high precision, typically within a few milliseconds, perfect synchronization is still not possible.

Architecture


NTP, designed by David L. Mills, employs a hierarchical network of systems to synchronize time across them. This hierarchical structure is organized into strata.

  • Stratum 0: Consists of highly precise time sources such as atomic clocks and GPS receivers.
  • Stratum 1: Comprises time servers maintained by governments and private organizations. These servers synchronize their time directly with Stratum 0 sources, typically to within a few microseconds.
  • Stratum 2 and above: Include servers and workstations that synchronize their time with servers in lower strata (e.g., Stratum 2 servers synchronize with Stratum 1 servers).

Algorithm


NTP operates by exchanging timing information between clients and servers.

  • Message Exchange:
    • A client sends an ECHO message to a server at its local time T0.
    • The server receives the message at its local time T1.
    • The server sends a reply message at its local time T2.
    • The client receives the reply at its local time T3.
  • Calculating Network Round Trip Time (RTT): RTT is the time elapsed from send to receive as recorded by the client, minus the time elapsed from receive to send as recorded by the server:
RTT = (T3 - T0) - (T2 - T1)    ..(i)
  • Estimating Clock Offset (θ): Assume the client's clock is behind the server's clock by θ (θ is negative if the client's clock is ahead). Then T0 + θ is the server's time at the moment the client sends the message. Adding half the RTT (the assumed one-way delay) should yield T1, so the following equation holds:
T0 + θ + RTT/2 = T1   ..(ii)

          Substituting (i) in (ii), we get:

θ = ((T1 - T0) + (T2 - T3)) / 2
  • Clock Adjustment:
    • The server's reply carries the timestamps T0, T1, and T2; the client records T3 upon receipt.
    • The client uses these four timestamps to calculate the clock offset (θ) and adjusts its local clock accordingly, as sketched below.
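
To make the arithmetic concrete, here is a minimal sketch in Python of the offset and RTT computation from the four timestamps (the function name and example values are illustrative, not part of NTP):

```python
def ntp_offset_and_rtt(t0, t1, t2, t3):
    """Compute the clock offset (theta) and round-trip time from one
    NTP-style exchange.

    t0: client clock when the request is sent
    t1: server clock when the request is received
    t2: server clock when the reply is sent
    t3: client clock when the reply is received
    """
    rtt = (t3 - t0) - (t2 - t1)            # equation (i)
    theta = ((t1 - t0) + (t2 - t3)) / 2    # derived from equation (ii)
    return theta, rtt

# Client is 5 units behind the server; each one-way trip takes 10 units.
theta, rtt = ntp_offset_and_rtt(t0=100, t1=115, t2=116, t3=121)
print(theta, rtt)  # 5.0 20
```

In this example the computed offset is exactly 5 because both one-way trips take the same time; in practice the two delays differ, which is one source of NTP's residual error.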

Assumptions and Limitations

  • Constant Time Flow Rate: The derivation assumes that the rate of time flow is identical for both the client and server clocks. This assumption may not always hold true in reality.
  • Bounded Message Delay: The calculation relies on the assumption that message transmission delays are bounded. However, network conditions can cause unpredictable delays.
Due to these assumptions, NTP clock synchronization achieves high accuracy with a high probability but cannot guarantee perfect synchronization.

Important Note: While clock synchronization can improve performance in many distributed systems, it should not be considered a fundamental requirement for correctness. System design should prioritize mechanisms that function correctly even with imperfect clock synchronization.

Precision

The algorithm synchronizes the clock with a skew of ε. At any given moment, a node's clock differs from the actual time by no more than ε/2.

External Consistency

External consistency, also known as strict serializability (for multi-object operations) or linearizability (for single-object operations), ensures that the effects of transactions are immediately visible to all clients in the order they were committed.

A simple analogy from my CS244b professor, Dr. D. Mazieres, illustrates this concept:

  • Suppose you add a new row to a table in a database and successfully commit the transaction.
  • You then call up a friend, tell them about the change, and ask them to read the database for the newly added row.
  • If your friend's read fails to retrieve the row, the database violates external consistency.

Formally, if transaction T2 is submitted by a client only after transaction T1 commits, then the database must order T2 after T1.

A database that guarantees external consistency is considered strictly serializable.

Application 1: At Most Once Delivery

A messaging system facilitates the delivery of messages from a producer (sender) to a consumer (receiver). At-most-once delivery guarantees that a message will be delivered to the consumer at most one time. This differs from at-least-once delivery, which ensures that each message is delivered at least once, potentially leading to duplicates. At-most-once delivery prioritizes avoiding duplicate messages, even if it means that some messages might be lost during transmission or processing.

Naive Approaches

  • No Sender Retries: Eliminating sender retries prevents duplicate messages, but can result in significant message loss due to network delays.
  • Sender-based Deduplication: Receivers maintain a table of active senders and track message IDs (which should be sequentially increasing). This allows for deduplication, but receivers cannot safely discard sender information, so the table grows indefinitely, which is not feasible in practice.

Slightly Better Approach

  • Handshake: Before sending messages, a handshake is performed between sender and receiver to establish the receiver's current message reception position. This approach has high overhead for scenarios with only a few messages.

The SCMP Protocol

SCMP utilizes synchronized clocks (though it can operate, albeit less efficiently, without them).

Receivers maintain records of recent communications. Senders include timestamps with each message. A message and its effects expire after a defined "lifetime interval" (p).

Receiver's data structure (G):

  • G.time: Receiver's time.
  • G.CT: Maps connection IDs (unique identifiers for sender-receiver communication channels) to connection information (including the last known timestamp ts for that connection). Entries in G.CT are removed if their last known timestamp is older than G.time - p - ε.
  • G.upper: An upper bound on the timestamps of all entries removed from G.CT (i.e., the latest timestamp removed).
  • G.latest: A timestamp larger than the timestamps of all accepted messages. It is usually set to G.time + β (where β is some increment) and committed to stable storage.

A message on connection C is considered new if its timestamp is greater than G.CT[C].ts, or greater than G.upper if C has no entry in G.CT.

Post-Crash Initialization: G.upper is initialized to G.latest after a crash.
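
Below is a minimal sketch of the receiver-side logic described above; the fields mirror G.time, G.CT, G.upper, and G.latest from the paper, but the code structure and method names are my own, and stable storage is not modeled:

```python
class ReceiverState:
    """Sketch of SCMP receiver state G (simplified: persistence and
    clock advancement are not modeled)."""

    def __init__(self, now, beta):
        self.time = now             # G.time: receiver's local clock
        self.ct = {}                # G.CT: connection id -> last accepted timestamp
        self.upper = float("-inf")  # G.upper: bound on timestamps removed from G.CT
        self.latest = now + beta    # G.latest: would be written to stable storage

    def accept(self, conn, ts):
        """Return True iff the message with sender timestamp `ts` on
        connection `conn` is new (not a potential duplicate)."""
        threshold = self.ct.get(conn, self.upper)
        if ts <= threshold:
            return False            # possible duplicate: reject
        self.ct[conn] = ts
        return True

    def expire(self, lifetime, epsilon):
        """Drop entries older than G.time - p - ε, raising G.upper."""
        cutoff = self.time - lifetime - epsilon
        for conn, ts in list(self.ct.items()):
            if ts < cutoff:
                self.upper = max(self.upper, ts)
                del self.ct[conn]
```

After a crash, `upper` would be re-initialized from the persisted `latest`, which is why a message can never be accepted twice across restarts.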

Limitations

If a connection C is forgotten by the receiver, G.upper might be used for recency checks instead of the last known timestamp for C. This can lead to valid messages being erroneously rejected as duplicates.

Correctness

Clock synchronization improves efficiency but is not strictly necessary for SCMP's correctness. The worst a badly skewed clock can do is cause a slow node's messages to be rejected as outdated.

Application 2: Authentication Tokens

Ticket-Based Authentication (Kerberos):

Communication between a client (C) and a server (S) is controlled by a ticket. The client obtains the ticket from a Ticket Granting Server (TGS). The TGS provides the client with:
  • TCS: A ticket specific to the server (S).
  • KCS: A secret session key shared between the client and the server.
This information is transmitted over an encrypted channel to ensure it is delivered only to the client.

In subsequent communications with S, the client presents the ticket (encrypted under S's secret key, so only S can decrypt it); the ticket itself carries the session key, which the client then uses to protect its requests.

Each ticket has an expiration time, after which communication using that ticket is prohibited. If S's clock is significantly behind, it might erroneously accept expired tickets; avoiding this would otherwise require S to verify each ticket with the TGS before processing requests, increasing overhead.

The server (S) must never accept and process an expired ticket. Synchronized clocks between S and TGS help ensure that S's clock is within an acceptable range of the TGS's clock, minimizing the risk of accepting expired tickets.

Correctness

Here there is a hard reliance on clock synchronization for absolute correctness. However, the author points out that the primary concern in practice is the theft of tickets from unattended workstations, which attackers can exploit while the tickets are still valid.

Authenticators and Message Replay Prevention

Authenticators are encrypted timestamps generated using the shared session key (KCS), known only to the client and server. Each message includes an authenticator. Upon receiving a message, the server verifies the authenticity of the message and its timestamp to prevent replay attacks (where an old message is retransmitted).

Similar to at-most-once delivery systems, authenticators help ensure that each message is processed only once. Clock synchronization plays a role in minimizing the acceptance of duplicate messages.
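
As a rough sketch of the idea (not Kerberos's actual wire format), the server can verify an authenticated timestamp with the shared session key and reject anything outside a small freshness window; here the encrypted timestamp is approximated with an HMAC, and all names and the window size are illustrative:

```python
import hashlib
import hmac
import time

SKEW = 300.0  # assumed freshness window in seconds (~5 minutes)

def make_authenticator(session_key, now=None):
    """Client side: a timestamp authenticated with the shared session key."""
    ts = repr(now if now is not None else time.time()).encode()
    tag = hmac.new(session_key, ts, hashlib.sha256).digest()
    return ts, tag

def verify_authenticator(session_key, ts, tag, seen, now=None):
    """Server side: check integrity, freshness, and replay."""
    expected = hmac.new(session_key, ts, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return False                     # forged or corrupted message
    now = now if now is not None else time.time()
    if abs(now - float(ts)) > SKEW:
        return False                     # too old (or too far in the future)
    if (ts, tag) in seen:
        return False                     # replay of an already-seen message
    seen.add((ts, tag))
    return True
```

Note how the freshness window bounds the server's state: it only needs to remember authenticators for roughly SKEW seconds, which is exactly where clock synchronization helps.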

Application 3: Cache Consistency

There are only two hard things in Computer Science: cache invalidation and naming things.

- Phil Karlton

Clients maintain local copies of server objects as cached data. This caching mechanism enhances performance by eliminating the need to repeatedly fetch objects for access. However, it necessitates a mechanism to inform clients of server-side object modifications, triggering the invalidation of local copies and subsequent retrieval of updated values from the server. This is required by the Echo file system.

Caching strategies can be broadly categorized into two types:

  • Write-Through Cache: Modifications are propagated to the source before the write is acknowledged. In the context of Echo, write-through caching involves the server invalidating object copies in all client caches prior to committing the modification. This approach necessitates a lease mechanism on the server to track clients holding references to the object. If a client becomes unreachable, the server must await lease expiration.

  • Write-Behind Cache: Modifications are buffered at the client and written back to the server later. Leases are differentiated into read leases (which multiple clients may hold simultaneously) and write leases (held exclusively by a single client). A new write lease is granted only after all existing leases have been relinquished or have expired, and only the client holding the write lease may modify the file.

Lease mechanisms can be implemented in two ways:

  • Explicit Client Acknowledgment: Leases are assumed never to expire unless explicitly relinquished by the client. This relies on the unrealistic assumption of faultless processes and networks; the assumption is tolerable in a multi-processing environment on a single machine, but it does not hold in distributed systems.

  • Clock Synchronization-Based Leases: Leases are assigned timestamps based on synchronized clocks. However, clock desynchronization can render the system inoperable.

Neither of these lease mechanisms is entirely satisfactory. Clock synchronization is often adopted as an acceptable compromise to enable system functionality.
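
Here is a minimal sketch of the clock-synchronization-based variant, assuming a known skew bound ε that the server waits out before treating a lease as expired; the class and method names are mine, not Echo's:

```python
import time

EPSILON = 0.05  # assumed bound on client/server clock skew, in seconds

class LeaseServer:
    def __init__(self):
        self.read_leases = {}  # object id -> list of lease expiry times

    def grant_read_lease(self, obj, duration):
        """Grant a read lease; the client may cache `obj` until it expires."""
        expiry = time.time() + duration
        self.read_leases.setdefault(obj, []).append(expiry)
        return expiry

    def can_commit_write(self, obj):
        """A write (and the corresponding invalidation) may be committed only
        once every read lease has expired -- padded by EPSILON so that a
        client whose clock runs slightly behind has also stopped reading."""
        now = time.time()
        live = [e for e in self.read_leases.get(obj, []) if now < e + EPSILON]
        self.read_leases[obj] = live
        return not live
```

If clocks drift beyond EPSILON, a client may keep serving stale cached data after the server has committed a write, which is exactly the hazard described above.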

Correctness

Clock desynchronization compromises system correctness. This highlights a strong dependency on clock synchronization for proper system operation. However, for many caching applications, precise lease correctness is not a critical requirement. Applications utilizing such caching systems should be designed to function correctly even if lease correctness is not guaranteed, as demonstrated in the next application example.

Application 4: Atomicity

Transaction atomicity is a fundamental requirement in SQL databases, ensuring that a series of operations within a transaction either all succeed or all fail as a unit.

Traditionally, atomicity was achieved by using locks on the objects. However, the advent of optimistic concurrency control (OCC) introduced a new approach:

  • OCC assigns version numbers to each object. The versioned objects are stored within a local client-side cache.
  • Transactions can then be processed locally on the client.
  • The client submits the completed transaction to the server. The server validates the transaction by checking whether the version numbers of the objects involved still match the versions the client read. If they match, the transaction is committed; otherwise, it is aborted and the client must retry (see the sketch below).
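
A minimal sketch of the server-side validation step, with illustrative names rather than Thor's API:

```python
class VersionConflict(Exception):
    """Raised when a transaction's read set is stale; the client retries."""

class Store:
    def __init__(self):
        self.data = {}  # key -> (version, value)

    def commit(self, read_set, write_set):
        """read_set: key -> version observed by the client.
        write_set: key -> new value computed by the client."""
        # Validate: every object read must still be at the version the
        # client saw; otherwise a concurrent transaction got there first.
        for key, version in read_set.items():
            current_version, _ = self.data.get(key, (0, None))
            if current_version != version:
                raise VersionConflict(key)
        # Install writes, bumping each object's version.
        for key, value in write_set.items():
            current_version, _ = self.data.get(key, (0, None))
            self.data[key] = (current_version + 1, value)
```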

OCC is supported by Thor, where clock synchronization can significantly enhance performance: synchronized clocks contribute to better cache coherence across clients, and fresher cached objects reduce the likelihood of version conflicts, leading to fewer transaction aborts and higher overall commit rates.

Correctness

Clock synchronization is not necessary for the correctness of transaction commit as all commits ultimately occur on the server. 

While the underlying cache may exhibit inconsistencies due to its reliance on an imperfect lease mechanism, the application layer, in the context of transaction atomicity, maintains correctness.

Application 5: Commit Windows

Harp is a replicated file system that supports the Virtual File System (VFS) interface and is designed to work with Network File System (NFS). Unlike file systems that support complex transactions (e.g. Transactional NTFS), Harp treats each file operation as an independent atomic unit. This approach effectively transforms the file system into a key-value store (i.e., a NoSQL database) where each file is treated as a key.

Quorum-Based Replication

To ensure data consistency, Harp employs a majority voting mechanism within a quorum of replicas for committing updates to any file.

Membership Protocols and View Changes

Harp operates within a "view" where all replicas are aware of each other. Views change dynamically to accommodate server failures or additions, but each view always contains a majority of the servers. Since any two majorities intersect, each new view includes at least one replica from the previous view, preserving the history of committed transactions.

Upon a view change, the new view elects a leader, which gathers the latest state from the view's replicas before accepting new transactions.

Read Operations

In such quorum-based key-value stores, all reads are single-key reads. They can be classified into two types (a sketch of the strong variant follows this list):

  • Single-Key Strong Read (Read from Majority)

    • Reads data from a majority of replicas.
    • Guarantees linearizability and external consistency, even during view changes.
      • If a view change occurs mid-read, the read is simply retried.

  • Single-Key Relaxed Read (Read from Primary)

    • Reads data only from the primary replica.
    • Not linearizable due to potential view changes that may leave the primary unaware of updates.
    • May return stale data.
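
A minimal sketch of the strong variant referenced above: query replicas until a majority responds, then return the freshest value among the responses. The replica interface here is assumed, not Harp's:

```python
def strong_read(replicas, key):
    """Read `key` from a majority of replicas and return the freshest value.

    Each replica is assumed to expose get(key) -> (version, value),
    or None if it is unreachable.
    """
    majority = len(replicas) // 2 + 1
    responses = []
    for replica in replicas:             # in practice, issued in parallel
        resp = replica.get(key)
        if resp is not None:
            responses.append(resp)
        if len(responses) >= majority:
            break
    if len(responses) < majority:
        raise RuntimeError("no quorum: fewer than a majority responded")
    version, value = max(responses, key=lambda r: r[0])
    return value                         # highest version wins
```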

External Consistency

External consistency guarantees that the effects of all transactions committed up to a point in time T are visible to reads issued at T. With single-key relaxed reads, Harp does not provide this guarantee.

Impact of Clock Synchronization

Clock synchronization plays a crucial role in mitigating violations of external consistency. The primary replica can obtain short-lived leases from a sub-majority of backups (enough that, together with the primary itself, they form a majority). When a new view is established, the new primary must wait for all leases granted in the previous view to expire before accepting new requests. A primary that still holds valid leases from a sub-majority of backups can therefore serve reads knowing that no newer view has committed writes behind its back.
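
A sketch of how such a lease-guarded read might look on the primary, under the assumption that the primary tracks one lease per backup on its own clock (names and structure are mine, not Harp's):

```python
import time

class Primary:
    def __init__(self, num_backups):
        self.num_backups = num_backups
        self.lease_expiry = {}  # backup id -> lease expiry, on the primary's clock

    def record_lease(self, backup_id, duration):
        self.lease_expiry[backup_id] = time.time() + duration

    def holds_sub_majority(self):
        """True if enough backup leases are live that the primary plus
        those backups form a majority of the view."""
        now = time.time()
        live = sum(1 for expiry in self.lease_expiry.values() if expiry > now)
        return live + 1 > (self.num_backups + 1) // 2

    def relaxed_read(self, store, key):
        if not self.holds_sub_majority():
            raise RuntimeError("leases lost: fall back to a majority read")
        return store.get(key)
```

The waiting period during a view change is what makes this check meaningful: by the time the new primary starts serving, every such check on the old primary is guaranteed to fail, provided the clocks are close enough.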

Correctness

Even with clock synchronization, minor violations of external consistency can still occur.

Bonus: SQL Multi-Key Strong Reads (Transactional Reads)

For multi-key operations, SQL databases typically support transactional reads (also known as multi-key strong reads, snapshot reads, or read-only transactions). These reads guarantee that the values returned for all keys within the transaction reflect the state of the database at a consistent point in time, specifically the point at which the transaction commits.

Synchronized Rates

The paper explores the concept of relying on synchronized rates of time flow, rather than absolute synchronized clocks, to address certain challenges.

Consider an event e occurring on node O (e.g., lease expiration at a server). Another node D (e.g., the client holding the lease) depends on the occurrence of event e. For correctness, node D must perceive event e no later than node O does in absolute time, i.e., Td <= Te, where Td is the absolute time at which D considers e to have occurred and Te is the absolute time at which e actually occurs at O. For instance, the client must relinquish its lease no later, in absolute time, than the server reclaims it.

Event e is preceded by events g on node D and f on node O, which serve as a common reference point for establishing rate synchronization. Once this reference point is established, a conservative estimate of Td can be made to ensure Td <= Te. The paper provides a straightforward derivation of this estimate.

Rate synchronization is only beneficial when a communication mechanism exists to establish a common reference point. In the absence of such a mechanism, traditional clock synchronization remains necessary.
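
As a small sketch of the conservative estimate, assume the two clocks' rates agree to within a drift bound ρ (my notation, not the paper's), and let the client's lease-request send time serve as its reference event g, which precedes the server's grant at f. Shrinking the lease term by a factor (1 - ρ) on the client's clock then guarantees Td <= Te:

```python
RHO = 1e-4  # assumed bound on relative clock drift (100 ppm)

def conservative_local_expiry(t_g, lease_duration):
    """Local clock time at which node D must treat event e (lease
    expiration) as having occurred.

    t_g: D's local clock at reference event g (sending the lease request,
         which happens before the server starts the lease at f).
    lease_duration: the term the server counts from f on its own clock.

    If the two clocks' rates differ by at most RHO, the shrunken local
    duration corresponds to at most `lease_duration` of real time, so D
    expires the lease no later than O does (Td <= Te).
    """
    return t_g + lease_duration * (1 - RHO)
```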

Paper Takeaways

This paper explores a wide range of topics and presents some intriguing observations.

  • Correct system operation should not be contingent upon perfectly synchronized clocks. Systems must maintain correctness even when clocks drift apart, or must explicitly compensate for the drift. This principle applies universally across all systems examined in the paper. Notably, even in scenarios where external correctness is compromised, such as external consistency violations in Harp, internal correctness is always preserved.

  • While transactions effectively ensure the internal consistency of databases, external consistency can only be guaranteed through the use of strong reads (both in single-key and multi-key scenarios). However, strong reads often incur significant performance overhead. Leases offer a potential solution for achieving strong reads, but their effectiveness is inherently tied to the accuracy of clock synchronization.

Paper Review

This paper, while seemingly straightforward, presents a nuanced and intricate exploration of various concepts. It delves into numerous systems developed by the MIT PDOS lab during the 1990s, each of which could be a subject of independent research. This comprehensive work is an essential read for anyone passionate about distributed systems.
