
Paper Insights #18 - Practical Uses of Synchronized Clocks in Distributed Systems

This influential paper was authored by Barbara Liskov, a renowned computer scientist who pioneered the field of distributed systems.

Paper Link

The paper provides a valuable overview of several groundbreaking systems:

  • At-most-once delivery (SCMP): This system ensures that a message is delivered at most once, preventing duplicate messages.
  • Authenticator Systems (Kerberos): This system focuses on secure authentication and authorization within distributed environments.
  • Cache consistency (Echo): This system addresses the challenges of maintaining data consistency across distributed caches.
  • Distributed Databases (Thor): This system explores the design and implementation of distributed databases.
  • Replicated File System (Harp): This system investigates the principles of replicating files across multiple servers for improved availability and performance.

While many of these concepts may seem outdated in the context of modern computing, studying them provides crucial insights into the foundational principles of distributed systems. Understanding how these early systems addressed challenges like message delivery, security, and data consistency helps us appreciate the advancements made in contemporary distributed computing technologies.

Synchronized Clocks

The existence of perfectly synchronized clocks across all computers would have dramatically simplified the challenges of distributed systems.

With perfect clock synchronization:

  • All events occurring within the distributed system could be ordered based on their timestamps.
  • All participating systems (assuming they are not exhibiting Byzantine failures) would agree on the same order of events.

This total ordering would have elegantly solved fundamental problems like atomic broadcast and consensus. Both safety (ensuring that certain undesirable states are never reached) and liveness (guaranteeing that the system makes progress) could have been achieved in a synchronous environment.

Unfortunately, perfect clock synchronization is unattainable in practice.

Network Time Protocol

The closest approximation we have is Network Time Protocol (NTP), which synchronizes clocks across geographically distributed systems over the network. While NTP achieves high precision, typically within a few milliseconds, perfect synchronization is still not possible.

Architecture


NTP, designed by David L. Mills, employs a hierarchical network of systems to synchronize time across them. This hierarchical structure is organized into strata.

  • Stratum 0: Consists of highly precise time sources such as atomic clocks and GPS receivers.
  • Stratum 1: Comprises time servers maintained by governments and private organizations. These servers synchronize their time directly with Stratum 0 sources, typically to within a few microseconds.
  • Stratum 2 and above: Include servers and workstations that synchronize their time with servers in lower strata (e.g., Stratum 2 servers synchronize with Stratum 1 servers).

Algorithm


NTP operates by exchanging timing information between clients and servers.

  • Message Exchange:
    • A client sends an ECHO message to a server at its local time T0.
    • The server receives the message at its local time T1.
    • The server sends a reply message at its local time T2.
    • The client receives the reply at its local time T3.
  • Calculating Network Round Trip Time (RTT): RTT is the time elapsed from send to receive as recorded by the client, minus the time elapsed from receive to send as recorded by the server:
RTT = (T3 - T0) - (T2 - T1)    ..(i)
  • Estimating Clock Offset (θ): Assume the client's clock is behind the server's clock by θ (θ is negative if the client's clock is ahead). Then T0 + θ is the server's time at the moment the client sends the message. Adding half the RTT (the assumed one-way delay) should yield T1, so the following equation holds:
T0 + θ + RTT/2 = T1   ..(ii)

          Substituting (i) in (ii), we get:

θ = ((T1 - T0) + (T2 - T3)) / 2
  • Clock Adjustment:
    • The server's reply carries the timestamps T0, T1, and T2; the client records T3 upon receipt.
    • The client uses these four timestamps to calculate the clock offset (θ) and adjusts its local clock accordingly, as sketched below.
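
To make the arithmetic concrete, here is a minimal sketch in Python of the offset and RTT computation from the four timestamps (the function name and example values are illustrative, not part of NTP):

```python
def ntp_offset_and_rtt(t0, t1, t2, t3):
    """Compute the clock offset (theta) and round-trip time from one
    NTP-style exchange.

    t0: client clock when the request is sent
    t1: server clock when the request is received
    t2: server clock when the reply is sent
    t3: client clock when the reply is received
    """
    rtt = (t3 - t0) - (t2 - t1)            # equation (i)
    theta = ((t1 - t0) + (t2 - t3)) / 2    # derived from equation (ii)
    return theta, rtt

# Client is 5 units behind the server; each one-way trip takes 10 units.
theta, rtt = ntp_offset_and_rtt(t0=100, t1=115, t2=116, t3=121)
print(theta, rtt)  # 5.0 20
```

In this example the computed offset is exactly 5 because both one-way trips take the same time; in practice the two delays differ, which is one source of NTP's residual error.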

Assumptions and Limitations

  • Constant Time Flow Rate: The derivation assumes that the rate of time flow is identical for both the client and server clocks. This assumption may not always hold true in reality.
  • Bounded Message Delay: The calculation relies on the assumption that message transmission delays are bounded. However, network conditions can cause unpredictable delays.
Due to these assumptions, NTP clock synchronization achieves high accuracy with a high probability but cannot guarantee perfect synchronization.

Important Note: While clock synchronization can improve performance in many distributed systems, it should not be considered a fundamental requirement for correctness. System design should prioritize mechanisms that function correctly even with imperfect clock synchronization.

Precision

The algorithm synchronizes the clock with a skew of ε. At any given moment, a node's clock differs from the actual time by no more than ε/2.

External Consistency

External consistency, also known as strict serializability (for multi-object operations) or linearizability (for single-object operations), ensures that the effects of transactions are immediately visible to all clients in the order they were committed.

A simple analogy from my CS244b professor, Dr. D. Mazieres, illustrates this concept:

  • Suppose you add a new row to a table in a database and successfully commit the transaction.
  • You then call up a friend, tell them about the change, and ask them to read the database for the newly added row.
  • If your friend's read fails to retrieve the row, the database violates external consistency.

Formally, if transaction T2 is submitted by a client only after transaction T1 commits, then the database must order T2 after T1.

A database that guarantees external consistency is considered strictly serializable.

Application 1: At Most Once Delivery

A messaging system facilitates the delivery of messages from a producer (sender) to a consumer (receiver). At-most-once delivery guarantees that a message will be delivered to the consumer at most one time. This differs from at-least-once delivery, which ensures that each message is delivered at least once, potentially leading to duplicates. At-most-once delivery prioritizes avoiding duplicate messages, even if it means that some messages might be lost during transmission or processing.

Naive Approaches

  • No Sender Retries: Eliminating sender retries prevents duplicate messages, but can result in significant message loss due to network delays.
  • Sender-based Deduplication: Receivers maintain a table of active senders and track message IDs (which should be sequentially increasing). This allows for deduplication, but receivers cannot safely discard sender information, so the table grows indefinitely, which is not feasible in practice.

Slightly Better Approach

  • Handshake: Before sending messages, a handshake is performed between sender and receiver to establish the receiver's current message reception position. This approach has high overhead for scenarios with only a few messages.

The SCMP Protocol

SCMP utilizes synchronized clocks (though it can operate, albeit less efficiently, without them).

Receivers maintain records of recent communications. Senders include timestamps with each message. A message and its effects expire after a defined "lifetime interval" (p).

Receiver's data structure (G):

  • G.time: Receiver's time.
  • G.CT: Maps connection IDs (unique identifiers for sender-receiver communication channels) to connection information (including the last known timestamp ts for that connection). Entries in G.CT are removed if their last known timestamp is older than G.time - p - ε.
  • G.upper: An upper bound on the timestamps of all entries removed from G.CT (i.e., the latest timestamp removed).
  • G.latest: A timestamp larger than the timestamps of all accepted messages. It is usually set to G.time + β (where β is some increment) and committed to stable storage.

A message on connection C is considered new if its timestamp is greater than G.CT[C].ts, or greater than G.upper if C has no entry in G.CT.

Post-Crash Initialization: G.upper is initialized to G.latest after a crash.
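
Below is a minimal sketch of the receiver-side logic described above; the fields mirror G.time, G.CT, G.upper, and G.latest from the paper, but the code structure and method names are my own, and stable storage is not modeled:

```python
class ReceiverState:
    """Sketch of SCMP receiver state G (simplified: persistence and
    clock advancement are not modeled)."""

    def __init__(self, now, beta):
        self.time = now             # G.time: receiver's local clock
        self.ct = {}                # G.CT: connection id -> last accepted timestamp
        self.upper = float("-inf")  # G.upper: bound on timestamps removed from G.CT
        self.latest = now + beta    # G.latest: would be written to stable storage

    def accept(self, conn, ts):
        """Return True iff the message with sender timestamp `ts` on
        connection `conn` is new (not a potential duplicate)."""
        threshold = self.ct.get(conn, self.upper)
        if ts <= threshold:
            return False            # possible duplicate: reject
        self.ct[conn] = ts
        return True

    def expire(self, lifetime, epsilon):
        """Drop entries older than G.time - p - ε, raising G.upper."""
        cutoff = self.time - lifetime - epsilon
        for conn, ts in list(self.ct.items()):
            if ts < cutoff:
                self.upper = max(self.upper, ts)
                del self.ct[conn]
```

After a crash, `upper` would be re-initialized from the persisted `latest`, which is why a message can never be accepted twice across restarts.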

Limitations

If a connection C is forgotten by the receiver, G.upper might be used for recency checks instead of the last known timestamp for C. This can lead to valid messages being erroneously rejected as duplicates.

Correctness

Clock synchronization improves efficiency but is not strictly necessary for SCMP's correctness. The worst a badly skewed clock can do is cause a slow node's messages to be rejected as outdated.

Application 2: Authentication Tokens

Ticket-Based Authentication (Kerberos):

Communication between a client (C) and a server (S) is controlled by a ticket. The client obtains the ticket from a Ticket Granting Server (TGS). The TGS provides the client with:
  • TCS: A ticket specific to the server (S).
  • KCS: A secret session key shared between the client and the server.
This information is transmitted over an encrypted channel to ensure it is delivered only to the client.

In subsequent communications with S, the client presents the ticket (encrypted under S's secret key, so only S can decrypt it); the ticket itself carries the session key, which the client then uses to protect its requests.

Each ticket has an expiration time, after which communication using that ticket is prohibited. If S's clock is significantly behind, it might erroneously accept expired tickets; avoiding this would otherwise require S to verify each ticket with the TGS before processing requests, increasing overhead.

The server (S) must never accept and process an expired ticket. Synchronized clocks between S and TGS help ensure that S's clock is within an acceptable range of the TGS's clock, minimizing the risk of accepting expired tickets.

Correctness

Here there is a hard reliance on clock synchronization for absolute correctness. However, the author points out that the primary concern in practice is the theft of tickets from unattended workstations, which attackers can exploit while the tickets are still valid.

Authenticators and Message Replay Prevention

Authenticators are encrypted timestamps generated using the shared session key (KCS), known only to the client and server. Each message includes an authenticator. Upon receiving a message, the server verifies the authenticity of the message and its timestamp to prevent replay attacks (where an old message is retransmitted).

Similar to at-most-once delivery systems, authenticators help ensure that each message is processed only once. Clock synchronization plays a role in minimizing the acceptance of duplicate messages.
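
As a rough sketch of the idea (not Kerberos's actual wire format), the server can verify an authenticated timestamp with the shared session key and reject anything outside a small freshness window; here the encrypted timestamp is approximated with an HMAC, and all names and the window size are illustrative:

```python
import hashlib
import hmac
import time

SKEW = 300.0  # assumed freshness window in seconds (~5 minutes)

def make_authenticator(session_key, now=None):
    """Client side: a timestamp authenticated with the shared session key."""
    ts = repr(now if now is not None else time.time()).encode()
    tag = hmac.new(session_key, ts, hashlib.sha256).digest()
    return ts, tag

def verify_authenticator(session_key, ts, tag, seen, now=None):
    """Server side: check integrity, freshness, and replay."""
    expected = hmac.new(session_key, ts, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return False                     # forged or corrupted message
    now = now if now is not None else time.time()
    if abs(now - float(ts)) > SKEW:
        return False                     # too old (or too far in the future)
    if (ts, tag) in seen:
        return False                     # replay of an already-seen message
    seen.add((ts, tag))
    return True
```

Note how the freshness window bounds the server's state: it only needs to remember authenticators for roughly SKEW seconds, which is exactly where clock synchronization helps.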

Application 3: Cache Consistency

There are only two hard things in Computer Science: cache invalidation and naming things.

- Phil Karlton

Clients maintain local copies of server objects as cached data. This caching mechanism enhances performance by eliminating the need to repeatedly fetch objects for access. However, it necessitates a mechanism to inform clients of server-side object modifications, triggering the invalidation of local copies and subsequent retrieval of updated values from the server. This is required by the Echo file system.

Caching strategies can be broadly categorized into two types:

  • Write-Through Cache: Modifications are propagated to the source before the write is acknowledged. In the context of Echo, write-through caching involves the server invalidating object copies in all client caches prior to committing the modification. This approach necessitates a lease mechanism on the server to track clients holding references to the object. If a client becomes unreachable, the server must await lease expiration.

  • Write-Behind Cache: Modifications are buffered at the client and written back to the server later. Leases are differentiated into read leases (which multiple clients may hold simultaneously) and write leases (held exclusively by a single client). A new write lease is granted only after all existing leases have been relinquished or have expired, and only the client holding the write lease may modify the file.

Lease mechanisms can be implemented in two ways:

  • Explicit Client Acknowledgment: Leases are assumed never to expire unless explicitly relinquished by the client. This relies on the unrealistic assumption of faultless processes and networks; the assumption is tolerable in a multi-processing environment on a single machine, but it does not hold in distributed systems.

  • Clock Synchronization-Based Leases: Leases are assigned timestamps based on synchronized clocks. However, clock desynchronization can render the system inoperable.

Neither of these lease mechanisms is entirely satisfactory. Clock synchronization is often adopted as an acceptable compromise to enable system functionality.
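
Here is a minimal sketch of the clock-synchronization-based variant, assuming a known skew bound ε that the server waits out before treating a lease as expired; the class and method names are mine, not Echo's:

```python
import time

EPSILON = 0.05  # assumed bound on client/server clock skew, in seconds

class LeaseServer:
    def __init__(self):
        self.read_leases = {}  # object id -> list of lease expiry times

    def grant_read_lease(self, obj, duration):
        """Grant a read lease; the client may cache `obj` until it expires."""
        expiry = time.time() + duration
        self.read_leases.setdefault(obj, []).append(expiry)
        return expiry

    def can_commit_write(self, obj):
        """A write (and the corresponding invalidation) may be committed only
        once every read lease has expired -- padded by EPSILON so that a
        client whose clock runs slightly behind has also stopped reading."""
        now = time.time()
        live = [e for e in self.read_leases.get(obj, []) if now < e + EPSILON]
        self.read_leases[obj] = live
        return not live
```

If clocks drift beyond EPSILON, a client may keep serving stale cached data after the server has committed a write, which is exactly the hazard described above.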

Correctness

Clock desynchronization compromises system correctness. This highlights a strong dependency on clock synchronization for proper system operation. However, for many caching applications, precise lease correctness is not a critical requirement. Applications utilizing such caching systems should be designed to function correctly even if lease correctness is not guaranteed, as demonstrated in the next application example.

Application 4: Atomicity

Transaction atomicity is a fundamental requirement in SQL databases, ensuring that a series of operations within a transaction either all succeed or all fail as a unit.

Traditionally, atomicity was achieved by using locks on the objects. However, the advent of optimistic concurrency control (OCC) introduced a new approach:

  • OCC assigns version numbers to each object. The versioned objects are stored within a local client-side cache.
  • Transactions can then be processed locally on the client.
  • The client submits the completed transaction to the server. The server validates the transaction by checking whether the version numbers of the objects involved still match the versions the client read. If they match, the transaction is committed; otherwise, it is aborted and the client must retry (see the sketch below).
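
A minimal sketch of the server-side validation step, with illustrative names rather than Thor's API:

```python
class VersionConflict(Exception):
    """Raised when a transaction's read set is stale; the client retries."""

class Store:
    def __init__(self):
        self.data = {}  # key -> (version, value)

    def commit(self, read_set, write_set):
        """read_set: key -> version observed by the client.
        write_set: key -> new value computed by the client."""
        # Validate: every object read must still be at the version the
        # client saw; otherwise a concurrent transaction got there first.
        for key, version in read_set.items():
            current_version, _ = self.data.get(key, (0, None))
            if current_version != version:
                raise VersionConflict(key)
        # Install writes, bumping each object's version.
        for key, value in write_set.items():
            current_version, _ = self.data.get(key, (0, None))
            self.data[key] = (current_version + 1, value)
```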

OCC is supported by Thor, where clock synchronization can significantly enhance performance: synchronized clocks contribute to better cache coherence across clients, and fresher cached objects reduce the likelihood of version conflicts, leading to fewer transaction aborts and higher overall commit rates.

Correctness

Clock synchronization is not necessary for the correctness of transaction commit as all commits ultimately occur on the server. 

While the underlying cache may exhibit inconsistencies due to its reliance on an imperfect lease mechanism, the application layer, in the context of transaction atomicity, maintains correctness.

Application 5: Commit Windows

Harp is a replicated file system that supports the Virtual File System (VFS) interface and is designed to work with Network File System (NFS). Unlike file systems that support complex transactions (e.g. Transactional NTFS), Harp treats each file operation as an independent atomic unit. This approach effectively transforms the file system into a key-value store (i.e., a NoSQL database) where each file is treated as a key.

Quorum-Based Replication

To ensure data consistency, Harp employs a majority voting mechanism within a quorum of replicas for committing updates to any file.

Membership Protocols and View Changes

Harp operates within a "view" where all replicas are aware of each other. Views change dynamically to accommodate server failures or additions, but each view always contains a majority of the servers. Since any two majorities intersect, each new view includes at least one replica from the previous view, preserving the history of committed transactions.

Upon a view change, the new view elects a leader, which gathers the latest state from the view's replicas before accepting new transactions.

Read Operations

In such quorum-based key-value stores, all reads are single-key reads. They can be classified into two types (a sketch of the strong variant follows this list):

  • Single-Key Strong Read (Read from Majority)

    • Reads data from a majority of replicas.
    • Guarantees linearizability and external consistency, even during view changes.
      • If a view change occurs mid-read, the read is simply retried.

  • Single-Key Relaxed Read (Read from Primary)

    • Reads data only from the primary replica.
    • Not linearizable due to potential view changes that may leave the primary unaware of updates.
    • May return stale data.
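
A minimal sketch of the strong variant referenced above: query replicas until a majority responds, then return the freshest value among the responses. The replica interface here is assumed, not Harp's:

```python
def strong_read(replicas, key):
    """Read `key` from a majority of replicas and return the freshest value.

    Each replica is assumed to expose get(key) -> (version, value),
    or None if it is unreachable.
    """
    majority = len(replicas) // 2 + 1
    responses = []
    for replica in replicas:             # in practice, issued in parallel
        resp = replica.get(key)
        if resp is not None:
            responses.append(resp)
        if len(responses) >= majority:
            break
    if len(responses) < majority:
        raise RuntimeError("no quorum: fewer than a majority responded")
    version, value = max(responses, key=lambda r: r[0])
    return value                         # highest version wins
```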

External Consistency

External consistency guarantees that the effects of all transactions committed up to a point in time T are visible to reads issued at T. With single-key relaxed reads, Harp does not provide this guarantee.

Impact of Clock Synchronization

Clock synchronization plays a crucial role in mitigating violations of external consistency. The primary replica can obtain short-lived leases from a sub-majority of backups (enough that, together with the primary itself, they form a majority). When a new view is established, the new primary must wait for all leases granted in the previous view to expire before accepting new requests. A primary that still holds valid leases from a sub-majority of backups can therefore serve reads knowing that no newer view has committed writes behind its back.
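
A sketch of how such a lease-guarded read might look on the primary, under the assumption that the primary tracks one lease per backup on its own clock (names and structure are mine, not Harp's):

```python
import time

class Primary:
    def __init__(self, num_backups):
        self.num_backups = num_backups
        self.lease_expiry = {}  # backup id -> lease expiry, on the primary's clock

    def record_lease(self, backup_id, duration):
        self.lease_expiry[backup_id] = time.time() + duration

    def holds_sub_majority(self):
        """True if enough backup leases are live that the primary plus
        those backups form a majority of the view."""
        now = time.time()
        live = sum(1 for expiry in self.lease_expiry.values() if expiry > now)
        return live + 1 > (self.num_backups + 1) // 2

    def relaxed_read(self, store, key):
        if not self.holds_sub_majority():
            raise RuntimeError("leases lost: fall back to a majority read")
        return store.get(key)
```

The waiting period during a view change is what makes this check meaningful: by the time the new primary starts serving, every such check on the old primary is guaranteed to fail, provided the clocks are close enough.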

Correctness

Even with clock synchronization, minor violations of external consistency can still occur.

Bonus: SQL Multi-Key Strong Reads (Transactional Reads)

For multi-key operations, SQL databases typically support transactional reads (also known as multi-key strong reads, snapshot reads, or read-only transactions). These reads guarantee that the values returned for all keys within the transaction reflect the state of the database at a consistent point in time, specifically the point at which the transaction commits.

Synchronized Rates

The paper explores the concept of relying on synchronized rates of time flow, rather than absolute synchronized clocks, to address certain challenges.

Consider an event e occurring on node O (e.g., lease expiration at a server). Another node D (e.g., the client holding the lease) depends on the occurrence of event e. For correctness, node D must perceive event e no later than node O does in absolute time, i.e., Td <= Te, where Td is the absolute time at which D considers e to have occurred and Te is the absolute time at which e actually occurs at O. For instance, the client must relinquish its lease no later, in absolute time, than the server reclaims it.

Event e is preceded by events g on node D and f on node O, which serve as a common reference point for establishing rate synchronization. Once this reference point is established, a conservative estimate of Td can be made to ensure Td <= Te. The paper provides a straightforward derivation of this estimate.

Rate synchronization is only beneficial when a communication mechanism exists to establish a common reference point. In the absence of such a mechanism, traditional clock synchronization remains necessary.
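
As a small sketch of the conservative estimate, assume the two clocks' rates agree to within a drift bound ρ (my notation, not the paper's), and let the client's lease-request send time serve as its reference event g, which precedes the server's grant at f. Shrinking the lease term by a factor (1 - ρ) on the client's clock then guarantees Td <= Te:

```python
RHO = 1e-4  # assumed bound on relative clock drift (100 ppm)

def conservative_local_expiry(t_g, lease_duration):
    """Local clock time at which node D must treat event e (lease
    expiration) as having occurred.

    t_g: D's local clock at reference event g (sending the lease request,
         which happens before the server starts the lease at f).
    lease_duration: the term the server counts from f on its own clock.

    If the two clocks' rates differ by at most RHO, the shrunken local
    duration corresponds to at most `lease_duration` of real time, so D
    expires the lease no later than O does (Td <= Te).
    """
    return t_g + lease_duration * (1 - RHO)
```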

Paper Takeaways

This paper explores a wide range of topics and presents some intriguing observations.

  • Correct system operation should not be contingent upon perfectly synchronized clocks. Systems must maintain correctness even when clocks drift apart, or must explicitly compensate for the drift. This principle applies universally across all systems examined in the paper. Notably, even in scenarios where external correctness is compromised, such as external consistency violations in Harp, internal correctness is always preserved.

  • While transactions effectively ensure the internal consistency of databases, external consistency can only be guaranteed through the use of strong reads (both in single-key and multi-key scenarios). However, strong reads often incur significant performance overhead. Leases offer a potential solution for achieving strong reads, but their effectiveness is inherently tied to the accuracy of clock synchronization.

Paper Review

This paper, while seemingly straightforward, presents a nuanced and intricate exploration of various concepts. It delves into numerous systems developed by the MIT PDOS lab during the 1990s, each of which could be a subject of independent research. This comprehensive work is an essential read for anyone passionate about distributed systems.
