This influential paper was authored by Barbara Liskov, a renowned computer scientist who pioneered the field of distributed systems.
The paper provides a valuable overview of several groundbreaking systems:
- At-most-once delivery (SCMP): This system ensures that a message is delivered at most once, preventing duplicate messages.
- Authenticator Systems (Kerberos): This system focuses on secure authentication and authorization within distributed environments.
- Cache consistency (Echo): This system addresses the challenges of maintaining data consistency across distributed caches.
- Distributed Databases (Thor): This system explores the design and implementation of distributed databases.
- Replicated File System (Harp): This system investigates the principles of replicating files across multiple servers for improved availability and performance.
While many of these concepts may seem outdated in the context of modern computing, studying them provides crucial insights into the foundational principles of distributed systems. Understanding how these early systems addressed challenges like message delivery, security, and data consistency helps us appreciate the advancements made in contemporary distributed computing technologies.
Synchronized Clocks
The existence of perfectly synchronized clocks across all computers would have dramatically simplified the challenges of distributed systems.
With perfect clock synchronization:
- All events occurring within the distributed system could be ordered based on their timestamps.
- All participating systems (assuming they are not exhibiting Byzantine failures) would agree on the same order of events.
This total ordering would have elegantly solved fundamental problems like atomic broadcast and consensus. Both safety (ensuring that certain undesirable states are never reached) and liveness (guaranteeing that the system makes progress) could have been achieved in a synchronous environment.
Unfortunately, perfect clock synchronization is unattainable in practice.
Network Time Protocol
Architecture
NTP, designed by David L. Mills, employs a hierarchical network of systems to synchronize time across them. This hierarchical structure is organized into strata.
- Stratum 0: Consists of highly precise time sources such as atomic clocks and GPS receivers.
- Stratum 1: Comprises time servers maintained by governments and private organizations. These servers synchronize their time directly with Stratum 0 sources, typically staying within a few microseconds of them.
- Stratum 2 and above: Include servers and workstations that synchronize their time with servers in lower strata (e.g., Stratum 2 servers synchronize with Stratum 1 servers).
Algorithm
NTP operates by exchanging timing information between clients and servers.
- Message Exchange:
- A client sends an ECHO message to a server at its local time T0.
- The server receives the message at its local time T1.
- The server sends a reply message at its local time T2.
- The client receives the reply at its local time T3.
- Calculating Network Round Trip Time (RTT): RTT is the time elapsed from send (T0) to receive (T3) as recorded by the client, minus the time elapsed from receive (T1) to send (T2) as recorded by the server: RTT = (T3 - T0) - (T2 - T1).
- Estimating Clock Offset (θ): Assume the client's clock is behind the server's clock by θ (θ is negative if the client's clock is ahead). Then T0 + θ is the server's time at the moment the client sends the message, and adding half of the RTT to it should yield T1. Solving T0 + θ + RTT/2 = T1 gives θ = ((T1 - T0) + (T2 - T3)) / 2.
- Clock Adjustment:
- The ECHO response includes timestamps T0, T1, T2, and T3.
- The client uses these timestamps to calculate the clock offset (θ) and adjust its local clock accordingly.
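Putting the exchange together, the two formulas can be sketched as a small function that computes both quantities from the four timestamps (variable names are illustrative):

```python
def ntp_offset_and_rtt(t0, t1, t2, t3):
    """Compute round-trip time and clock offset from one NTP-style exchange.

    t0: client send time (client clock)
    t1: server receive time (server clock)
    t2: server send time (server clock)
    t3: client receive time (client clock)
    """
    # Total elapsed time measured by the client, minus the server's
    # processing time between receiving and replying.
    rtt = (t3 - t0) - (t2 - t1)
    # Offset theta: how far the client's clock is behind the server's,
    # derived from t0 + theta + rtt/2 = t1.
    theta = ((t1 - t0) + (t2 - t3)) / 2
    return rtt, theta

# Example: client clock 5 units behind the server, symmetric one-way
# delay of 10 units, 1 unit of server processing time.
rtt, theta = ntp_offset_and_rtt(t0=100, t1=115, t2=116, t3=121)
print(rtt, theta)  # 20 5.0
```

Note that the derivation implicitly assumes the one-way delays are symmetric; asymmetric paths bias θ by half the delay difference.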
Assumptions and Limitations
- Constant Time Flow Rate: The derivation assumes that the rate of time flow is identical for both the client and server clocks. This assumption may not always hold true in reality.
- Bounded Message Delay: The calculation relies on the assumption that message transmission delays are bounded. However, network conditions can cause unpredictable delays.
Important Note: While clock synchronization can improve performance in many distributed systems, it should not be considered a fundamental requirement for correctness. System design should prioritize mechanisms that function correctly even with imperfect clock synchronization.
Precision
External Consistency
External consistency, also known as strict serializability (for multi-object operations) or linearizability (for single-object operations), ensures that the effects of transactions are immediately visible to all clients in the order they were committed.
A simple analogy from my CS244b professor, Dr. D. Mazieres, illustrates this concept:
- Suppose you add a new row to a table in a database and successfully commit the transaction.
- You then call a friend, tell them about the change, and ask them to read the newly added row from the database.
- If your friend's read fails to retrieve the row, the database violates external consistency.
Formally, if transaction T2 is submitted by a client only after transaction T1 commits, then the database must order T2 after T1.
A database that guarantees external consistency is considered strictly serializable.
Application 1: At Most Once Delivery
A messaging system facilitates the delivery of messages from a producer (sender) to a consumer (receiver). At-most-once delivery guarantees that a message will be delivered to the consumer at most one time. This differs from at-least-once delivery, which ensures that each message is delivered at least once, potentially leading to duplicates. At-most-once delivery prioritizes avoiding duplicate messages, even if it means that some messages might be lost during transmission or processing.
Naive Approaches
- No Sender Retries: Eliminating sender retries prevents duplicate messages, but can result in significant message loss due to network delays.
- Sender-based Deduplication: Receivers maintain a table of active senders and track message IDs (which should be sequentially increasing). This allows for deduplication, but requires receivers to store sender information indefinitely, which is infeasible without synchronized clocks and reliable sender retries.
Slightly Better Approach
- Handshake: Before sending messages, a handshake is performed between sender and receiver to establish the receiver's current message reception position. This approach has high overhead for scenarios with only a few messages.
The SCMP Protocol
SCMP utilizes synchronized clocks (though it can operate, albeit less efficiently, without them).
Receivers maintain records of recent communications. Senders include timestamps with each message. A message and its effects expire after a defined "lifetime interval" (p).
Receiver's data structure (G):
- G.time: Receiver's time.
- G.CT: Maps connection IDs (unique identifiers for sender-receiver communication channels) to connection information (including the last known timestamp ts for that connection). Entries in G.CT are removed if their last known timestamp is older than G.time - p - ε.
- G.upper: An upper bound on the last-known timestamps of connections removed from G.CT (i.e., the most recent such timestamp).
- G.latest: A timestamp larger than the timestamps of all messages accepted so far. It is usually set to G.time + β (for some increment β) and committed to stable storage.
Post-Crash Initialization: G.upper is initialized to G.latest after a crash.
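The receiver-side recency check described above can be sketched as follows. This is a simplified model, not the full SCMP protocol: the lifetime interval, skew bound, and class names are illustrative, and the `now` parameter stands in for the receiver's synchronized clock.

```python
import time

P = 60.0        # message lifetime interval (p in the text), seconds; illustrative
EPSILON = 5.0   # assumed clock-skew bound, seconds; illustrative

class Receiver:
    def __init__(self, latest_from_stable_storage=0.0):
        self.ct = {}  # connection id -> last accepted timestamp (G.CT)
        # After a crash, G.upper is initialized from G.latest on stable storage.
        self.upper = latest_from_stable_storage

    def accept(self, conn_id, ts, now=None):
        """Return True iff a message with timestamp ts on conn_id is accepted."""
        now = time.time() if now is None else now
        # Expire connection-table entries older than now - p - epsilon,
        # folding their timestamps into G.upper.
        cutoff = now - P - EPSILON
        for cid in [c for c, t in self.ct.items() if t < cutoff]:
            self.upper = max(self.upper, self.ct.pop(cid))
        # Recency bound: the last timestamp seen on this connection, or
        # G.upper if the connection has been forgotten (the source of the
        # false-duplicate rejections noted under Limitations).
        bound = self.ct.get(conn_id, self.upper)
        if ts <= bound:
            return False  # possible duplicate: reject
        self.ct[conn_id] = ts
        return True
```

A fresh timestamp on a known connection is accepted; replaying the same timestamp is rejected.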
Limitations
If a connection C is forgotten by the receiver, G.upper might be used for recency checks instead of the last known timestamp for C. This can lead to valid messages being erroneously rejected as duplicates.
Correctness
Clock synchronization improves SCMP's efficiency but is not strictly necessary for its correctness. A node with a slow clock may simply have its messages rejected because their timestamps appear outdated.
Application 2: Authentication Tokens
Ticket-Based Authentication (Kerberos):
- T_CS: A ticket specific to the server S.
- K_CS: A secret session key shared between the client C and the server S.
Correctness
Authenticators and Message Replay Prevention
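In Kerberos, the authenticator a client presents alongside its ticket carries a timestamp encrypted under the client-server session key. The server rejects authenticators whose timestamps fall outside an allowed clock-skew window, so it only needs to cache recently seen authenticators to catch replays. A minimal sketch of that freshness check (the constant and function names are illustrative, and encryption is omitted):

```python
MAX_SKEW = 300.0  # allowed clock skew, seconds; Kerberos traditionally uses 5 minutes

def authenticator_is_fresh(auth_timestamp, server_now, replay_cache):
    """Accept an authenticator only if it is recent and not a replay."""
    # Reject timestamps outside the skew window; this relies on client
    # and server clocks being loosely synchronized.
    if abs(server_now - auth_timestamp) > MAX_SKEW:
        return False
    # Within the window, a cache of recently seen authenticators catches
    # exact duplicates; entries older than the window can be evicted.
    if auth_timestamp in replay_cache:
        return False
    replay_cache.add(auth_timestamp)
    return True
```

The synchronized clocks bound the size of the replay cache: without the timestamp check, the server would have to remember every authenticator it has ever seen.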
Application 3: Cache Consistency
Caching strategies can be broadly categorized into two types:
- Write-Through Cache: Modifications are committed to the source synchronously with the cache. In the context of Echo, write-through caching involves the server invalidating object copies in all client caches prior to committing the modification. This approach necessitates a lease mechanism on the server to track clients holding references to the object. If a client becomes unreachable, the server must await lease expiration.
- Write-Behind Cache: Leases are differentiated into read leases (held by multiple clients) and write leases (exclusively held by a single client). A new write lease is granted only after all existing leases have been either relinquished or expired. Only the client possessing the write lease is authorized to modify the file.
Lease mechanisms can be implemented in two ways:
- Explicit Client Acknowledgment: Leases are assumed to have infinite expiration unless explicitly relinquished by the client. This relies on the unrealistic assumption of faultless processes and networks, which may be acceptable in a multiprocessing environment but does not hold in a distributed system.
- Clock Synchronization-Based Leases: Leases are assigned timestamps based on synchronized clocks. However, clock desynchronization can render the system inoperable.
Neither of these lease mechanisms is entirely satisfactory. Clock synchronization is often adopted as an acceptable compromise to enable system functionality.
Correctness
Clock desynchronization compromises system correctness. This highlights a strong dependency on clock synchronization for proper system operation. However, for many caching applications, precise lease correctness is not a critical requirement. Applications utilizing such caching systems should be designed to function correctly even if lease correctness is not guaranteed, as demonstrated in the next application example.
Application 4: Atomicity
Transaction atomicity is a fundamental requirement in SQL databases, ensuring that a series of operations within a transaction either all succeed or all fail as a unit.
Traditionally, atomicity was achieved by using locks on the objects. However, the advent of optimistic concurrency control (OCC) introduced a new approach:
- OCC assigns version numbers to each object. The versioned objects are stored within a local client-side cache.
- Transactions can then be processed locally on the client.
- The client submits the completed transaction to the server. The server validates the transaction by checking if the version numbers of the objects involved still match the cached versions. If the versions match, the transaction is committed; otherwise, it is aborted and the client must retry.
Thor supports OCC, and clock synchronization can significantly enhance its performance: better-synchronized clocks yield better cache coherence across clients, and fresher cached objects reduce the likelihood of version conflicts, leading to fewer transaction aborts and higher overall commit rates.
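The server-side validation step can be sketched as follows. This is a simplified single-server model with hypothetical names; real OCC implementations such as Thor's also track read/write sets across concurrent transactions and a serialization order.

```python
class OCCServer:
    def __init__(self, objects):
        # object id -> (version, value); versions start at 1
        self.objects = {k: (1, v) for k, v in objects.items()}

    def commit(self, read_versions, writes):
        """Commit iff every object the client read still has the version it saw.

        read_versions: {object_id: version the client observed}
        writes: {object_id: new value}
        """
        # Validation: any version mismatch means another transaction
        # committed in the meantime, so this one must abort.
        for oid, seen_version in read_versions.items():
            current_version, _ = self.objects[oid]
            if current_version != seen_version:
                return False  # stale read: abort, client retries
        # Validation passed: apply writes, bumping versions.
        for oid, value in writes.items():
            version, _ = self.objects[oid]
            self.objects[oid] = (version + 1, value)
        return True
```

A client that cached `x` at version 1 commits successfully; a second client still holding version 1 then aborts, because the version has moved on.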
Correctness
Clock synchronization is not necessary for the correctness of transaction commit as all commits ultimately occur on the server.
While the underlying cache may exhibit inconsistencies due to its reliance on an imperfect lease mechanism, the application layer, in the context of transaction atomicity, maintains correctness.
Application 5: Commit Windows
Harp is a replicated file system that supports the Virtual File System (VFS) interface and is designed to work with Network File System (NFS). Unlike file systems that support complex transactions (e.g. Transactional NTFS), Harp treats each file operation as an independent atomic unit. This approach effectively transforms the file system into a key-value store (i.e., a NoSQL database) where each file is treated as a key.
Quorum-Based Replication
To ensure data consistency, Harp employs a majority voting mechanism within a quorum of replicas for committing updates to any file.
Membership Protocols and View Changes
Harp operates within a "view" where all replicas are aware of each other. Views may change dynamically to accommodate server failures or additions, but always maintain a majority of the servers. This ensures that each new view includes at least one replica from the previous view, preserving the history of committed transactions.
Upon change, a new view elects a leader, which gathers the latest view information from all replicas before accepting new transactions.
Read Operations
In such quorum-based key-value transactional stores, all reads are single-key reads. These can be further classified into two types:
- Single-Key Strong Read (Read from Majority)
- Reads data from a majority of replicas.
- Guarantees linearizability and external consistency, even during view changes.
- If a view change occurs during the read, the operation is simply retried.
- Single-Key Relaxed Read (Read from Primary)
- Reads data only from the primary replica.
- Not linearizable due to potential view changes that may leave the primary unaware of updates.
- May return stale data.
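A strong read can be sketched as querying a majority of replicas and returning the highest-versioned response. The replica representation below is illustrative, and Harp's actual protocol serves reads through the primary under leases rather than contacting a majority per read; the sketch only shows why a majority quorum suffices.

```python
def strong_read(replicas, key):
    """Read key from a majority of replicas; return the freshest value.

    replicas: list of dicts mapping key -> (version, value). For
    simplicity we query the first majority-sized slice; a real system
    would query all replicas and wait for the first majority to respond.
    """
    majority = len(replicas) // 2 + 1
    responses = [r[key] for r in replicas[:majority] if key in r]
    if len(responses) < majority:
        raise RuntimeError("not enough replicas responded")
    # Any majority intersects the majority that acknowledged the latest
    # committed write, so the maximum version seen is at least as new as
    # the latest committed version: this is what makes the read strong.
    return max(responses)  # tuples compare by version first
```

A relaxed read, by contrast, would consult a single replica and could miss a write committed by a majority that excludes it.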
External Consistency
External consistency guarantees that the effects of all transactions committed up to a point T are visible at point T. Harp does not guarantee external consistency when single-key relaxed reads are used.
Impact of Clock Synchronization
Clock synchronization plays a crucial role in mitigating violations of external consistency. The primary replica can obtain short-lived leases from a sub-majority of backups. When a new view is established, the new primary must wait for all leases from the previous primary to expire before accepting new requests. This ensures that the primary holds valid leases from a sub-majority of backups during read operations.
Correctness
Even with clock synchronization, minor violations of external consistency can still occur.
Bonus: SQL Multi-Key Strong Reads (Transactional Reads)
For multi-key operations, SQL databases typically support transactional reads (also known as multi-key strong reads, snapshot reads, or read-only transactions). These reads guarantee that the values returned for all keys within the transaction reflect the state of the database at a consistent point in time, specifically the point at which the transaction commits.
Synchronized Rates
The paper explores the concept of relying on synchronized rates of time flow, rather than absolute synchronized clocks, to address certain challenges.
Consider an event e occurring on node O (e.g., lease expiration). Another node D (e.g., the client holding the lease) is dependent on the occurrence of event e. For correctness, node D must perceive event e no later than node O in absolute time, i.e., Td <= Te, where Td is the absolute time of event e at node D and Te is its absolute time of occurrence at node O. This principle is crucial for maintaining system consistency. For instance, the client node must relinquish the lease before the server does in absolute time.
Event e is preceded by events g on node D and f on node O, which serve as a common reference point for establishing rate synchronization. Once this reference point is established, a conservative estimate of Td can be made to ensure Td <= Te. The paper provides a straightforward derivation of this estimate.
Rate synchronization is only beneficial when a communication mechanism exists to establish a common reference point. In the absence of such a mechanism, traditional clock synchronization remains necessary.
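Under a bounded drift rate ρ (each clock runs within a factor of 1 ± ρ of real time), the conservative estimate can be sketched as a simple scaling rule: measuring from the common reference event, node D shortens the interval so that even if its own clock runs slow and O's runs fast, D still observes expiry no later than O. The function name and example values are illustrative.

```python
def conservative_wait(interval_on_o, rho):
    """Local wait for D guaranteeing Td <= Te in absolute time.

    interval_on_o: Te - Tf, the interval as measured on O's clock.
    rho: bound on clock drift rate (every clock's rate is within
         1 +/- rho of real time), with 0 <= rho < 1.
    """
    # Worst case: O's clock runs fast, so the interval lasts at least
    # I / (1 + rho) of real time; D's clock runs slow, so a local wait
    # of W takes up to W / (1 - rho) of real time. Requiring
    #   W / (1 - rho) <= I / (1 + rho)
    # yields the conservative local wait:
    return interval_on_o * (1 - rho) / (1 + rho)

# Example: a 10-second lease with a 1% drift bound; D waits slightly
# less than 10 seconds on its own clock.
w = conservative_wait(10.0, 0.01)
```

Note the estimate needs no absolute clock agreement, only the common reference event and the rate bound, which matches the paper's point that rate synchronization is the weaker, often sufficient, assumption.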
Paper Takeaways
This paper explores a wide range of topics and presents some intriguing observations.
- Correct system operation should not be contingent upon perfectly synchronized clocks. Systems must remain correct even when clock discrepancies occur, or must compensate for them. This principle applies across all systems examined in the paper. Notably, even where external correctness is compromised, as with the external consistency violations in Harp, internal correctness is always preserved.
- While transactions effectively ensure the internal consistency of databases, external consistency can only be guaranteed through the use of strong reads (both in single-key and multi-key scenarios). However, strong reads often incur significant performance overhead. Leases offer a potential solution for achieving strong reads, but their effectiveness is inherently tied to the accuracy of clock synchronization.