
Paper Insights #7 - Rethink the Sync

This paper, presented at the highly regarded OSDI 2006 conference, represents a significant contribution to the field of computer systems. Authored by Nightingale et al. (University of Michigan), it distinguishes itself by providing empirical evidence of a magnitude of performance improvement in file systems that had previously eluded researchers. Beyond this quantitative achievement, the paper also introduces a paradigm shift by altering fundamental operating system behavior based on external observation. We are back, once again, to discussing what external consistency means and how one can be lazy about ensuring it.

Paper Link

Must Read: The Design and Implementation of a Log-Structured File System, where I introduced several basic file system concepts.

Let's begin with some basic concepts.

Inter Process Communication (IPC) in Linux

Linux provides several mechanisms for processes to communicate with each other:

Pipes

Pipes enable one-way data flow between processes, typically used to chain commands.

Example: ls -l /etc | grep conf

This command pipes the output of ls -l /etc (listing directory contents) to grep conf (filtering for lines containing "conf").
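
The same one-way channel can be created programmatically. Here is a minimal C sketch (assuming the POSIX pipe(), fork(), and wait() calls), in which a child writes into the pipe and the parent reads from it:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    int fds[2];                   /* fds[0]: read end, fds[1]: write end */
    if (pipe(fds) == -1) return 1;

    if (fork() == 0) {            /* child: writes into the pipe */
        close(fds[0]);
        const char *msg = "hello from child\n";
        write(fds[1], msg, strlen(msg));
        close(fds[1]);
        _exit(0);
    }

    close(fds[1]);                /* parent: reads from the pipe */
    char buf[64];
    ssize_t n = read(fds[0], buf, sizeof(buf) - 1);
    if (n > 0) { buf[n] = '\0'; printf("%s", buf); }
    close(fds[0]);
    wait(NULL);
    return 0;
}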

Unix Domain Sockets (UDS)

UDS facilitate local communication between processes on the same system, similar to what network sockets offer across systems. They offer various communication modes:
  • SOCK_STREAM (reliable, connection-oriented, similar to TCP)
  • SOCK_DGRAM (connectionless, datagram-based, similar to UDP)
  • SOCK_SEQPACKET (connection-oriented, sequenced packets, similar to SCTP)
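
A minimal C sketch of a connected UDS pair, using the POSIX socketpair() call (a named socket bound to a file system path via bind() works similarly):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/wait.h>

int main(void) {
    int sv[2];   /* sv[0] and sv[1] are two connected UDS endpoints */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1) return 1;

    if (fork() == 0) {            /* child: sends a message */
        close(sv[0]);
        const char *msg = "ping";
        send(sv[1], msg, strlen(msg), 0);
        close(sv[1]);
        _exit(0);
    }

    close(sv[1]);                 /* parent: receives it */
    char buf[16];
    ssize_t n = recv(sv[0], buf, sizeof(buf) - 1, 0);
    if (n > 0) { buf[n] = '\0'; printf("received: %s\n", buf); }
    close(sv[0]);
    wait(NULL);
    return 0;
}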

Signals

Signals are used to notify processes of events. Common signals:
  • SIGINT (Ctrl+C): Interrupt signal.
  • SIGTERM: Graceful termination request.
  • SIGKILL: Forceful termination.
  • SIGSEGV: Segmentation fault (invalid memory access).
  • SIGHUP: Hangup signal (often used for configuration reloading).
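
A minimal C sketch of catching SIGINT with the POSIX sigaction() interface:

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t got_sigint = 0;

static void on_sigint(int signo) {
    (void)signo;
    got_sigint = 1;                  /* only async-signal-safe work here */
}

int main(void) {
    struct sigaction sa;
    sa.sa_handler = on_sigint;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGINT, &sa, NULL);    /* install handler for Ctrl+C */

    while (!got_sigint)
        pause();                     /* sleep until a signal arrives */

    printf("caught SIGINT, exiting gracefully\n");
    return 0;
}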

Shared Memory

Shared memory allows multiple processes to access a common memory region, enabling fast data exchange. Synchronization mechanisms (e.g., semaphores, mutexes) are essential to prevent data corruption.
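
A minimal C sketch using an anonymous shared mapping, with a POSIX semaphore placed inside the shared region for synchronization (may need -pthread when compiling):

#include <stdio.h>
#include <semaphore.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

struct region {
    sem_t ready;   /* the semaphore itself lives in the shared region */
    int value;
};

int main(void) {
    /* anonymous shared mapping, visible to both parent and child */
    struct region *r = mmap(NULL, sizeof(*r), PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (r == MAP_FAILED) return 1;
    sem_init(&r->ready, 1 /* shared between processes */, 0);

    if (fork() == 0) {       /* child: writes, then signals readiness */
        r->value = 42;
        sem_post(&r->ready);
        _exit(0);
    }

    sem_wait(&r->ready);     /* parent: waits for the child's write */
    printf("value = %d\n", r->value);
    wait(NULL);
    return 0;
}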

Speculator

The speculator mechanism originated in distributed file systems.

Suppose a read is performed by a client process:

  • Upon a Read() call, a child process is forked. It retrieves data from the local cache and continues execution from the point where the parent left off.
  • The parent process waits for the server to complete the read operation.
  • If the server's response differs from the cached data, the child process is terminated, and the parent continues.
  • If the responses match, the parent process is terminated.

    This optimization leverages the assumption that cached data is typically accurate, minimizing latency. In cases of discrepancies, the child process is terminated to prevent propagation of incorrect values.
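
A simplified user-space sketch of this mechanism follows. Here cache_read() and server_read() are hypothetical placeholders, and the real Speculator checkpoints processes inside the kernel rather than relying on a literal fork():

#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helpers, assumed to exist for this sketch. */
void cache_read(int fd, void *buf, size_t len);
void server_read(int fd, void *buf, size_t len);

ssize_t speculative_read(int fd, void *buf, size_t len) {
    cache_read(fd, buf, len);      /* optimistically serve from the local cache */

    pid_t child = fork();
    if (child == 0)
        return len;                /* child speculates ahead with cached data */

    /* parent blocks on the authoritative answer from the server */
    char server_buf[len];
    server_read(fd, server_buf, len);

    if (memcmp(buf, server_buf, len) == 0)
        _exit(0);                  /* cache was correct: let the child live on */

    kill(child, SIGKILL);          /* misprediction: discard the speculation */
    memcpy(buf, server_buf, len);  /* parent continues with the correct data */
    return len;
}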

    Newer versions of the speculator have expanded its capabilities to include process checkpointing and rollback, analogous to branch prediction in CPUs. This allows for speculative execution based on predicted outcomes. If the prediction is incorrect, the process can be rolled back to a previous checkpoint, facilitating recovery.

    Speculation and Dependency Tracking

    Speculator maintains a record of processes and kernel objects that depend on the success or failure of a speculated execution. As the processes interact with kernel objects (e.g., UDS), the speculator tracks causal dependencies.

    Example: Inter-Process Communication



Extending our previous client example in a distributed file system:
• If the child process, after a speculative read, sends a message to another process via IPC, the recipient process is also tracked as part of the speculative execution.
• If the speculative read fails, both processes are terminated or rolled back to a previous state.

      The speculator ensures that speculated states are never visible to external observers (e.g., users), maintaining system consistency.

      Journaling File System

The hierarchical organization of a file system can be represented as a tree data structure, where files and empty directories constitute the leaf nodes.

      Given two temporal states, T1 and T2, of this tree, the delta between them can be serialized into a journal, a log of structural modifications to the file system's data representation.

      A journaling file system is built upon a base file system. Changes to the file system are initially logged in memory as a journal. When a checkpoint occurs or a sync operation is initiated, these logged changes are permanently written to the underlying disk. After successful disk commitment, the journal entries are removed.

      Popular examples of journaling file systems are NTFS, ext3, ext4, XFS, and JFS.

      Idempotency at Work

      Journaling file systems rely on the idempotency of file system operations. This allows recorded journal entries to be replayed repeatedly until successful, enabling recovery from crashes at any point. The journal, also known as a write-ahead log (WAL), functions similarly to a transaction log by recording changes before they are committed.
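
A toy sketch of crash recovery by replay, assuming each journal entry carries a full block image (write_block() is a hypothetical helper):

/* Toy sketch: replaying idempotent journal entries until all succeed. */
int write_block(long block_no, const unsigned char *data);  /* hypothetical */

struct journal_entry {
    long block_no;                 /* destination block on disk */
    unsigned char data[4096];      /* full block image to write */
};

void recover(struct journal_entry *log, int n) {
    for (int i = 0; i < n; i++) {
        /* Writing a full block image is idempotent: applying it twice
           leaves the disk in the same state as applying it once, so a
           crash mid-recovery is handled by simply replaying again. */
        while (write_block(log[i].block_no, log[i].data) != 0) {
            /* retry until the write sticks */
        }
    }
}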

      Operation Modes

      Journaling file systems typically offer multiple modes:

      1. Ordered mode: Only metadata (inode blocks) is journaled. Data blocks are written directly to disk.
      2. Journaled mode: Both data and metadata are journaled.

      Metadata integrity is crucial because it provides the pointers to data blocks. Data blocks must be written to disk before their corresponding metadata.

      Everything is a File

Everything is a file is a core concept in Unix-like operating systems, including Linux. It captures a fundamental design philosophy. A key aspect is the use of file descriptors: integers that represent open files or resources. Because many system resources are treated like files, file descriptors provide a unified way to access them.

      All of the following are treated as a file in Linux and assigned a descriptor when opened:

      Regular Files

      • Text files: These contain plain text data. For example - /etc/passwd, /home/user/document.txt
      • Binary files: These contain executable code or other non-text data. For example - /usr/bin/python3, /usr/share/image.png

      Directories

      Directories themselves are files that contain metadata about the files and subdirectories they hold. You can use commands like ls to list their contents.

      For example - /home, /etc, /var, /tmp

      Device Files

      • Hard drives or storage devices. For example - /dev/sda, /dev/sdb
      • Terminals and pseudo-terminals. For example -  /dev/tty, /dev/pts/*
      • Random number generators. For example - /dev/random, /dev/urandom
      • /dev/null - a special file that discards all data written to it, and provides no data when read from. 
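
For instance, a device file opens and reads exactly like a regular file. A minimal C sketch that pulls random bytes through the ordinary descriptor interface:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* a device file opens just like a regular file */
    int fd = open("/dev/urandom", O_RDONLY);
    if (fd == -1) return 1;

    unsigned char buf[8];
    ssize_t n = read(fd, buf, sizeof(buf));  /* same read() as for any file */
    for (ssize_t i = 0; i < n; i++)
        printf("%02x", buf[i]);
    printf("\n");

    close(fd);
    return 0;
}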

      Unix Domain Sockets

      Represented as file system entries. These allow local inter-process communication. For example - /var/run/my_daemon.sock

      Procfs and Sysfs

      The /proc file system provides information about running processes and the kernel.
      The /sys file system provides information about hardware devices and kernel subsystems.

      Commit Buffering

When we call Flush(), the bytes are not written immediately. There are several layers of buffering that delay the actual commit.

      Commit Buffer Layers

There are two primary buffer layers - the OS layer (page cache) and the disk layer (drive cache).

      Page Cache

      Operating systems implement virtual memory, which maps virtual addresses used by processes to physical addresses in RAM. When RAM is fully utilized, less frequently used pages are swapped out to secondary storage (e.g., disk). Conversely, when RAM has available space, pages from disk can be brought into memory. This mechanism is known as page caching, where disk pages are stored in RAM to accelerate read operations.

      Drive Cache

Hard drives incorporate an internal cache to buffer write operations. This drive cache hides rotational latency for both writes and reads. However, the cache's limited size (a few MB) can cause blocking when it is full, as subsequent operations must wait out the rotational latency. This cache forms an additional buffering layer.

Once writes are buffered in the drive cache, the actual commit to the disk may happen out of order. For example, if there were two sequential writes - A followed by B - then it is possible that B is written first, followed by A.

Immediate Write Reporting: If a drive uses immediate write reporting, it tells the operating system that a write is done as soon as the data is buffered in its cache. This can significantly improve perceived write performance. However, writes may be lost if there is a power failure. In fact, several hard drives don't even allow immediate write reporting to be disabled.

Battery-Backed Drives: To avoid losing writes in the event of a power failure, some hard drives have built-in battery packs to ensure that all cached writes are committed. However, the battery is itself prone to failure.


      Flushing Data - Flush()

A Flush() operation (e.g., fflush() in C), or the implicit flush upon file Close(), transfers data from a process's buffers to the operating system's page cache (in-memory buffers). These pages are marked "dirty", indicating they are not yet committed to disk. While Flush() protects against process crashes, it does not guarantee data persistence in the event of an operating system or power failure.

      Synchronizing File Data - fsync

      The fsync system call, when invoked with a file descriptor, instructs the operating system to commit the contents of in-memory buffers to the physical disk. The operating system, in turn, interacts with the file system, which may employ optimizations. Fortunately, most file systems reliably honor fsync, ensuring data is written to disk. fsync provides resilience against operating system failures but is still vulnerable to power outages.

      Forcing Disk Commit - sync

      Even after fsync, the disk itself maintains a buffer. A power failure can result in data loss from this disk buffer. The sync system call forces the disk to flush its buffer, guaranteeing that all data is physically committed to the storage medium. This provides the highest level of data persistence, mitigating the risk of power failures.
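
These three layers map onto three calls in a short C program (note that whether fsync() also flushes the drive cache depends on the file system and its barrier configuration):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    FILE *f = fopen("data.txt", "w");
    if (!f) return 1;

    fprintf(f, "hello\n");   /* sits in the process's stdio buffer */
    fflush(f);               /* level 1: process buffer -> OS page cache */
    fsync(fileno(f));        /* level 2: page cache -> disk (survives OS crash) */
    fclose(f);

    sync();                  /* level 3: flush all buffered writes system-wide */
    return 0;
}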

      Consistency in File System

      Consider a disk state, D. Three sequential changes, A, B, and C, are made to the disk. In the event of a crash during these writes, the recovered disk state must be a consistent prefix of these changes:

      • D + {A}
      • D + {A, B}
      • D + {A, B, C}

      An inconsistent state, such as D + {B, C}, is unacceptable. This prefix guarantee is the minimum consistency requirement for an operating system.

      Failure Scenarios

      Two primary failure scenarios impact disk consistency:

• Operating System Crash: An OS crash should not lead to file system inconsistency. While data loss is inevitable, the file system's structure must remain valid. This is possible because the OS can maintain the sequential order of changes.
• Power Failure: Power failures pose a significant risk due to drive-level caching and delayed writes. Drives may report writes as complete before they are physically committed, and the buffered writes may be physically written out of order.

      File systems such as Ext3, ReiserFS, and JFS can encounter consistency issues due to these drive-level complexities.

      The External Sync

      The concept of external consistency is important for file systems. Fundamentally, any computer system aims to maintain external consistency. This principle drives numerous operating system optimizations.

      Optimizations Based on Observed Behavior

      Let's begin with some fun, hypothetical examples:

      • Consider the scenario of writing a file. If the operating system can reliably determine that the file will never be accessed again, the commit operation could be omitted! However, if the user shares the file's name with someone else (e.g., via email) who may later want to read the file, the OS would commit the write to storage.
• Let's imagine another scenario where the computer has an AI webcam that can tell whether the user is looking at the screen. The OS could delay displaying the output of a cat command until the user's attention is directed towards the screen, optimizing resource usage without affecting the observed result!

      The above are all hypothetical examples. But here are some common examples where writes are buffered by real-world OSes:

• External Drive Ejection (Windows): Serial I/O through USB is often expensive, so the OS buffers writes to external drives in memory. Before removal, external drives need to be ejected. The eject function forces a sync operation, ensuring all buffered writes to the external drive are committed to disk before its removal. This prevents data loss.
• System Shutdown: After shutdown, there would be no way for the OS to recover buffered writes. The shutdown process therefore flushes and synchronizes all uncommitted disk pages, maintaining data integrity. That's why shutdown takes time.

      These optimizations are grounded in the principle of preserving externally observable consistency. The OS does not guarantee that bytes will be committed when an external drive is pulled out suddenly or when the system loses power. All commitments are delayed as much as possible. Only when there is an externally observable expectation of an action, such as ejecting a drive or performing a clean shutdown, are the bytes committed.

      We will see more examples of such optimizations.

      Paper Idea

      This paper aims to identify all instances where disk commitment is strictly required, enabling the operating system to maximize the delay of physical writes. This approach seeks to optimize performance while ensuring external consistency.

      Design Overview

Externally synchronous I/O (xsync) is defined by the observable behavior of an I/O operation. An I/O operation is considered xsync if its external output is indistinguishable from that of a strictly synchronous I/O operation.

To ensure correctness, xsync I/O must produce the same output as synchronous I/O, maintaining the same causal order of events. The concept of causal order is critical for preserving consistency. For example, suppose the user wrote A and then, after a successful flush, wrote B. If a crash happens, then after restart, if B is committed, A must also be committed (A -> B is the causal order).

      User-Centric vs. Application-Centric Consistency

      The paper distinguishes between user-centric and application-centric approaches to consistency. In the application-centric view, a process treats all other processes as external entities. In contrast, the user-centric view considers all process states as internal to the system, with the operating system responsible for maintaining external consistency.

      Re-iterating our first hypothetical example, if a written file is never accessed again, the operating system may defer flushing in-memory buffers. From the application's perspective, the write appears committed. If the application itself reads the file again, it will receive the bytes from the in-memory buffers. However, from the user's perspective, the write is not truly committed until the operating system deems it necessary. The determination of "necessity" is a key aspect of xsync.

      Batching Commitments

      Xsync does not eliminate disk writes; it optimizes them by delaying commits to maximize performance. Modifications are grouped and journaled into a buffer, then atomically applied to the file system. Atomicity can be achieved through various mechanisms, such as checkpointing in LFS.

      Batching offers significant advantages:

      • Amortized Costs: It reduces the overhead of individual commits.
      • Elimination of Transient Writes: Operations on temporary files that are created and deleted before commitment are avoided, minimizing unnecessary disk I/O.

      This approach balances performance optimization with the requirement for externally observable consistency.

      Commit Dependencies

      A commit dependency represents a many-to-many causal relationship between a process's outputs and uncommitted file system modifications. A process with one or more commit dependencies is considered "uncommitted". Uncommitted output is not visible to external observers.

      Outputs and Output-Triggered Commits

An output is any externally visible event. A commit triggered by the release of an output is termed an output-triggered commit. Several kinds of outputs can trigger commits.

      1. Screen Output

      echo hello > file.txt
      cat file.txt

      Executing echo followed by cat necessitates the file's contents being displayed on the screen. The operating system must commit the output to satisfy the user's expectation of consistent data visibility (causal relationship). This is a user-centric approach, where the user expects the same data to be available after a system restart, even after a catastrophic failure.

      2. Network Packets

Network packets are not allowed to be sent until all associated dependencies are committed, because the receiving process cannot be determined beforehand (it may even be within the same host).

      3. ioctl

Roughly 90% of operating system code consists of device drivers.

      The ioctl system call provides a generic interface for controlling device-specific operations. Device drivers commonly implement ioctl, with each implementation tailored to the hardware. Consequently, ioctl behavior varies significantly; for instance, an ioctl on a network card might trigger network packet transmission.

      Due to the device-specific nature of ioctl and the inability to predict accessed data, all dependencies must be committed before ioctl operations.
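
To illustrate the interface itself, a classic ioctl queries the terminal driver for its window size. A minimal C sketch:

#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void) {
    struct winsize ws;
    /* device-specific request: ask the terminal driver for its window size */
    if (ioctl(STDOUT_FILENO, TIOCGWINSZ, &ws) == -1) return 1;
    printf("%d rows x %d cols\n", ws.ws_row, ws.ws_col);
    return 0;
}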

      4. /proc Access

      The /proc file system, which provides process information through virtual files, requires all dependencies to be committed before its data is accessed.

      5. Reboot

All dependencies need to be committed before shutdown.

      6. sync

      All dependencies are committed upon sync.

      Note: Apart from these output events, there are periodic commits from the operating system.

      Challenges

Since commits happen only when their corresponding output is released, this approach introduces challenges:
• Recovery Complexity: Managing recovery from catastrophic failures becomes complex. For example, if an application commits data and attempts to send a network packet (an output), but the device fails to commit before the packet is sent, conveying the failure to the process after the initial acknowledgment is problematic.
• Cross-Filesystem Atomicity: Achieving atomic commits across multiple file systems is inherently difficult.

      Tracking Dependencies

      The authors use a Speculator to track commit dependencies:
• When an output needs to be released, the speculator first commits all of the output's commit dependencies (the buffered writes).
      • Multiple outputs and commit dependencies can originate from a single process.
      • A process can be marked "committed" after all its commit dependencies are resolved.

      Inheriting Dependencies Across Processes

It is also possible for different processes to inherit each other's commit dependencies, as follows.

      Shared Memory

      When two processes or threads access shared memory, a bidirectional dependency is established.

      If one process has write permission and another has read permission, the reader inherits the writer's dependencies (unidirectional). While more granular dependency tracking was possible, the authors prioritized simplicity.

      Pipes

      When process P1 pipes its output to process P2, P2 inherits P1's dependencies (unidirectional).

      Unix Sockets

      The process that receives the message inherits dependencies from the sender.

      Signals

      Dependency inheritance via signals is unidirectional, similar to shared memory.

      Fork

      The child process inherits all dependencies from the parent process, as they share the same virtual memory map.
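
Conceptually, all of these inheritance rules amount to merging dependency sets. A toy C sketch with invented structures (the real bookkeeping lives inside the kernel's Speculator):

/* Toy sketch only; names and structures are invented. */
struct commit_dep;                    /* an uncommitted FS modification */

struct proc_state {
    struct commit_dep *deps[64];      /* outstanding commit dependencies
                                         (fixed-size for simplicity) */
    int ndeps;
};

/* The receiver inherits the sender's dependencies (pipes, UDS, signals). */
void inherit_deps(struct proc_state *recv, const struct proc_state *send) {
    for (int i = 0; i < send->ndeps && recv->ndeps < 64; i++)
        recv->deps[recv->ndeps++] = send->deps[i];
}

/* A process is committed once every dependency has been flushed. */
int is_committed(const struct proc_state *p) {
    return p->ndeps == 0;
}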

      Examples

      Example 1

      Consider two processes sharing memory. If one process writes to and closes a file, immediate disk commitment is not strictly necessary. As long as the other process accesses the file through shared memory or the operating system's in-memory buffers, external consistency is maintained.

      However, if a process commits a file and then transmits a network packet, immediate disk commitment becomes imperative. The network packet could initiate file access by a remote process (which could be within the system itself), requiring commit.

      Example 2




      Suppose a client sends a request to a web server running on a machine with its own database (running locally). As a result of the request, the server is supposed to commit multiple transactions. All these transactions are committed by the database but not immediately flushed to storage by the operating system. They are flushed to storage only when the server sends back the reply packets to the client (the output). As a result, multiple transaction logs can be batched together for commitment.

      Xsync in Journaling File System

      As previously stated, journaling file systems maintain a journal that records modifications.

      Xsync also maintains an undo log for each running process. Each entry in the undo log corresponds to an output. The outputs themselves have many-to-many commit dependency relationships with uncommitted file system modifications.

      For instance, when a file is created, the modification is recorded in the journal, and a corresponding undo log entry is added to delete that file.

      When a process generates output (e.g., screen output or a network packet), all commit dependencies associated with that output are flushed. Subsequently, corresponding entries from the process's undo log are removed.
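
A toy sketch of this redo/undo pairing for the file-creation example (structures invented for illustration):

/* Toy sketch: each speculative modification pairs a redo (journal) record
   with a compensating undo record. */
struct spec_op {
    const char *redo;   /* journal entry, e.g., "create file F"       */
    const char *undo;   /* matching undo entry, e.g., "delete file F" */
};

/* On output release: flush the redo entries of all commit dependencies
   to disk, then drop the matching undo entries.
   On failed speculation: apply the undo entries in reverse order to
   roll the file system view back to its pre-speculation state. */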

      Journal entries are grouped into two types: committing and active. Committing entries form a prefix of the journal, indicating that they have been or will be durably written, while active entries represent pending modifications.

      Durability Guarantees

Consider a scenario where D + {A, B, C} represent journal entries in that order. In journaled mode, if entry C is committed, then entries A and B are guaranteed to be committed as well. However, this guarantee does not hold in ordered mode. In ordered mode, data blocks are written directly to the disk (instead of the journal), which can result in out-of-order commits to the disk.

      Rethinking Sync

      The sync command serves as a crucial commit point, ensuring all buffered writes are flushed, encompassing both operating system and drive caches. Consequently, sync is a resource-intensive operation. In xsyncfs, extensive dependency inheritance—resulting from numerous buffered writes—can significantly prolong sync completion.

      The architecture of xsyncfs involves the following layers:

      • Virtual File System (VFS): The OS's VFS layer initiates calls to the speculator.
      • Speculator: The speculator acts as an intermediary, managing commit dependencies. Upon receiving a sync command, the speculator commits all pending dependencies. Otherwise, buffered writes remain in their current state.
      • xsyncfs: xsyncfs operates transparently to application developers, functioning as a layer between the OS and the underlying file system. This design allows xsyncfs to be compatible with various file systems.

      Evaluation

      Systems Under Comparisons

      Figure 3 illustrates different levels of data durability:

• Asynchronous: Offers no durability*. Data may be lost even after an fsync call due to drive caching.
      • Synchronous: Also lacks durability*, same as asynchronous behavior.
      • Synchronous with Write Barriers: Provides durability*. Write barriers enforce a strong synchronization, ensuring data is committed.
      • External Synchrony: Achieves full durability*.

* - The key to understanding these distinctions lies in the definition of durability: if a remote computer logs a write operation, but the test computer lacks the written data, durability is considered failed. Therefore, "durable" here signifies that data is committed to storage before the remote confirmation packet is sent, which is indeed the case in external synchrony.

      Postmark Benchmark (I/O-Bound)

      The Postmark benchmark, an I/O-intensive workload, involves creating 10,000 files, executing 10,000 transactions (reads, writes, creates, deletes), and subsequently removing all files.

      • As depicted in Figure 4 (logarithmic scale), xsyncfs exhibits a 100x performance improvement over sync with barriers, attributed to the substantial overhead of write barriers.
      • xsyncfs also outperforms sync by 10x, as drive cache limitations (e.g., 2MB) can lead to rapid blocking.
      • While xsyncfs is slightly slower than asynchronous I/O, it provides the critical guarantee that all externally observable outputs are committed.

      Apache Benchmark (CPU-Bound)

      • Figure 5 illustrates that xsyncfs performance approaches that of asynchronous I/O.
      • Furthermore, its performance is comparable to RAMFS (in-memory filesystem), indicating minimal I/O blocking for CPU-bound processes.

      MySQL Benchmark

      MySQL employs its own group commit strategy, necessitating the disabling of drive cache for data durability.
      • xsyncfs demonstrates superior performance compared to ext3 with barriers.
      • The performance gap between xsyncfs and other approaches narrows as transaction parallelism increases.

      Specweb99 Benchmark (Network-Intensive)

      Specweb99, a network-intensive benchmark, generates numerous synchronization operations.

      • xsyncfs maintains performance with only an 8% overhead compared to asynchronous I/O.
• xsyncfs outperforms barrier-based sync.

      Paper Review

      This is one of the most interesting papers that I have read. Needless to say, it awed me. Initially, I was very surprised to learn that something like this could even work. But it's true. It does work and has been implemented. It forced me to start thinking about user-observable behavior whenever I consider optimizations in distributed systems.

The core principle, "If something was done and I didn't see its result, is it done or not?", encapsulates the paper's central argument: avoid operations that lack observable consequences. This user-centric perspective is crucial when considering optimizations in systems.

      While requiring careful reading to fully grasp its intricacies, this paper offers significant insights into I/O management and system design. I highly recommend it to anyone interested in system optimization.


      Paper Insights #24 - Spanner: Google's Globally-Distributed Database

      This landmark paper, presented at ODSI '12, has become one of Google's most significant contributions to distributed computing. It didn't solve the long-standing core problem of scalability of 2PC in distributed systems, rather, it introduced  TrueTime  that revolutionized system assumptions. Authored by J.C. Corbett , with contributions from pioneers like Jeff Dean and Sanjay Ghemawat , this paper effectively ended my exploration of distributed SQL databases. It represents the leading edge of the field. Paper Link I would highly recommend reading the following before jumping into this article: 1.  Practical Uses of Synchronized Clocks in Distributed Systems where I introduced why clock synchronization is necessary but not sufficient for external consistency. 2.  A New Presumed Commit Optimization for Two Phase Commit where I introduced two-phase commits (2PC) and how it is solved in a distributed system. 3.  Amazon Aurora: Design Considerations for High Th...