This paper, presented at the highly regarded OSDI 2006 conference, represents a significant contribution to the field of computer systems. Authored by Nightingale et al. (University of Michigan), it distinguishes itself by providing empirical evidence of a magnitude of file-system performance improvement that had previously eluded researchers. Beyond this quantitative achievement, the paper also introduces a paradigm shift by altering fundamental operating system behavior based on external observation. We are again back to discussing what external consistency means and how one can be lazy about ensuring it.
Must read: The Design and Implementation of a Log-Structured File System, where I introduced several basic file system concepts.
Let's begin with some basic concepts.
Inter Process Communication (IPC) in Linux
Pipes
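As a quick refresher, here is a minimal sketch of a pipe: a unidirectional byte stream, with the parent writing and a forked child reading (error handling trimmed for brevity):

```c
/* A pipe sketch: the parent writes, the forked child reads. */
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fds[2];                        /* fds[0]: read end, fds[1]: write end */
    if (pipe(fds) < 0) return 1;

    if (fork() == 0) {                 /* child: reads from the pipe */
        close(fds[1]);
        char buf[16] = {0};
        read(fds[0], buf, sizeof buf - 1);
        printf("child read: %s\n", buf);
        _exit(0);
    }

    close(fds[0]);                     /* parent: writes into the pipe */
    write(fds[1], "hello", 5);
    close(fds[1]);
    wait(NULL);                        /* reap the child */
    return 0;
}
```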
Unix Domain Sockets (UDS)
- SOCK_STREAM (reliable, connection-oriented, similar to TCP)
- SOCK_DGRAM (connectionless, datagram-based, similar to UDP)
- SOCK_SEQPACKET (connection-oriented, sequenced packets, similar to SCTP)
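Below is a minimal sketch of a connected SOCK_STREAM pair built with socketpair(2); a real server would instead use socket(), bind(), and listen() on a filesystem path:

```c
/* A connected SOCK_STREAM Unix-domain pair via socketpair(2). */
#include <stdio.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0) return 1;

    if (fork() == 0) {                 /* child: one end of the pair */
        close(fds[0]);
        write(fds[1], "ping", 4);
        _exit(0);
    }

    close(fds[1]);                     /* parent: the other end */
    char buf[8] = {0};
    read(fds[0], buf, sizeof buf - 1);
    printf("parent received: %s\n", buf);
    wait(NULL);
    return 0;
}
```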
Signals
- SIGINT (Ctrl+C): Interrupt signal.
- SIGTERM: Graceful termination request.
- SIGKILL: Forceful termination.
- SIGSEGV: Segmentation fault (invalid memory access).
- SIGHUP: Hangup signal (often used for configuration reloading).
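As a small illustration, here is a sketch of installing a SIGHUP handler with sigaction(2), the usual pattern behind configuration reloading; the "reload" itself is just a stand-in print:

```c
/* Installing a SIGHUP handler with sigaction(2). */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t reload = 0;

static void on_sighup(int sig) { (void)sig; reload = 1; }

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sighup;         /* run on_sighup() on SIGHUP */
    sigaction(SIGHUP, &sa, NULL);

    printf("pid %d: send me SIGHUP (kill -HUP)\n", getpid());
    for (;;) {
        pause();                       /* sleep until any signal arrives */
        if (reload) {
            reload = 0;
            puts("reloading configuration");  /* stand-in for a real reload */
        }
    }
}
```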
Shared Memory
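A minimal POSIX shared-memory sketch: one region mapped by a parent and its forked child. The object name "/demo_shm" is arbitrary, and older glibc needs -lrt at link time:

```c
/* POSIX shared memory: one region, two processes. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
    if (fd < 0) return 1;
    ftruncate(fd, 4096);               /* size the region */

    char *mem = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) return 1;

    if (fork() == 0) {                 /* child writes into the region */
        strcpy(mem, "hello from child");
        _exit(0);
    }

    wait(NULL);                        /* parent reads after the child exits */
    printf("parent sees: %s\n", mem);

    munmap(mem, 4096);
    shm_unlink("/demo_shm");
    return 0;
}
```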
Speculator
The speculator mechanism originated in distributed file systems.
Suppose a read is performed by a client process:
- Upon a Read() call, a child process is forked. It retrieves data from the local cache and continues execution from the point where the parent left off.
- The parent process waits for the server to complete the read operation.
- If the server's response differs from the cached data, the child process is terminated, and the parent continues.
- If the responses match, the parent process is terminated.
This optimization leverages the assumption that cached data is typically accurate, minimizing latency. In cases of discrepancies, the child process is terminated to prevent propagation of incorrect values.
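The following is a toy user-space model of this fork-based speculation. The real Speculator lives inside the kernel; fetch_from_cache() and fetch_from_server() here are hypothetical stand-ins for the cache and server paths:

```c
/* A toy model of speculative reads via fork(). */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

static void fetch_from_cache(char *buf, size_t n)  { memset(buf, 'A', n); }
static void fetch_from_server(char *buf, size_t n) { sleep(1); memset(buf, 'A', n); }

int main(void)
{
    char cached[16], actual[16];
    fetch_from_cache(cached, sizeof cached);       /* fast local guess */

    pid_t child = fork();
    if (child == 0) {
        /* Child: continues execution immediately with the cached data. */
        printf("child: running ahead with cached data\n");
        sleep(2);                                  /* stand-in for useful work */
        printf("child: done\n");
        _exit(0);
    }

    /* Parent: blocks on the authoritative server response. */
    fetch_from_server(actual, sizeof actual);

    if (memcmp(cached, actual, sizeof cached) != 0) {
        kill(child, SIGKILL);                      /* misprediction: squash the child */
        waitpid(child, NULL, 0);
        printf("parent: cache was stale, continuing with server data\n");
    } else {
        waitpid(child, NULL, 0);                   /* prediction correct: child wins */
    }
    return 0;
}
```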
Newer versions of the speculator have expanded its capabilities to include process checkpointing and rollback, analogous to branch prediction in CPUs. This allows for speculative execution based on predicted outcomes. If the prediction is incorrect, the process can be rolled back to a previous checkpoint, facilitating recovery.
Speculation and Dependency Tracking
Speculator maintains a record of processes and kernel objects that depend on the success or failure of a speculated execution. As the processes interact with kernel objects (e.g., UDS), the speculator tracks causal dependencies.
Example (inter-process communication): If the child process, after a speculative read, sends a message to another process via IPC, the recipient process is also tracked as speculative.
- If the speculative read fails, both processes are terminated or rolled back to a previous state.
The speculator ensures that speculated states are never visible to external observers (e.g., users), maintaining system consistency.
Journaling File System
The hierarchical organization of a file system can be represented as a tree data structure, where file and empty directory nodes constitute the leaf nodes.
Given two temporal states, T1 and T2, of this tree, the delta between them can be serialized into a journal, a log of structural modifications to the file system's data representation.
A journaling file system is built upon a base file system. Changes to the file system are initially logged in memory as a journal. When a checkpoint occurs or a sync operation is initiated, these logged changes are permanently written to the underlying disk. After successful disk commitment, the journal entries are removed.
Popular examples of journaling file systems are NTFS, ext3, ext4, XFS, and JFS.
Idempotency at Work
Journaling file systems rely on the idempotency of file system operations. This allows recorded journal entries to be replayed repeatedly until successful, enabling recovery from crashes at any point. The journal, also known as a write-ahead log (WAL), functions similarly to a transaction log by recording changes before they are committed.
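To make the idempotency argument concrete, here is a toy replay loop. It assumes each journal record is an absolute write ("store value V at block B") rather than a relative update, which is exactly what makes repeated replay harmless:

```c
/* Idempotent journal replay over a toy "disk". */
#include <stdio.h>

struct record { int block; char value; };

static char disk[8];

static void replay(const struct record *log, int n)
{
    for (int i = 0; i < n; i++)
        disk[log[i].block] = log[i].value;   /* absolute write: idempotent */
}

int main(void)
{
    struct record journal[] = { {0, 'a'}, {3, 'b'}, {0, 'c'} };
    replay(journal, 3);
    replay(journal, 3);                  /* replaying after a "crash" is harmless */
    printf("%c %c\n", disk[0], disk[3]); /* prints: c b, both times */
    return 0;
}
```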
Operation Modes
Journaling file systems typically offer multiple modes:
- Ordered mode: Only metadata (inode blocks) is journaled. Data blocks are written directly to disk.
- Journaled mode: Both data and metadata are journaled.
Metadata integrity is crucial because it provides the pointers to data blocks. Data blocks must be written to disk before their corresponding metadata.
Everything is a File
"Everything is a file" is a core concept in Unix-like operating systems, including Linux.
Regular Files
- Text files: These contain plain text data. For example - /etc/passwd, /home/user/document.txt
- Binary files: These contain executable code or other non-text data. For example - /usr/bin/python3, /usr/share/image.png
Directories
Device Files
- Hard drives or storage devices. For example - /dev/sda, /dev/sdb
- Terminals and pseudo-terminals. For example - /dev/tty, /dev/pts/*
- Random number generators. For example - /dev/random, /dev/urandom
- /dev/null - a special file that discards all data written to it, and provides no data when read from.
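The uniformity is easy to demonstrate: the sketch below reads random bytes from /dev/urandom using nothing but the ordinary open/read/close file API:

```c
/* "Everything is a file": reading a device with the plain file API. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    unsigned char buf[4];
    int fd = open("/dev/urandom", O_RDONLY);
    if (fd < 0) return 1;
    read(fd, buf, sizeof buf);          /* the device behaves like any file */
    close(fd);
    printf("%02x %02x %02x %02x\n", buf[0], buf[1], buf[2], buf[3]);
    return 0;
}
```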
Unix Domain Sockets
Procfs and Sysfs
Commit Buffering
Commit Buffer Layers
Page Cache
The operating system buffers file writes in its in-memory page cache; dirty pages are written back to the disk later (see Flushing Data below).
Drive Cache
Hard drives incorporate an internal cache to buffer write operations. This drive cache reduces rotational latency for both writes and reads. However, the cache's limited size (a few MBs) can cause blocking when it's full, as subsequent operations must wait for rotational latency. This cache forms an additional buffering layer.
Once writes are buffered in the drive cache, the actual commit to the disk may happen out of order. For example, given two sequential writes, A followed by B, it is possible that B is written first, followed by A.
Immediate Write Reporting: If a drive uses immediate write reporting, it tells the operating system that a write is done as soon as the data is buffered in its cache. This can significantly improve perceived write performance. However, writes may be lost if there is a power failure. In fact, several hard drives don't even allow immediate write reporting to be disabled.
Battery-Backed Drives: To avoid losing writes in the event of a power failure, some hard drives have built-in battery packs to ensure that all writes in the cache are committed. However, the battery is itself prone to failure.
Flushing Data - Flush()
The Flush() operation, or the implicit flush upon file Close(), transfers data from a process's buffers to the operating system's page cache (in-memory buffers). These pages are marked as "dirty", indicating they are not yet committed to disk. While flushing protects against process crashes, it does not guarantee data persistence in the event of an operating system or power failure. (In C, this is fflush(); it is a library call rather than a true system call.)
Synchronizing File Data - fsync
The fsync system call, when invoked with a file descriptor, instructs the operating system to commit the contents of in-memory buffers to the physical disk. The operating system, in turn, interacts with the file system, which may employ optimizations. Fortunately, most file systems reliably honor fsync, ensuring data is written to disk. fsync provides resilience against operating system failures but is still vulnerable to power outages.
Forcing Disk Commit - sync
Even after fsync, the disk itself maintains a buffer. A power failure can result in data loss from this disk buffer. The sync system call forces the disk to flush its buffer, guaranteeing that all data is physically committed to the storage medium. This provides the highest level of data persistence, mitigating the risk of power failures.
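The three levels line up in a few lines of C. The comments hedge where each guarantee stops, mirroring the discussion above:

```c
/* A minimal sketch of the three flushing levels discussed above. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    FILE *f = fopen("data.txt", "w");
    if (!f) return 1;

    fputs("hello\n", f);
    fflush(f);            /* process buffers -> OS page cache (pages now dirty) */
    fsync(fileno(f));     /* page cache -> drive; survives an OS crash,
                             but the drive cache may still hold the data */
    fclose(f);

    sync();               /* flush all buffered writes system-wide */
    return 0;
}
```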
Consistency in File System
Consider a disk state, D. Three sequential changes, A, B, and C, are made to the disk. In the event of a crash during these writes, the recovered disk state must be a consistent prefix of these changes:
- D + {A}
- D + {A, B}
- D + {A, B, C}
An inconsistent state, such as D + {B, C}, is unacceptable. This prefix guarantee is the minimum consistency requirement for an operating system.
Failure Scenarios
Two primary failure scenarios impact disk consistency:
- Operating System Crash: An OS crash should not lead to file system inconsistency. While data loss is inevitable, the file system's structure must remain valid. This is because the OS can maintain the sequential order of changes.
- Power Failure: Power failures pose a significant risk due to drive-level caching and delayed writes. Drives may report writes as complete before they are physically committed, and the buffered writes may be physically written out of order.
File systems such as Ext3, ReiserFS, and JFS can encounter consistency issues due to these drive-level complexities.
The External Sync
The concept of external consistency is important for file systems. Fundamentally, any computer system aims to maintain external consistency. This principle drives numerous operating system optimizations.
Optimizations Based on Observed Behavior
- Consider the scenario of writing a file. If the operating system can reliably determine that the file will never be accessed again, the commit operation could be omitted! However, if the user shares the file's name with someone else (e.g., via email) who may later want to read the file, the OS would commit the write to storage.
- Let's imagine another scenario where the computer has an AI webcam which can tell whether the user is looking at the screen. The OS could delay displaying the output of a cat command until the user's attention is directed towards the screen, optimizing resource usage without affecting the observed result!
The above are all hypothetical examples. But here are some common examples where writes are buffered by real-world OSes:
- External Drive Ejection (Windows): Serial I/O through USB is often expensive, so the OS buffers writes to external drives in memory. Before removing an external drive, it needs to be ejected. The eject function forces a sync operation, ensuring all buffered writes to the external drive are committed to disk before its removal. This prevents data loss.
- System Shutdown: After shutdown, there would be no way for the OS to recover buffered writes. The shutdown process therefore involves flushing and synchronizing uncommitted disk pages, maintaining data integrity. That's why shutdown takes time.
These optimizations are grounded in the principle of preserving externally observable consistency. The OS does not guarantee that bytes will be committed when an external drive is pulled out suddenly or when the system loses power. All commitments are delayed as much as possible. Only when there is an externally observable expectation of an action, such as ejecting a drive or performing a clean shutdown, are the bytes committed.
We will see more examples of such optimizations.
Paper Idea
This paper aims to identify all instances where disk commitment is strictly required, enabling the operating system to maximize the delay of physical writes. This approach seeks to optimize performance while ensuring external consistency.
Design Overview
External synchronous I/O (xsync) refers to the observable behavior of an I/O operation. An I/O operation is considered xsync if its external output is indistinguishable from that of a strictly synchronous I/O operation.
To ensure correctness, xsync I/O must produce the same output as synchronous I/O, maintaining the same causal order of events. The concept of causal order is critical for preserving consistency. For example, suppose the user wrote A and, after a successful flush, wrote B. If a crash happens, then after restart, if B is committed, A must also be committed (A -> B is the causal order).
User-Centric vs. Application-Centric Consistency
The paper distinguishes between user-centric and application-centric approaches to consistency. In the application-centric view, a process treats all other processes as external entities. In contrast, the user-centric view considers all process states as internal to the system, with the operating system responsible for maintaining external consistency.
Reiterating our first hypothetical example: if a written file is never accessed again, the operating system may defer flushing in-memory buffers. From the application's perspective, the write appears committed. If the application itself reads the file again, it will receive the bytes from the in-memory buffers. However, from the user's perspective, the write is not truly committed until the operating system deems it necessary. The determination of "necessity" is a key aspect of xsync.
Batching Commitments
Xsync does not eliminate disk writes; it optimizes them by delaying commits to maximize performance. Modifications are grouped and journaled into a buffer, then atomically applied to the file system. Atomicity can be achieved through various mechanisms, such as checkpointing in LFS.
Batching offers significant advantages:
- Amortized Costs: It reduces the overhead of individual commits.
- Elimination of Transient Writes: Operations on temporary files that are created and deleted before commitment are avoided, minimizing unnecessary disk I/O.
This approach balances performance optimization with the requirement for externally observable consistency.
Commit Dependencies
Outputs and Output-Triggered Commits
An output is defined as any externally observable event; a commit triggered by releasing an output is termed an output-triggered commit. Several kinds of outputs can trigger commits:
1. Screen Output
An echo followed by a cat necessitates the file's contents being displayed on the screen. The operating system must commit the output to satisfy the user's expectation of consistent data visibility (a causal relationship). This is a user-centric approach, where the user expects the same data to be available after a system restart, even after a catastrophic failure.
2. Network Packets
3. ioctl
The ioctl system call provides a generic interface for controlling device-specific operations. Device drivers commonly implement ioctl, with each implementation tailored to the hardware. Consequently, ioctl behavior varies significantly; for instance, an ioctl on a network card might trigger network packet transmission. Due to the device-specific nature of ioctl and the inability to predict accessed data, all dependencies must be committed before ioctl operations (see the sketch after this list).
4. /proc Access
5. Reboot
6. sync
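Returning to item 3 with a concrete sketch: FIONREAD is a simple ioctl that reports how many bytes are waiting on a descriptor, here a pipe we fill ourselves. Whether a given ioctl constitutes an output depends entirely on the device, which is why xsync must be conservative:

```c
/* A minimal ioctl example: FIONREAD on a pipe. */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    if (pipe(fds) < 0) return 1;
    write(fds[1], "hello", 5);

    int pending = 0;
    if (ioctl(fds[0], FIONREAD, &pending) < 0) return 1;
    printf("%d bytes pending\n", pending);   /* prints: 5 bytes pending */
    return 0;
}
```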
Challenges
- Recovery Complexity: Managing recovery from catastrophic failures becomes complex. For example, if an application commits data and attempts to send a network packet (an output), but the device fails to commit before the packet is sent, conveying the failure to the process after initial acknowledgment is problematic.
- Cross-Filesystem Atomicity: Achieving atomic commits across multiple file systems is inherently difficult.
Tracking Dependencies
- When an output needs to be released, the speculator first commits all of the output's commit dependencies (the corresponding buffered writes).
- Multiple outputs and commit dependencies can originate from a single process.
- A process can be marked "committed" after all its commit dependencies are resolved.
Inheriting Dependencies Across Processes
Shared Memory
When two processes or threads access shared memory, a bidirectional dependency is established.
Pipes
When process P1 pipes its output to process P2, P2 inherits P1's dependencies (unidirectional).
Unix Sockets
The process that receives a message inherits dependencies from the sender.
Signals
Dependency inheritance via signals is unidirectional, similar to pipes: the recipient inherits the sender's dependencies.
Fork
When a process forks, the child inherits all of the parent's dependencies.
Examples
Example 1
Consider two processes sharing memory. If one process writes to and closes a file, immediate disk commitment is not strictly necessary. As long as the other process accesses the file through shared memory or the operating system's in-memory buffers, external consistency is maintained.
Example 2
However, if a process commits a file and then transmits a network packet, immediate disk commitment becomes imperative. The network packet could initiate file access by a remote process (which could be within the system itself), requiring the commit.
Xsync in Journaling File System
As previously stated, journaling file systems maintain a journal that records modifications.
Xsync also maintains an undo log for each running process. Each entry in the undo log corresponds to an uncommitted modification, and outputs have many-to-many commit dependency relationships with these uncommitted file system modifications.
For instance, when a file is created, the modification is recorded in the journal, and a corresponding undo log entry is added to delete that file.
When a process generates output (e.g., screen output or a network packet), all commit dependencies associated with that output are flushed. Subsequently, corresponding entries from the process's undo log are removed.
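A toy illustration of this journal/undo pairing for the file-creation example; the record layout is purely illustrative, not xsyncfs's actual format:

```c
/* Each journaled (redo) modification gets an inverse undo record. */
#include <stdio.h>

struct rec { const char *op; const char *path; };

int main(void)
{
    /* Creating a file appends a redo record to the journal... */
    struct rec redo = { "create", "/tmp/f" };
    /* ...and its inverse to the undo log, so the speculative change
     * can be rolled back if it must never become externally visible. */
    struct rec undo = { "unlink", "/tmp/f" };

    printf("journal: %s %s\n", redo.op, redo.path);
    printf("undo:    %s %s\n", undo.op, undo.path);
    return 0;
}
```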
Journal entries are grouped into two types: committing and active. Committing entries form a prefix of the journal, indicating that they have been or will be durably written, while active entries represent pending modifications.
Durability Guarantees
Consider a scenario where D + {A, B, C} represent journal entries in that order. In journaled mode, if entry C is committed, then entries A and B are guaranteed to be committed as well. However, this guarantee does not hold in ordered mode. In ordered mode, data blocks are written directly to the disk (instead of the journal), which can result in out-of-order commits to the disk.
Rethinking Sync
The sync command serves as a crucial commit point, ensuring all buffered writes are flushed, encompassing both operating system and drive caches. Consequently, sync is a resource-intensive operation. In xsyncfs, extensive dependency inheritance—resulting from numerous buffered writes—can significantly prolong sync completion.
The architecture of xsyncfs involves the following layers:
- Virtual File System (VFS): The OS's VFS layer initiates calls to the speculator.
- Speculator: The speculator acts as an intermediary, managing commit dependencies. Upon receiving a sync command, the speculator commits all pending dependencies. Otherwise, buffered writes remain in their current state.
- xsyncfs: xsyncfs operates transparently to application developers, functioning as a layer between the OS and the underlying file system. This design allows xsyncfs to be compatible with various file systems.
Evaluation
Systems Under Comparison
Figure 3 illustrates different levels of data durability:
- Asynchronous: Offers no durability*. Data may be lost even after an fsync call due to drive caching.
- Synchronous: Also lacks durability*, behaving the same as asynchronous in this respect.
- Synchronous with Write Barriers: Provides durability*. Write barriers enforce a strong synchronization, ensuring data is committed.
- External Synchrony: Achieves full durability*.
* - The key to understanding these distinctions lies in the definition of durability: if a remote computer logs a write operation but the test computer lacks the written data, durability is considered to have failed. Therefore, "durable" here signifies that data is committed to storage before the remote confirmation packet is sent, which is indeed the case in external synchrony.
Postmark Benchmark (I/O-Bound)
The Postmark benchmark, an I/O-intensive workload, involves creating 10,000 files, executing 10,000 transactions (reads, writes, creates, deletes), and subsequently removing all files.
- As depicted in Figure 4 (logarithmic scale), xsyncfs exhibits a 100x performance improvement over sync with barriers, attributed to the substantial overhead of write barriers.
- xsyncfs also outperforms sync by 10x, as drive cache limitations (e.g., 2MB) can lead to rapid blocking.
- While xsyncfs is slightly slower than asynchronous I/O, it provides the critical guarantee that all externally observable outputs are committed.
Apache Benchmark (CPU-Bound)
- Figure 5 illustrates that xsyncfs performance approaches that of asynchronous I/O.
- Furthermore, its performance is comparable to RAMFS (in-memory filesystem), indicating minimal I/O blocking for CPU-bound processes.
MySQL Benchmark
- xsyncfs demonstrates superior performance compared to ext3 with barriers.
- The performance gap between xsyncfs and other approaches narrows as transaction parallelism increases.
Specweb99 Benchmark (Network-Intensive)
Specweb99, a network-intensive benchmark, generates numerous synchronization operations.
- xsyncfs maintains performance with only an 8% overhead compared to asynchronous I/O.
- xsyncfs outperforms barrier-based sync.
Paper Review
This is one of the most interesting papers that I have read. Needless to say, it awed me. Initially, I was very surprised to learn that something like this could even work. But it's true. It does work and has been implemented. It forced me to start thinking about user-observable behavior whenever I consider optimizations in distributed systems.
The core principle, "If something was done, and I didn't see its result, is it done or not?", encapsulates the paper's central argument: avoid operations that lack observable consequences. This user-centric perspective is crucial when considering optimizations in systems.
While requiring careful reading to fully grasp its intricacies, this paper offers significant insights into I/O management and system design. I highly recommend it to anyone interested in system optimization.