This paper, presented at the highly regarded OSDI 2006 conference, represents a significant contribution to the field of computer systems. Authored by Nightingale et al. (University of Michigan), it distinguishes itself by providing empirical evidence of a magnitude of file-system performance improvement that had previously eluded researchers. Beyond this quantitative achievement, the paper also introduces a paradigm shift by altering fundamental operating system behavior based on external observation. We are again back to discussing what external consistency means and how one can be lazy about ensuring it.
Must read: The Design and Implementation of a Log-Structured File System, where I introduced several basic file system concepts.
Let's begin with some basic concepts.
Inter Process Communication (IPC) in Linux
Pipes
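As a quick refresher, here is a minimal sketch of a pipe: a unidirectional byte stream, with the parent writing and a forked child reading (error handling trimmed for brevity):

```c
/* A pipe sketch: the parent writes, the forked child reads. */
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fds[2];                        /* fds[0]: read end, fds[1]: write end */
    if (pipe(fds) < 0) return 1;

    if (fork() == 0) {                 /* child: reads from the pipe */
        close(fds[1]);
        char buf[16] = {0};
        read(fds[0], buf, sizeof buf - 1);
        printf("child read: %s\n", buf);
        _exit(0);
    }

    close(fds[0]);                     /* parent: writes into the pipe */
    write(fds[1], "hello", 5);
    close(fds[1]);
    wait(NULL);                        /* reap the child */
    return 0;
}
```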
Unix Domain Sockets (UDS)
- SOCK_STREAM (reliable, connection-oriented, similar to TCP)
- SOCK_DGRAM (connectionless, datagram-based, similar to UDP)
- SOCK_SEQPACKET (connection-oriented, sequenced packets, similar to SCTP)
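Below is a minimal sketch of a connected SOCK_STREAM pair built with socketpair(2); a real server would instead use socket(), bind(), and listen() on a filesystem path:

```c
/* A connected SOCK_STREAM Unix-domain pair via socketpair(2). */
#include <stdio.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0) return 1;

    if (fork() == 0) {                 /* child: one end of the pair */
        close(fds[0]);
        write(fds[1], "ping", 4);
        _exit(0);
    }

    close(fds[1]);                     /* parent: the other end */
    char buf[8] = {0};
    read(fds[0], buf, sizeof buf - 1);
    printf("parent received: %s\n", buf);
    wait(NULL);
    return 0;
}
```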
Signals
- SIGINT (Ctrl+C): Interrupt signal.
- SIGTERM: Graceful termination request.
- SIGKILL: Forceful termination.
- SIGSEGV: Segmentation fault (invalid memory access).
- SIGHUP: Hangup signal (often used for configuration reloading).
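As a small illustration, here is a sketch of installing a SIGHUP handler with sigaction(2), the usual pattern behind configuration reloading; the "reload" itself is just a stand-in print:

```c
/* Installing a SIGHUP handler with sigaction(2). */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t reload = 0;

static void on_sighup(int sig) { (void)sig; reload = 1; }

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sighup;         /* run on_sighup() on SIGHUP */
    sigaction(SIGHUP, &sa, NULL);

    printf("pid %d: send me SIGHUP (kill -HUP)\n", getpid());
    for (;;) {
        pause();                       /* sleep until any signal arrives */
        if (reload) {
            reload = 0;
            puts("reloading configuration");  /* stand-in for a real reload */
        }
    }
}
```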
Shared Memory
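A minimal POSIX shared-memory sketch: one region mapped by a parent and its forked child. The object name "/demo_shm" is arbitrary, and older glibc needs -lrt at link time:

```c
/* POSIX shared memory: one region, two processes. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
    if (fd < 0) return 1;
    ftruncate(fd, 4096);               /* size the region */

    char *mem = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) return 1;

    if (fork() == 0) {                 /* child writes into the region */
        strcpy(mem, "hello from child");
        _exit(0);
    }

    wait(NULL);                        /* parent reads after the child exits */
    printf("parent sees: %s\n", mem);

    munmap(mem, 4096);
    shm_unlink("/demo_shm");
    return 0;
}
```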
Speculator
The speculator mechanism originated in distributed file systems.
Suppose a read is performed by a client process:
- Upon a Read() call, a child process is forked. It retrieves data from the local cache and continues execution from the point where the parent left off.
- The parent process waits for the server to complete the read operation.
- If the server's response differs from the cached data, the child process is terminated, and the parent continues.
- If the responses match, the parent process is terminated.
This optimization leverages the assumption that cached data is typically accurate, minimizing latency. In cases of discrepancies, the child process is terminated to prevent propagation of incorrect values.
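The following is a toy user-space model of this fork-based speculation. The real Speculator lives inside the kernel; fetch_from_cache() and fetch_from_server() here are hypothetical stand-ins for the cache and server paths:

```c
/* A toy model of speculative reads via fork(). */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

static void fetch_from_cache(char *buf, size_t n)  { memset(buf, 'A', n); }
static void fetch_from_server(char *buf, size_t n) { sleep(1); memset(buf, 'A', n); }

int main(void)
{
    char cached[16], actual[16];
    fetch_from_cache(cached, sizeof cached);       /* fast local guess */

    pid_t child = fork();
    if (child == 0) {
        /* Child: continues execution immediately with the cached data. */
        printf("child: running ahead with cached data\n");
        sleep(2);                                  /* stand-in for useful work */
        printf("child: done\n");
        _exit(0);
    }

    /* Parent: blocks on the authoritative server response. */
    fetch_from_server(actual, sizeof actual);

    if (memcmp(cached, actual, sizeof cached) != 0) {
        kill(child, SIGKILL);                      /* misprediction: squash the child */
        waitpid(child, NULL, 0);
        printf("parent: cache was stale, continuing with server data\n");
    } else {
        waitpid(child, NULL, 0);                   /* prediction correct: child wins */
    }
    return 0;
}
```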
Newer versions of the speculator have expanded its capabilities to include process checkpointing and rollback, analogous to branch prediction in CPUs. This allows for speculative execution based on predicted outcomes. If the prediction is incorrect, the process can be rolled back to a previous checkpoint, facilitating recovery.
Speculation and Dependency Tracking
Speculator maintains a record of processes and kernel objects that depend on the success or failure of a speculated execution. As the processes interact with kernel objects (e.g., UDS), the speculator tracks causal dependencies.
Example (inter-process communication): If the child process, after a speculative read, sends a message to another process via IPC, the recipient process is also tracked as speculative.
- If the speculative read fails, both processes are terminated or rolled back to a previous state.
The speculator ensures that speculated states are never visible to external observers (e.g., users), maintaining system consistency.
Journaling File System
The hierarchical organization of a file system can be represented as a tree data structure, where file and empty directory nodes constitute the leaf nodes.
Given two temporal states, T1 and T2, of this tree, the delta between them can be serialized into a journal, a log of structural modifications to the file system's data representation.
A journaling file system is built upon a base file system. Changes to the file system are initially logged in memory as a journal. When a checkpoint occurs or a sync operation is initiated, these logged changes are permanently written to the underlying disk. After successful disk commitment, the journal entries are removed.
Popular examples of journaling file systems are NTFS, ext3, ext4, XFS, and JFS.
Idempotency at Work
Journaling file systems rely on the idempotency of file system operations. This allows recorded journal entries to be replayed repeatedly until successful, enabling recovery from crashes at any point. The journal, also known as a write-ahead log (WAL), functions similarly to a transaction log by recording changes before they are committed.
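To make the idempotency argument concrete, here is a toy replay loop. It assumes each journal record is an absolute write ("store value V at block B") rather than a relative update, which is exactly what makes repeated replay harmless:

```c
/* Idempotent journal replay over a toy "disk". */
#include <stdio.h>

struct record { int block; char value; };

static char disk[8];

static void replay(const struct record *log, int n)
{
    for (int i = 0; i < n; i++)
        disk[log[i].block] = log[i].value;   /* absolute write: idempotent */
}

int main(void)
{
    struct record journal[] = { {0, 'a'}, {3, 'b'}, {0, 'c'} };
    replay(journal, 3);
    replay(journal, 3);                  /* replaying after a "crash" is harmless */
    printf("%c %c\n", disk[0], disk[3]); /* prints: c b, both times */
    return 0;
}
```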
Operation Modes
Journaling file systems typically offer multiple modes:
- Ordered mode: Only metadata (inode blocks) is journaled. Data blocks are written directly to disk.
- Journaled mode: Both data and metadata are journaled.
Metadata integrity is crucial because it provides the pointers to data blocks. Data blocks must be written to disk before their corresponding metadata.
Everything is a File
"Everything is a file" is a core concept in Unix-like operating systems, including Linux.
Regular Files
- Text files: These contain plain text data. For example - /etc/passwd, /home/user/document.txt
- Binary files: These contain executable code or other non-text data. For example - /usr/bin/python3, /usr/share/image.png
Directories
Device Files
- Hard drives or storage devices. For example - /dev/sda, /dev/sdb
- Terminals and pseudo-terminals. For example - /dev/tty, /dev/pts/*
- Random number generators. For example - /dev/random, /dev/urandom
- /dev/null - a special file that discards all data written to it, and provides no data when read from.
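The uniformity is easy to demonstrate: the sketch below reads random bytes from /dev/urandom using nothing but the ordinary open/read/close file API:

```c
/* "Everything is a file": reading a device with the plain file API. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    unsigned char buf[4];
    int fd = open("/dev/urandom", O_RDONLY);
    if (fd < 0) return 1;
    read(fd, buf, sizeof buf);          /* the device behaves like any file */
    close(fd);
    printf("%02x %02x %02x %02x\n", buf[0], buf[1], buf[2], buf[3]);
    return 0;
}
```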
Unix Domain Sockets
Procfs and Sysfs
Commit Buffering
Commit Buffer Layers
Page Cache
The operating system buffers file writes in its in-memory page cache; dirty pages are written back to the disk later (see Flushing Data below).
Drive Cache
Hard drives incorporate an internal cache to buffer write operations. This drive cache reduces rotational latency for both writes and reads. However, the cache's limited size (a few MBs) can cause blocking when it's full, as subsequent operations must wait for rotational latency. This cache forms an additional buffering layer.
Once writes are buffered in the drive cache, the actual commit to the disk may happen out of order. For example, given two sequential writes, A followed by B, it is possible that B is written first, followed by A.
Immediate Write Reporting: If a drive uses immediate write reporting, it tells the operating system that a write is done as soon as the data is buffered in its cache. This can significantly improve perceived write performance. However, writes may be lost if there is a power failure. In fact, several hard drives don't even allow immediate write reporting to be disabled.
Battery-Backed Drives: To avoid losing writes in the event of a power failure, some hard drives have built-in battery packs to ensure that all writes in the cache are committed. However, the battery is itself prone to failure.
Flushing Data - Flush()
The Flush() operation, or the implicit flush upon file Close(), transfers data from a process's buffers to the operating system's page cache (in-memory buffers). These pages are marked as "dirty", indicating they are not yet committed to disk. While flushing protects against process crashes, it does not guarantee data persistence in the event of an operating system or power failure. (In C, this is fflush(); it is a library call rather than a true system call.)
Synchronizing File Data - fsync
The fsync system call, when invoked with a file descriptor, instructs the operating system to commit the contents of in-memory buffers to the physical disk. The operating system, in turn, interacts with the file system, which may employ optimizations. Fortunately, most file systems reliably honor fsync, ensuring data is written to disk. fsync provides resilience against operating system failures but is still vulnerable to power outages.
Forcing Disk Commit - sync
Even after fsync, the disk itself maintains a buffer. A power failure can result in data loss from this disk buffer. The sync system call forces the disk to flush its buffer, guaranteeing that all data is physically committed to the storage medium. This provides the highest level of data persistence, mitigating the risk of power failures.
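The three levels line up in a few lines of C. The comments hedge where each guarantee stops, mirroring the discussion above:

```c
/* A minimal sketch of the three flushing levels discussed above. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    FILE *f = fopen("data.txt", "w");
    if (!f) return 1;

    fputs("hello\n", f);
    fflush(f);            /* process buffers -> OS page cache (pages now dirty) */
    fsync(fileno(f));     /* page cache -> drive; survives an OS crash,
                             but the drive cache may still hold the data */
    fclose(f);

    sync();               /* flush all buffered writes system-wide */
    return 0;
}
```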
Consistency in File System
Consider a disk state, D. Three sequential changes, A, B, and C, are made to the disk. In the event of a crash during these writes, the recovered disk state must be a consistent prefix of these changes:
- D + {A}
- D + {A, B}
- D + {A, B, C}
An inconsistent state, such as D + {B, C}, is unacceptable. This prefix guarantee is the minimum consistency requirement for an operating system.
Failure Scenarios
Two primary failure scenarios impact disk consistency:
- Operating System Crash: An OS crash should not lead to file system inconsistency. While data loss is inevitable, the file system's structure must remain valid. This is because the OS can maintain the sequential order of changes.
- Power Failure: Power failures pose a significant risk due to drive-level caching and delayed writes. Drives may report writes as complete before they are physically committed, and the buffered writes may be physically written out of order.
File systems such as Ext3, ReiserFS, and JFS can encounter consistency issues due to these drive-level complexities.
The External Sync
The concept of external consistency is important for file systems. Fundamentally, any computer system aims to maintain external consistency. This principle drives numerous operating system optimizations.
Optimizations Based on Observed Behavior
- Consider the scenario of writing a file. If the operating system can reliably determine that the file will never be accessed again, the commit operation could be omitted! However, if the user shares the file's name with someone else (e.g., via email) who may later want to read the file, the OS would commit the write to storage.
- Let's imagine another scenario where the computer has an AI webcam which can tell whether the user is looking at the screen. The OS could delay displaying the output of a cat command until the user's attention is directed towards the screen, optimizing resource usage without affecting the observed result!
The above are all hypothetical examples. But here are some common examples where writes are buffered by real-world OSes:
- External Drive Ejection (Windows): Serial I/O through USB is often expensive, so the OS buffers writes to external drives in memory. Before removing an external drive, it needs to be ejected. The eject function forces a sync operation, ensuring all buffered writes to the external drive are committed to disk before its removal. This prevents data loss.
- System Shutdown: After shutdown, there would be no way for the OS to recover buffered writes. The shutdown process therefore involves flushing and synchronizing uncommitted disk pages, maintaining data integrity. That's why shutdown takes time.
These optimizations are grounded in the principle of preserving externally observable consistency. The OS does not guarantee that bytes will be committed when an external drive is pulled out suddenly or when the system loses power. All commitments are delayed as much as possible. Only when there is an externally observable expectation of an action, such as ejecting a drive or performing a clean shutdown, are the bytes committed.
We will see more examples of such optimizations.
Paper Idea
This paper aims to identify all instances where disk commitment is strictly required, enabling the operating system to maximize the delay of physical writes. This approach seeks to optimize performance while ensuring external consistency.
Design Overview
External synchronous I/O (xsync) refers to the observable behavior of an I/O operation. An I/O operation is considered xsync if its external output is indistinguishable from that of a strictly synchronous I/O operation.
To ensure correctness, xsync I/O must produce the same output as synchronous I/O, maintaining the same causal order of events. The concept of causal order is critical for preserving consistency. For example, suppose the user wrote A and, after a successful flush, wrote B. If a crash happens, then after restart, if B is committed, A must also be committed (A -> B is the causal order).
User-Centric vs. Application-Centric Consistency
The paper distinguishes between user-centric and application-centric approaches to consistency. In the application-centric view, a process treats all other processes as external entities. In contrast, the user-centric view considers all process states as internal to the system, with the operating system responsible for maintaining external consistency.
Reiterating our first hypothetical example: if a written file is never accessed again, the operating system may defer flushing in-memory buffers. From the application's perspective, the write appears committed. If the application itself reads the file again, it will receive the bytes from the in-memory buffers. However, from the user's perspective, the write is not truly committed until the operating system deems it necessary. The determination of "necessity" is a key aspect of xsync.
Batching Commitments
Xsync does not eliminate disk writes; it optimizes them by delaying commits to maximize performance. Modifications are grouped and journaled into a buffer, then atomically applied to the file system. Atomicity can be achieved through various mechanisms, such as checkpointing in LFS.
Batching offers significant advantages:
- Amortized Costs: It reduces the overhead of individual commits.
- Elimination of Transient Writes: Operations on temporary files that are created and deleted before commitment are avoided, minimizing unnecessary disk I/O.
This approach balances performance optimization with the requirement for externally observable consistency.
Commit Dependencies
Outputs and Output-Triggered Commits
An output is defined as any externally observable event; a commit triggered by releasing an output is termed an output-triggered commit. Several kinds of outputs can trigger commits:
1. Screen Output
An echo followed by a cat necessitates the file's contents being displayed on the screen. The operating system must commit the output to satisfy the user's expectation of consistent data visibility (a causal relationship). This is a user-centric approach, where the user expects the same data to be available after a system restart, even after a catastrophic failure.
2. Network Packets
3. ioctl
The ioctl system call provides a generic interface for controlling device-specific operations. Device drivers commonly implement ioctl, with each implementation tailored to the hardware. Consequently, ioctl behavior varies significantly; for instance, an ioctl on a network card might trigger network packet transmission. Due to the device-specific nature of ioctl and the inability to predict accessed data, all dependencies must be committed before ioctl operations (see the sketch after this list).
4. /proc Access
5. Reboot
6. sync
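Returning to item 3 with a concrete sketch: FIONREAD is a simple ioctl that reports how many bytes are waiting on a descriptor, here a pipe we fill ourselves. Whether a given ioctl constitutes an output depends entirely on the device, which is why xsync must be conservative:

```c
/* A minimal ioctl example: FIONREAD on a pipe. */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    if (pipe(fds) < 0) return 1;
    write(fds[1], "hello", 5);

    int pending = 0;
    if (ioctl(fds[0], FIONREAD, &pending) < 0) return 1;
    printf("%d bytes pending\n", pending);   /* prints: 5 bytes pending */
    return 0;
}
```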
Challenges
- Recovery Complexity: Managing recovery from catastrophic failures becomes complex. For example, if an application commits data and attempts to send a network packet (an output), but the device fails to commit before the packet is sent, conveying the failure to the process after initial acknowledgment is problematic.
- Cross-Filesystem Atomicity: Achieving atomic commits across multiple file systems is inherently difficult.
Tracking Dependencies
- When an output needs to be released, the speculator first commits all of the output's commit dependencies (the corresponding buffered writes).
- Multiple outputs and commit dependencies can originate from a single process.
- A process can be marked "committed" after all its commit dependencies are resolved.
Inheriting Dependencies Across Processes
Shared Memory
When two processes or threads access shared memory, a bidirectional dependency is established.
Pipes
When process P1 pipes its output to process P2, P2 inherits P1's dependencies (unidirectional).
Unix Sockets
The process that receives a message inherits dependencies from the sender.
Signals
Dependency inheritance via signals is unidirectional, similar to pipes: the recipient inherits the sender's dependencies.
Fork
When a process forks, the child inherits all of the parent's dependencies.
Examples
Example 1
Consider two processes sharing memory. If one process writes to and closes a file, immediate disk commitment is not strictly necessary. As long as the other process accesses the file through shared memory or the operating system's in-memory buffers, external consistency is maintained.
Example 2
However, if a process commits a file and then transmits a network packet, immediate disk commitment becomes imperative. The network packet could initiate file access by a remote process (which could be within the system itself), requiring the commit.
Xsync in Journaling File System
As previously stated, journaling file systems maintain a journal that records modifications.
Xsync also maintains an undo log for each running process. Each entry in the undo log corresponds to an uncommitted modification, and outputs have many-to-many commit dependency relationships with these uncommitted file system modifications.
For instance, when a file is created, the modification is recorded in the journal, and a corresponding undo log entry is added to delete that file.
When a process generates output (e.g., screen output or a network packet), all commit dependencies associated with that output are flushed. Subsequently, corresponding entries from the process's undo log are removed.
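A toy illustration of this journal/undo pairing for the file-creation example; the record layout is purely illustrative, not xsyncfs's actual format:

```c
/* Each journaled (redo) modification gets an inverse undo record. */
#include <stdio.h>

struct rec { const char *op; const char *path; };

int main(void)
{
    /* Creating a file appends a redo record to the journal... */
    struct rec redo = { "create", "/tmp/f" };
    /* ...and its inverse to the undo log, so the speculative change
     * can be rolled back if it must never become externally visible. */
    struct rec undo = { "unlink", "/tmp/f" };

    printf("journal: %s %s\n", redo.op, redo.path);
    printf("undo:    %s %s\n", undo.op, undo.path);
    return 0;
}
```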
Journal entries are grouped into two types: committing and active. Committing entries form a prefix of the journal, indicating that they have been or will be durably written, while active entries represent pending modifications.
Durability Guarantees
Consider a scenario where D + {A, B, C} represent journal entries in that order. In journaled mode, if entry C is committed, then entries A and B are guaranteed to be committed as well. However, this guarantee does not hold in ordered mode. In ordered mode, data blocks are written directly to the disk (instead of the journal), which can result in out-of-order commits to the disk.
Rethinking Sync
The sync command serves as a crucial commit point, ensuring all buffered writes are flushed, encompassing both operating system and drive caches. Consequently, sync is a resource-intensive operation. In xsyncfs, extensive dependency inheritance—resulting from numerous buffered writes—can significantly prolong sync completion.
The architecture of xsyncfs involves the following layers:
- Virtual File System (VFS): The OS's VFS layer initiates calls to the speculator.
- Speculator: The speculator acts as an intermediary, managing commit dependencies. Upon receiving a sync command, the speculator commits all pending dependencies. Otherwise, buffered writes remain in their current state.
- xsyncfs: xsyncfs operates transparently to application developers, functioning as a layer between the OS and the underlying file system. This design allows xsyncfs to be compatible with various file systems.
Evaluation
Systems Under Comparison
Figure 3 illustrates different levels of data durability:
- Asynchronous: Offers no durability*. Data may be lost even after an fsync call due to drive caching.
- Synchronous: Also lacks durability*, behaving the same as asynchronous in this respect.
- Synchronous with Write Barriers: Provides durability*. Write barriers enforce a strong synchronization, ensuring data is committed.
- External Synchrony: Achieves full durability*.
* - The key to understanding these distinctions lies in the definition of durability: if a remote computer logs a write operation but the test computer lacks the written data, durability is considered to have failed. Therefore, "durable" here signifies that data is committed to storage before the remote confirmation packet is sent, which is indeed the case in external synchrony.
Postmark Benchmark (I/O-Bound)
The Postmark benchmark, an I/O-intensive workload, involves creating 10,000 files, executing 10,000 transactions (reads, writes, creates, deletes), and subsequently removing all files.
- As depicted in Figure 4 (logarithmic scale), xsyncfs exhibits a 100x performance improvement over sync with barriers, attributed to the substantial overhead of write barriers.
- xsyncfs also outperforms sync by 10x, as drive cache limitations (e.g., 2MB) can lead to rapid blocking.
- While xsyncfs is slightly slower than asynchronous I/O, it provides the critical guarantee that all externally observable outputs are committed.
Apache Benchmark (CPU-Bound)
- Figure 5 illustrates that xsyncfs performance approaches that of asynchronous I/O.
- Furthermore, its performance is comparable to RAMFS (in-memory filesystem), indicating minimal I/O blocking for CPU-bound processes.
MySQL Benchmark
- xsyncfs demonstrates superior performance compared to ext3 with barriers.
- The performance gap between xsyncfs and other approaches narrows as transaction parallelism increases.
Specweb99 Benchmark (Network-Intensive)
Specweb99, a network-intensive benchmark, generates numerous synchronization operations.
- xsyncfs maintains performance with only an 8% overhead compared to asynchronous I/O.
- xsyncfs outperforms barrier-based sync.
Paper Review
This is one of the most interesting papers that I have read. Needless to say, it awed me. Initially, I was very surprised to learn that something like this could even work. But it's true. It does work and has been implemented. It forced me to start thinking about user-observable behavior whenever I consider optimizations in distributed systems.
The core principle, "If something was done, and I didn't see its result, is it done or not?", encapsulates the paper's central argument: avoid operations that lack observable consequences. This user-centric perspective is crucial when considering optimizations in systems.
While requiring careful reading to fully grasp its intricacies, this paper offers significant insights into I/O management and system design. I highly recommend it to anyone interested in system optimization.