Paper Insights #5 - The Design and Implementation of a Log-Structured File System

This paper, authored by M. Rosenblum (co-founder of VMware) and J. Ousterhout, explores Log-Structured File Systems (LFS). While LFS was previously considered obsolete, the rise of Solid State Drives (SSDs) has rekindled interest in its core principles, particularly the concept of immutability.

Paper Link

Modern storage systems incorporate principles from log-structured file systems. HP's commercial AutoRAID product, for example, applies log-structured writes on top of its RAID 5 storage level.

Let's begin with some basic concepts.

File

A file is an ordered collection of bytes. Files can reside in various locations, such as on disk, in memory, or across a network. This article focuses on disk-based files.

While the Von Neumann architecture efficiently utilizes processors and memory, files arose from the need for persistence: they provide a mechanism to save the results of a program so they can be retrieved and used later, preserving data across sessions.

Essentially, a file is also an interface exposing a few crucial APIs:

class File {
 public:
    // Opens the file at path in the given mode and returns a handle to it.
    static File* Open(const char* path, const char* mode);

    // Sequential and positional reads; return the number of bytes read.
    int Read(void* buffer, int nbytes);
    int PRead(int pos, void* buffer, int nbytes);

    // Sequential and positional writes; return the number of bytes written.
    int Write(const void* buffer, int nbytes);
    int PWrite(int pos, const void* buffer, int nbytes);

    // Forces buffered writes toward the disk.
    void Flush();

    // Moves the cursor / reports its current position.
    void Seek(int pos);
    int Tell();

    // Closes the file and releases associated resources.
    void Close();
};

When a program interacts with a file, it uses a File object, which maintains an internal cursor to track the current position within the file. The File interface typically provides the following methods (a usage sketch follows the list):

  • Open(path, mode): Opens the file at the given path in the specified mode. Common modes include "r" (read), "w" (write, truncating or creating the file), and "a" (append), with variants like "w+" (read and write, creating the file if it doesn't exist).
  • Read(buffer, nbytes): Reads nbytes from the current position into buffer.
  • PRead(pos, buffer, nbytes): Reads nbytes from the specified pos into buffer.
  • Write(buffer, nbytes): Writes nbytes from buffer at the current position.
  • PWrite(pos, buffer, nbytes): Writes nbytes from buffer at the specified pos.
  • Flush(): Forces all pending writes to disk, committing them. We will see later how much this API has been abused by various file implementations.
  • Seek(pos): Moves the cursor to the specified pos. Subsequent Read() and Write() operations will start from this new position.
  • Tell(): Returns the current cursor position.
  • Close(): Closes the file, releasing associated resources (including the File object itself).
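
For something concrete, the C standard library exposes essentially this interface through fopen/fread/fwrite/fflush/fseek/ftell/fclose. A minimal round trip might look like this (the path is illustrative):

#include <stdio.h>

int main() {
    FILE* f = fopen("/tmp/example.txt", "w+");  // Open: read/write, create if missing
    if (f == NULL) return 1;
    fwrite("hello", 1, 5, f);      // Write at the current cursor (position 0)
    fflush(f);                     // Flush program buffers toward the OS
    fseek(f, 0, SEEK_SET);         // Seek back to the beginning
    char buffer[6] = {0};
    fread(buffer, 1, 5, f);        // Read from the cursor into buffer
    long pos = ftell(f);           // Tell: pos is now 5
    (void) pos;
    fclose(f);                     // Close and release resources
    return 0;
}

POSIX pread() and pwrite() provide the positional PRead/PWrite counterparts on raw file descriptors.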

The File interface is one of the most frequently misused interfaces in Linux. Developers often employ unconventional techniques with it. However, as long as the interface's contract is adhered to, such manipulations are sometimes acceptable for achieving performance gains.

File Inode

Files stored on disk are divided into equally sized blocks. The inode data structure records the locations of these blocks for each file. Inode layouts vary across file system implementations; the most common approach is a table of block pointers, ordered by the file's block numbers. To handle files that span more blocks than this table can hold, inodes typically add indirect blocks (pointers to blocks of further pointers), giving the structure a linked, multi-level form.

Apart from the location of blocks, inodes also store file attributes/metadata such as owner, group, mtime (last modified time), etc.

Note: Inodes themselves are also stored on disk. 

Each inode includes a generation number, a unique identifier for each version of a file. If a file is deleted and recreated, the new file receives a distinct generation number, differentiating it from the previous instance.
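
As a rough illustration, an inode can be pictured as a small record of attributes plus block pointers. The sketch below is loosely modeled on classic Unix-style inodes; the field names and counts are illustrative, not those of any particular file system.

#include <cstdint>

// Illustrative inode: metadata plus direct and indirect block pointers.
struct Inode {
    uint32_t inode_number;     // unique identifier within the file system
    uint32_t generation;       // bumped when the inode number is reused for a new file
    uint32_t owner, group;     // ownership metadata
    uint64_t size_bytes;       // logical file size
    uint64_t mtime;            // last modification time
    uint64_t direct[12];       // disk addresses of the first 12 data blocks
    uint64_t single_indirect;  // address of a block that holds more block addresses
    uint64_t double_indirect;  // address of a block of indirect-block addresses
};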

Directories

Files are typically organized into directories. In Unix-like systems (including BSD and Linux), a directory is itself a special type of file. Crucially, a directory file contains entries (<filename, inode number>) that represent the files and subdirectories within it.

Directory entries, or dentries, can be stored as a straightforward table, but this approach leads to inefficient lookups. To address this, file systems use optimized data structures:

Multi-Level Hash Tables

These structures utilize hash functions for fast lookups. The hash tables are organized into multiple levels, with each level containing a growing number of dentry blocks. A lookup starts at level 0, and if the entry is not found, the search continues to the next level. The bits of the hash value guide the search to the appropriate bucket within each level. 

Unbalanced hash tables can result in O(n) worst-case lookup complexity.

B-trees

Alternatively, B-trees or similar balanced tree structures are used. B-trees are self-balancing, maintaining sorted data and providing logarithmic time complexity for searches, sequential access, insertions, and deletions. This makes them highly efficient for large directories.
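
As an illustration only, a directory's entries can be modeled as a sorted map from filename to inode number; here std::map (a balanced tree) stands in for the on-disk B-tree, giving logarithmic lookups and ordered scans.

#include <cstdint>
#include <map>
#include <string>

// Directory entries: filename -> inode number.
using Directory = std::map<std::string, uint64_t>;

// Returns the inode number for name, or 0 if the entry does not exist
// (0 is used here purely as an illustrative "not found" value).
uint64_t LookupInode(const Directory& dir, const std::string& name) {
    auto it = dir.find(name);  // O(log n) search in the balanced tree
    return it != dir.end() ? it->second : 0;
}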

Reading a File

The root directory's inode resides at a known, fixed location. To access data within the file /dir/file.txt, the following steps are taken (a toy sketch follows the steps):

  1. The root directory's data is read to find the inode location of the dir directory.
  2. The dir directory's inode is retrieved to determine the disk block locations of the directory's contents.
  3. The directory entries within dir are searched to find the inode location of file.txt.
  4. The file.txt inode is retrieved to determine the disk block locations of the file's data.
  5. The data blocks for file.txt are fetched.
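
The walk above can be modeled with a toy, fully in-memory structure; real file systems perform the same lookups against on-disk blocks, and the names below are illustrative.

#include <cstdint>
#include <map>
#include <sstream>
#include <string>

// Toy model: an inode is either a directory (with dentries) or a file (with data).
struct ToyInode {
    std::map<std::string, uint64_t> entries;  // dentries: name -> inode number
    std::string data;                         // file contents (for regular files)
};

// inode number -> inode; inode 1 plays the role of the root directory.
using InodeTable = std::map<uint64_t, ToyInode>;

std::string ReadFile(const InodeTable& inodes, const std::string& path) {
    uint64_t cur = 1;                              // the root inode is at a known location
    std::stringstream components(path);
    std::string name;
    while (std::getline(components, name, '/')) {
        if (name.empty()) continue;                // skip the leading '/'
        cur = inodes.at(cur).entries.at(name);     // dentry lookup, then follow the inode
    }
    return inodes.at(cur).data;                    // finally, fetch the file's data
}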

Writing to a File

When a file is written, several updates must be persisted to disk:

  1. The file's data is written to allocated file blocks (possibly new ones).
  2. The file's inode is updated to reflect any new data block locations and other file attributes (e.g., size, mtime, etc).
  3. If the file is new:
    1. A directory entry for the new file is created within its parent directory block.
    2. The parent directory's inode is updated. Creating a new file also updates the parent directory's modification time (mtime). 

Storage Media

Hard disk drives (HDDs) are a prevalent storage solution, offering substantial data capacity at a lower cost than alternatives such as solid-state drives (SSDs). Their operation relies on rotating magnetic platters and a movable read/write head that seeks to specific locations on the disk.

This article will concentrate on the characteristics and operation of HDDs.

Recommended Read: F2FS: A New File System for Flash Storage to learn more about how SSDs and their file systems work.

File System

A file system defines how files are organized and managed on storage media, specifically how file blocks are arranged on the disk.

Each file system has a superblock at a fixed location on disk that contains static configuration information about the file system including the type. A very basic file system could store the root inode directly within the superblock. From there, the file system could navigate the entire directory tree to locate files.

Storage devices commonly support partitioning, a process that creates multiple logical volumes on a single physical disk. Each partition can host a distinct file system and utilize a defined subset of the disk's capacity. The partition layout is recorded in a partition table (e.g., MBR or GPT) at the start of the disk, and each partition carries its own superblock.

Several types of file systems exist, including:

  • FAT (File Allocation Table)
  • NTFS (New Technology File System) - a journaling file system
  • ext4 (fourth extended file system) - another journaling file system

FAT File System

The File Allocation Table (FAT) is a relatively simple file system.

Disk Organization

A FAT-formatted disk is organized into three primary regions:

  1. FAT Region: This region stores the File Allocation Tables.
  2. Root Directory Region: This region holds the entries for the root directory.
  3. Data Region: This region contains the data blocks (clusters) for files and directories.

File Allocation Table

Files and directories are divided into fixed-size units called clusters. Clusters belonging to a file are linked together in a chain. The FAT itself is a table that contains entries for each cluster. Each entry indicates one of the following:

  • The cluster number of the next cluster in the chain.
  • A special End-of-Cluster (EOC) marker, indicating the end of the chain.
  • A marker indicating a bad cluster.
  • A zero, signifying an unused cluster.

The FAT plays a role similar to that of inodes, residing in a fixed, dedicated region on the disk; a sketch of walking a cluster chain follows.
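
The sketch below walks such a chain, collecting the clusters that belong to one file. The constants are illustrative (FAT32-style markers), and bad-cluster handling is omitted.

#include <cstdint>
#include <vector>

constexpr uint32_t kFreeCluster = 0;           // unused cluster
constexpr uint32_t kEndOfChain = 0x0FFFFFF8;   // values at or above this mark end-of-chain

// Follows the FAT from first_cluster, returning every cluster in the file's chain.
std::vector<uint32_t> ClusterChain(const std::vector<uint32_t>& fat, uint32_t first_cluster) {
    std::vector<uint32_t> chain;
    uint32_t c = first_cluster;
    while (c != kFreeCluster && c < kEndOfChain && c < fat.size()) {
        chain.push_back(c);  // this cluster belongs to the file
        c = fat[c];          // the FAT entry names the next cluster (or an EOC marker)
    }
    return chain;
}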

File Writes

It's important to note that clusters for a file or directory are not necessarily contiguous. The clusters for a single file can be scattered across the disk. As a result, write operations in the FAT file system are generally random. This is because file data blocks, directory blocks, and FAT entries can be located at disparate positions on the disk.

FAT Variants

FAT has evolved through several variants:

  • FAT12: Uses 12-bit cluster entries, supporting volumes of limited size (e.g., floppy disks).
  • FAT16: Uses 16-bit cluster entries, allowing for larger volumes.
  • FAT32: Uses 32-bit cluster entries, supporting significantly larger volumes and files.

Usage

The FAT file system, particularly FAT32, is widely used on:

  • Removable storage devices (USB flash drives, SD cards) due to its broad compatibility.
  • Embedded systems and devices with limited resources.

Journaling File System

Important note: Journaling is not a type of file system but a feature of a file system.

A journaling file system employs a journal (or log) in addition to the main file system. This journal records pending updates, which greatly aids in recovery after a system crash or power outage.

Common journaling file systems include ext4, ext3, JFS, Btrfs, NTFS, and HFS+.

Recommended Read: Rethink the Sync to learn more about journaling file systems.

Role of Operating Systems

The operating system's primary role in file management is to interact with the file system. This interaction is facilitated by a virtual file system interface, which acts as an abstraction layer. The file system itself provides the concrete implementation of this interface, and all client calls are routed through it.

Everything is a File

In Unix-like systems, the concept that everything is a file holds a significant place. This principle extends to terminals as well.

Terminals, along with other hardware devices, are represented as special files known as device files. These device files reside in the /dev directory. This means that the operating system interacts with terminals using the same file I/O operations (like read and write) that it uses for regular files.

Terminals are specifically classified as character devices because they handle data character by character. This is in contrast to block devices (like hard drives) that handle data in blocks.

When you interact with a terminal, it serves as:

  • Standard input (stdin): Where the system receives input from the user.
  • Standard output (stdout): Where the system displays output to the user.
  • Standard error (stderr): Where the system displays error messages.

File Descriptor Table

When a process opens a file, it receives a file handle. This file handle, whose implementation is language-specific, contains a file descriptor.

The file descriptor is an integer that serves as an index into the process's file descriptor table. Each process maintains its own independent file descriptor table.

Unix-like systems define three standard file descriptors:

  • 0: stdin
  • 1: stdout
  • 2: stderr

Other file descriptors within the table may reference file inodes located on disk.
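
A tiny POSIX example of the standard descriptors in action; all three are ordinary file descriptors that happen to point at the terminal:

#include <unistd.h>

int main() {
    char buf[64];
    write(1, "name? ", 6);                    // fd 1: standard output
    ssize_t n = read(0, buf, sizeof(buf));    // fd 0: standard input
    if (n <= 0) {
        write(2, "no input\n", 9);            // fd 2: standard error
        return 1;
    }
    write(1, buf, (size_t) n);                // echo the reply back to the terminal
    return 0;
}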

File & File System Commands

Beyond standard file operations like Open, Read, Write, Flush, and Close, and file system operations such as MkDir and Create, operating systems provide utilities for maintenance and data integrity.

fsck

fsck is a utility used in Unix-like operating systems (e.g., Linux) to examine and repair file system inconsistencies. It scans for errors, including corrupted data structures, bad blocks, and incorrect metadata. Its purpose is to ensure file system integrity and stability.

fsync

Simply flushing a file (e.g., using a Flush() operation) may not guarantee immediate persistence to disk. Operating systems often optimize performance by initially writing data to in-memory buffers. The fsync system call tells the operating system to write these in-memory buffers for a given file to the physical storage device.

sync

sync is similar to fsync but operates across all files. Additionally, the OS will also instruct the disk buffers (described next) to be flushed. This action ensures that all pending write operations are truly committed to disk.

Log-Structured File System

Log-structured file systems (LFS) take a different approach, essentially treating the entire file system as a journal. LFS was developed in the 1990s, a time when:

  • Disk technology was improving in cost and capacity, but not proportionally in transfer bandwidth and access time. Disk seek time, in particular, was a major performance bottleneck. Hard drives rely on a physical read/write head, and the disk must rotate to bring the target sector under the head. Numerous small writes can result in excessive disk seeks, significantly impacting performance. Because the platters rotate in one direction, missing the target sector means waiting nearly a full rotation for it to come back under the head.

  • To mitigate the cost of disk seeks, disks often included large caches to buffer file writes, enabling more efficient block writes. However, buffered data is vulnerable to loss in the event of a system crash. The Flush() operation is typically used to force the buffered data to disk.

Disk buffers are just one example of buffering, which happens at several levels (a short sketch follows this list):

  • Program Buffer: The program itself uses a buffer in its memory. Write() operations initially store data in this buffer. Flush() typically transfers data from the program buffer to the OS buffer.
  • OS Buffer: The operating system also maintains a buffer to further optimize disk writes. Files often provide a Fsync() API to flush data from the OS buffer to the disk.
  • Disk Buffer: As mentioned earlier, the disk itself has a buffer. sync command triggers a disk-level synchronization command to ensure all buffered writes are committed to the physical disk.
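
A concrete illustration of the first two levels, plus the hand-off to the third, using the C standard library and POSIX (the path is illustrative):

#include <stdio.h>
#include <unistd.h>

int main() {
    FILE* f = fopen("/tmp/example.log", "w");
    if (f == NULL) return 1;
    fputs("committed?\n", f);  // 1. data sits in the program (stdio) buffer
    fflush(f);                 // 2. data moves to the OS buffer (page cache)
    fsync(fileno(f));          // 3. the OS is asked to push it to the storage device;
                               //    the disk's own cache is the remaining layer,
                               //    flushed via sync / device flush commands
    fclose(f);
    return 0;
}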

LFS exemplifies how the standard file contract can be subtly violated. Closing a file in a program does not guarantee that the data is immediately and permanently written to the disk.

A Note on External Consistency

We've hammered on external consistency in my database posts, and it's just as relevant to file systems. In this world, external consistency (or external sync) boils down to making absolutely sure those bytes are on the disk before anything else happens.

Think about Flush(). You might expect data to hit the disk right away, but the OS is often playing a smarter game. It might hold off. But then, say you fire off an email. Believe it or not, the OS might decide now's the time to flush that file data before sending your message. No, I'm not kidding! The OS pulls all sorts of optimization tricks.

Recommended Read: Rethink the Sync to learn more about how external consistency is ensured by operating systems.

Random Writes to Sequential Writes

Traditional file writing methods often require multiple disk seeks. For example, the paper notes that a typical Unix file system (FFS) performs at least five disk I/Os, each preceded by a seek, to create a new small file. This is because writes are often random, necessitating multiple physical movements of the disk head.

A key optimization strategy is to convert these random writes into sequential ones. The core idea is to buffer a series of write operations and then commit them to disk in a single, contiguous write. This sequential write requires only one disk seek, a significant improvement. This is precisely the approach taken by LFS.

Segmentation

The disk is organized into segments, and each segment is written sequentially from beginning to end. Segment size is around 512KB to 1 MB. Before a segment can be reused, any live data it contains must be copied elsewhere. Therefore, a segment effectively acts as a write-ahead log, supporting sequential writes. 

A segment is composed of blocks, which are the smallest units of data read from and written to the disk. Block sizes typically range from 512 B to 4 KB. The file blocks, directory blocks, inodes, and other new data structures are logged as blocks on the segments.

We will explore how segments are created in more detail shortly.
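
A rough in-memory sketch of segment buffering: blocks accumulate in memory, and a full segment is flushed to disk with a single sequential write. Sizes and names are illustrative.

#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr size_t kBlockSize = 4096;        // illustrative block size (4 KB)
constexpr size_t kBlocksPerSegment = 256;  // 256 * 4 KB = 1 MB segment

struct Block {
    std::array<uint8_t, kBlockSize> bytes{};
};

// Blocks are appended in memory; once Full(), the caller writes segment.blocks
// to disk with one sequential write and starts a fresh segment.
struct Segment {
    std::vector<Block> blocks;

    bool Full() const { return blocks.size() >= kBlocksPerSegment; }

    void Append(const void* data, size_t len) {
        Block b;
        std::memcpy(b.bytes.data(), data, len < kBlockSize ? len : kBlockSize);
        blocks.push_back(b);
    }
};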

Writing to a File - A Straw Man

When a file is written:

  1. The file's data is written to allocated file blocks (possibly new ones), which are then buffered.
  2. The file's inode is updated to reflect the updated block locations within the log. This updated inode is also buffered.
  3. If the file is new:
    1. The directory block is updated to include an entry for the new file and buffered.
    2. The directory's inode is updated to reflect the new directory block location and buffered.

All these buffered updates are then written sequentially in one shot. This write operation translates to a small number of sequential disk writes, depending on how many segments are involved.

However, this initial approach presents a key challenge. Because file and directory inodes are now written to new locations within the log, it becomes necessary to efficiently locate them when needed. How to do that?

Inode Map

Log-structured file systems require new data structures to manage the log-structured writes. A crucial component is the inode map, which tracks the location of inodes within the log. 

Whenever an inode is rewritten to a new location, its corresponding entry in the inode map is updated. The inode map is designed to be compact enough that at least its active portions can reside in memory, avoiding disk lookups. The inode map uses inode numbers (unique file identifiers) as keys and the on-disk locations of inodes as values.

With the inode map in place, directory blocks are also modified. They now contain entries of <filename, inode number>. The root directory's inode is assigned a fixed inode number.

Note that the inode map itself is logged to the segments as blocks. The fact that the inode map is itself written to the log creates a new problem: how do we find the inode map's location? We will see later how checkpoint regions help.
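
A minimal sketch of the inode map as an in-memory index; DiskAddr here is a hypothetical packed (segment, offset) value.

#include <cstdint>
#include <unordered_map>

using InodeNumber = uint32_t;
using DiskAddr = uint64_t;  // hypothetical: segment number and offset packed together

// Maps inode numbers to the current on-disk location of each inode.
// Updated whenever an inode is rewritten to a new place in the log; the map's
// own blocks are also appended to the log and found again via checkpoints.
class InodeMap {
 public:
    void Update(InodeNumber ino, DiskAddr where) { map_[ino] = where; }
    DiskAddr Lookup(InodeNumber ino) const { return map_.at(ino); }  // throws if unknown

 private:
    std::unordered_map<InodeNumber, DiskAddr> map_;
};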

Reading a File

Reading a file now involves the following steps:
  1. The current directory's inode number is looked up in the inode map to get its on-disk address.
  2. The directory's inode is retrieved to find the location of its directory blocks.
  3. The directory blocks are read to find the inode number of the target file or subdirectory.
  4. Steps 1-3 are repeated until the desired file's inode number is found.
  5. The file's inode number is looked up in the inode map to get its on-disk address.
  6. The file's inode is retrieved to determine the location of its data blocks.
  7. The file's data blocks are read.
Note that the inode map can be kept in memory to avoid the cost of these lookups.

Writing to a File

Writing a file is also streamlined:
  1. The file's data is written to allocated file blocks (possibly new ones), and then buffered.
  2. The file's inode is updated to reflect the updated block locations within the log. This updated inode is also buffered.
  3. If the file is new:
    1. The directory block is updated to include an entry for the new file and buffered.
    2. The directory's inode is updated to reflect the new directory block location and buffered.
  4. The inode map is updated to reflect the new inode locations and buffered.
  5. All buffered updates are then written sequentially to segments.

Immutability at Play

The core principle behind LFS is leveraging the immutable nature of committed log entries. Once data is written to the log, it is not overwritten (though obsolete data can be reclaimed through cleaning, as we'll see). Immutability is a powerful concept with broad applications in system design. I highly recommend Pat Helland's Immutability Changes Everything for further exploration of this topic.

A Note on Locality

File system writes can exhibit two types of locality:

  • Logical Locality: This refers to the arrangement of file blocks contiguously on the disk. This allows the entire file content to be read sequentially in a single operation, regardless of the order in which the blocks were originally written. For example, a client might write to block 1 of a file, then write data to other files, and later write to block 2 of the same file. With logical locality, these blocks, though written out of order, are stored adjacent to each other on the disk, enabling sequential reads.

  • Temporal Locality: This refers to the arrangement of file blocks according to the order in which they were written. In the example above, blocks of the same file would be scattered across the disk, reflecting the order of writes. Reading the entire file would then require multiple, non-sequential (random) disk accesses.

LFS prioritizes temporal locality of writes, meaning blocks are written in the order they are received.

Segment Cleaning

Segments are not necessarily stored contiguously on disk; they can be threaded together.

As file blocks, updated inodes, and updated inode map blocks are written to new locations, the older versions become candidates for garbage collection.

Segment cleaning involves identifying a segment containing obsolete data and copying all the live (still in use) blocks within that segment to a new, active segment. This process frees up the original segment, making it available for new writes. Because writes to a segment must be sequential from beginning to end, clean segments are essential for continued operation.

The cleaning process reads M segments and compacts their live data into N clean segments (N < M), yielding a net gain of M - N clean segments.

Write Cost

Segment cleaning introduces overhead to the write process. The total write cost isn't just the expense of writing new log data; it also includes the cost of relocating live blocks during segment cleaning.

The authors provide a conservative estimate of this cost based on segment utilization (u). Consider N segments with an average utilization u:

  • Reading the N segments costs N.
  • Writing the live data out to new segments costs N*u.
  • The cleaning process yields N*(1 - u) free space for new data.

The effective write cost is defined as the ratio of total disk reads and writes to the writes of new data only. This represents the write amplification factor.

Write cost = (N + N*u + N*(1-u)) / (N*(1-u)) = 2 / (1-u)

As segment utilization increases, so does the write cost. This formulation omits rotational latency and disk seek time, as these are considered negligible for the sequential reads and writes characteristic of log-structured file systems.

The formula shows that the write cost increases as utilization increases. For example, at a segment utilization of 80%, the write cost is 2 / (1 - 0.8) = 10 (see Figure 3).

Optimizing Write Cost

Files are categorized into two groups based on write frequency:

  • Hot files: The 10% of files that account for 90% of write operations.
  • Cold files: The remaining 90% of files, which account for only 10% of writes.

During segment cleaning, blocks are sorted by age and re-written. Stable blocks (older, less frequently modified) are written to new segments first, followed by newer blocks. This sorting leads to a rough division of segments into two categories:

  • Hot segments: These segments contain recently written data and are more likely to have blocks invalidated frequently.
  • Cold segments: These segments contain stable, valid data blocks.

Hot segments are more likely to contain hot files, while cold segments are more likely to contain cold files.

Now, we will see how segments are selected for cleaning.

Approach 1: Greedy 

One approach to segment cleaning is to use a fixed cleaning threshold. A segment is cleaned if its utilization falls below this predefined threshold.

Figure 5 illustrates this: below the cleaning threshold, there are no segments because those segments would have already been cleaned and freed. The curve slopes downward after the cleaning threshold, indicating that most hot segments tend to have lower utilization (just above the cleaning threshold).

The primary disadvantage of a fixed cleaning threshold is that the utilization of hot segments drops rapidly as data becomes obsolete. These segments quickly reach the cleaning threshold and are cleaned frequently. This high cleaning frequency is inefficient.

Conversely, cold segments experience less rapid utilization decline. Once cleaned, they tend to remain stable and require less frequent cleaning. 

When a cold segment is cleaned, its live blocks are likely moved to another cold segment, which itself will then be less likely to require immediate cleaning. However, when a hot segment is cleaned, its blocks are moved to another hot segment, which will likely also need cleaning soon.

Approach 2: Cost Benefit

We can model the benefit of cleaning a segment as a function of two factors:

  1. The amount of free space in the segment (1-u). More free space translates to a greater benefit from cleaning.
  2. The "age" of the segment, defined as the time elapsed since the most recent modification of any block within that segment. Younger segments are less beneficial to clean.

The cost of cleaning a segment is 1+u (1 represents the cost of reading the segment, and u represents the cost of rewriting the live blocks).

Therefore, the cost-benefit of cleaning a segment can be expressed as: 

Cost-benefit = (1-u) * age / (1+u)
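
A short sketch of picking the next segment to clean by this ratio; the SegmentInfo fields mirror what the segment usage table (described below) tracks.

#include <cstddef>
#include <vector>

// Per-segment bookkeeping, as maintained by the segment usage table.
struct SegmentInfo {
    double utilization;  // u: fraction of the segment still holding live data
    double age;          // time since the youngest block in the segment was written
};

// Returns the index of the segment with the highest (1 - u) * age / (1 + u),
// or -1 if every segment scores zero (e.g., all fully utilized).
int PickSegmentToClean(const std::vector<SegmentInfo>& segments) {
    int best = -1;
    double best_score = 0.0;
    for (size_t i = 0; i < segments.size(); ++i) {
        const SegmentInfo& s = segments[i];
        double score = (1.0 - s.utilization) * s.age / (1.0 + s.utilization);
        if (score > best_score) {
            best_score = score;
            best = (int) i;
        }
    }
    return best;
}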

The cost-benefit approach results in a bimodal distribution of segment utilization: hot segments tend to have lower utilization, while cold segments have higher utilization.

The fixed cleaning threshold approach focused solely on maximizing free space (cleaning as much as possible). The cost-benefit approach, however, balances the cost of cleaning with the benefit of the resulting free space.

If cold segments have limited free space, the cost-benefit policy favors cleaning hot segments, even though they have a higher cleaning cost, to maximize the amount of free space gained. Conversely, if cold segments have sufficient free space, they are prioritized for cleaning due to their lower cleaning cost.

This policy achieves a good balance between cleaning cost and benefit. The write cost is typically between 2 and 4 when the overall utilization is around 80%.

Segment Usage Table

All these optimizations require a new data structure, the segment usage table, which tracks each segment's utilization (u) and the last write time of its blocks. The blocks of the segment usage table are themselves written to the log, so their locations must be tracked as well.

Checkpointing

The checkpoint region stores the locations of the inode map and the segment usage table. It also records a pointer to the last segment written in the log, which is crucial for recovery.

Unlike data blocks, inodes, and the inode map, the checkpoint region resides at a fixed location on the disk.

A checkpoint represents a point in the log where all file system structures are consistent and complete. Checkpoints are written to two distinct locations on disk for redundancy. This replication is essential because checkpoint loss would lead to file system corruption.

Replication also enables atomic checkpoint updates. To write a checkpoint:

  1. All checkpoint blocks are written.
  2. A timestamp is written atomically to a specific block.

Until the timestamp is successfully written, any partial checkpoint data is considered invalid. Upon system restart, the operating system selects the checkpoint with the highest (most recent) timestamp, ensuring that only complete and consistent checkpoints are used. This atomic timestamp update guarantees checkpoint integrity.
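
A sketch of the write-body-first, timestamp-last protocol with two alternating regions; the structures and in-memory stand-ins below are illustrative, not the on-disk format.

#include <cstdint>

// Illustrative checkpoint contents: locations of the inode map and segment
// usage table blocks, plus a timestamp that is written last.
struct CheckpointRegion {
    uint64_t inode_map_addr = 0;
    uint64_t segment_usage_table_addr = 0;
    uint64_t last_segment_addr = 0;
    uint64_t timestamp = 0;  // a checkpoint is valid only once this is written
};

CheckpointRegion regions[2];  // stand-in for the two fixed on-disk regions
int next_region = 0;

void WriteCheckpoint(const CheckpointRegion& body, uint64_t now) {
    CheckpointRegion staged = body;
    staged.timestamp = 0;                  // 1. write all checkpoint blocks first...
    regions[next_region] = staged;
    regions[next_region].timestamp = now;  // 2. ...then write the timestamp last
    next_region = 1 - next_region;         // alternate regions on the next checkpoint
}

// On restart, trust the region with the most recent (largest) timestamp.
const CheckpointRegion& LatestCheckpoint() {
    return regions[0].timestamp >= regions[1].timestamp ? regions[0] : regions[1];
}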

Roll Forward Recovery

Checkpoints are written every 30 seconds. Consequently, a crash might occur before the latest checkpoint is written, leaving up to 30 seconds of log data untracked. It's sometimes possible to recover some of this lost information through a process called roll-forward recovery.

Segment Summary

The segment summary block describes the contents of its associated segment, including file numbers (inode numbers) and offsets for each block within the segment. Critically, the segment summary block itself is written to the log, not stored at a fixed disk location.

The segment summary is essential for identifying all live (valid) blocks within a segment.

Directory Operation Log

A directory operation log, a special record within the log, captures each directory modification. This record contains an operation code, the location of the directory entry, the entry's content (name and inode number), and the updated reference count for the named inode. Like other log entries, the directory operation log is stored within the log itself.

Crucially, the directory operation log is written before the corresponding file inode or directory block. This ordering facilitates reference fixing during recovery.

Recovery Steps

The recovery process for an LFS involves the following steps:

  1. The recovery program scans all segment summary blocks (by reading segments). When a summary block reveals a new inode, that inode's information is added to the inode map. The segment usage table is also updated based on the information in the segment summary blocks, reflecting the current state of each segment. If an inode for a file was never written, its associated data blocks are ignored during recovery.

  2. Each inode maintains a reference count, indicating the number of directories referencing it. A key challenge is ensuring consistency between directory entries (filename - inode number mappings) and the newly recovered inodes. If a crash occurred after writing an inode but before updating its corresponding directory entry, the directory operation log comes into play. These logs are used to correct the contents of directory blocks, ensuring they accurately reflect the current state of files.

The recovery program then appends the modified directory blocks, their inodes, the file inodes, the updated inode map, and the updated segment usage table to the log. Finally, a new checkpoint region is written to include these changes.

Similar recovery processes exist in other file systems. This explains, in part, why the operating system boot process can take a noticeable amount of time.

Summary of Disk Organization in LFS


Apart from the superblock, checkpoint regions are located at the beginning and end of the disk.

The remaining disk space is divided into equal-sized segments, linked together in a list. Each segment contains blocks that can be classified as live (in use), obsolete (no longer needed), or free.

Segment cleaning is the process of identifying and consolidating live blocks to reclaim space. The concept is analogous to garbage collection in programming: valid file/directory blocks are those reachable from the root directory, just as valid objects in a program are those reachable from the roots (globals, the stack, and registers).

Summary of New Data Structures in LFS

Log-structured file systems introduce several key data structures:

  • Inode map: Maps inode numbers to the inode's on-disk locations.
  • Segment summary: Describes the contents of a segment, including file numbers and block offsets.
  • Segment usage table: Tracks the amount of live data within each segment and the last write time for data in those segments.
  • Checkpoint region*: Stores the locations of the inode map and segment usage table.
  • Directory change log: Records directory operations to maintain consistency of inode reference counts.
* Located at a fixed location on disk.

Evaluation

The evaluation of LFS performance reveals the following:

  • Small File Performance: LFS demonstrates a significant performance advantage for small file operations, exhibiting up to a 10x speedup for new file writes. Reads are also accelerated due to the sequential arrangement of file blocks for multiple small files.

  • Large File Performance:

    • Sequential writes to large files are faster in LFS. While sequential file writes in traditional file systems can lead to random disk writes (due to inode and directory updates), LFS converts all writes to sequential operations.
    • Random writes to large files are, as expected, considerably faster with LFS.
    • Sequential and random reads of large files experience a performance degradation because LFS does not preserve logical locality after writes.
  • Cleaning Overhead: The observed write cost is significantly lower than predicted by simulations. Measured write costs ranged from 1.2 to 1.6, compared to the simulated range of 2.5 to 3. The bimodal distribution of segment utilization is highly skewed, with peaks at the extreme ends. This indicates the presence of many extremely cold segments that remain unmodified.

  • Recovery Time: Recovering 50,000 1 KB files required a maximum of 132 seconds. Recovery time grows with the checkpoint interval; roughly, each additional hour of checkpoint interval adds on the order of 70 seconds to the recovery time.

Paper Review

This paper capitalized on the prevailing trend of the time, addressing the bottleneck of slow disk access speeds. Such papers often enjoy initial popularity but may fade as technology evolves. However, LFS proved its value and enduring relevance, as will be discussed in the analysis of the next paper.

This is a well-written paper with clever ideas, fundamentally based on the concept of immutability. I highly recommend it as an excellent introduction to file system concepts.
