
Paper Insights #6 - F2FS: A New File System for Flash Storage

This 2015 paper, authored by researchers at Samsung, Korea, was presented at the highly regarded Usenix FAST conference, which focuses on file and storage technologies. This paper is of particular interest for several reasons: it was featured in Stanford's CS240 curriculum, it provides valuable insights into contemporary SSD-optimized file systems, and it highlights the increasing importance of SSDs in an era of escalating DDR5 costs and exponential data growth.

Paper Link

Must Read: The Design and Implementation of a Log-Structured File System where I introduced several basic file system concepts.

Let's begin with some basic concepts.

Computer Memory

Computer memory can be broadly divided into two main types: volatile and non-volatile. Volatile memory, exemplified by Random Access Memory (RAM), requires continuous power to maintain stored data; data is lost upon power interruption. Conversely, non-volatile memory retains data persistently, even when power is removed.

Non-Volatile Memory

Non-volatile memory can be further categorized into distinct types:

  • EEPROM (Electrically Erasable Programmable Read-Only Memory): Data can be electrically written and erased, providing flexibility.
  • EPROM (Erasable Programmable Read-Only Memory): Data is electrically written but requires ultraviolet light exposure for erasure.
  • PROM (Programmable Read-Only Memory): Data can be written only once, typically by the end-user.
  • ROM (Read-Only Memory): Data is permanently written during manufacturing and cannot be erased.

Flash Memory

Flash memory is a type of EEPROM that has become ubiquitous in modern technology, finding applications in solid-state drives (SSDs), mobile phones, memory cards, and USB flash drives. EEPROM, in turn, is built upon MOSFET technology.

Memory Format

Flash Cells

A flash cell is the fundamental unit for storing a bit of data. Flash cells can store varying numbers of bits:

  • SLC (Single-Level Cell): 1 bit per cell
  • MLC (Multi-Level Cell): 2 bits per cell
  • TLC (Triple-Level Cell): 3 bits per cell
  • QLC (Quad-Level Cell): 4 bits per cell   
  • PLC (Penta-Level Cell): 5 bits per cell
Higher bit densities per cell increase storage capacity but generally result in slower access speeds and lower write endurance.

Pages

A page is a collection of flash cells, typically ranging from 2 to 4 KB in size. The page is the smallest unit that can be read and programmed (written) in flash memory.

Blocks

A block is a group of pages, typically containing 32 to 128 pages. The block is the smallest unit that can be erased in flash memory, and a block must be erased before its pages can be rewritten.

Planes

Blocks are organized into planes. Reads and writes can be performed concurrently across different planes, increasing parallelism.

Dies

A die consists of one or more planes. Multiple dies are placed on a circuit board. A controller circuit is integrated to manage the storage area and communicate with the device driver.

Packages

A package combines multiple dies into a single physical unit.

NOR vs. NAND Flash Memory

Flash memory is broadly categorized into two primary types: NOR and NAND. These types differ in their underlying logic gate arrangements, resulting in variations in operational performance.

While a detailed discussion of the architectural differences is beyond the scope of this article, the following key distinctions can be made:

  • NOR flash typically offers lower storage density compared to NAND flash, meaning NAND can store more data in the same physical space.
  • NOR flash is generally more expensive than NAND flash.
  • NOR flash excels at random read operations, allowing fast access to individual memory locations.
  • NOR flash exhibits slower data erasure speeds compared to NAND flash.
  • NOR flash is commonly used for program storage, where random read access is crucial. NAND flash is predominantly used for data storage, where high density and sequential write/read performance are prioritized.

NAND Flash

NAND flash memory is a cost-effective and widely adopted form of flash storage. However, it exhibits two significant limitations:

  • Block-Level Erasure: Erase operations can only be performed at the block level, while writes (programs) happen at the page level; an entire block must be erased before any of its pages can be rewritten.
  • Limited Endurance: SSDs have a finite write endurance, typically expressed as a multiple of the device capacity (e.g., 600x). Every program/erase cycle contributes to wear, eventually leading to device failure.

Repeated writes to the same logical block can accelerate wear. To mitigate this, SSD controllers employ wear-leveling techniques, distributing write operations across different physical blocks. Consequently, the logical block addresses presented to the operating system differ from the actual physical block addresses.

This concept bears similarity to the log-structured file system (LFS) approach.

To manage the dynamic mapping between logical and physical block addresses, a translation mechanism is required. This mechanism, often implemented within the SSD controller's firmware, maintains a mapping table to track the current physical location of each logical block.

Flash Translation Layer (FTL)

The Flash Translation Layer (FTL) is firmware responsible for dynamically translating logical block addresses (LBAs), as seen by the operating system, into physical block addresses (PBAs). This translation is performed to:

  • Perform wear-leveling by distributing write operations evenly across the flash memory.
  • Ensure that write operations are directed to free physical blocks, which have been previously erased and are ready for data.

Similar to virtual memory page tables, the FTL could have mapped logical page numbers (LPNs) to physical page numbers (PPNs). However, due to the memory overhead of maintaining page-level mappings (e.g., 4 bytes per entry for every flash page), FTLs often implement block-level mappings:

FTL: Logical Block Number (LBN) -> Physical Block Number (PBN)

This approach is conceptually similar to the use of superpages in operating systems, where large contiguous page regions are mapped with a single page table entry.
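
To make the translation concrete, here is a minimal, purely illustrative sketch of a block-level FTL. The class name, the least-worn-free-block policy, and the in-memory flash model are my own simplifications, not details from the paper or any real firmware.

```python
# Illustrative block-level FTL (in-memory toy model, not real firmware).
class BlockFTL:
    def __init__(self, num_physical_blocks):
        self.flash = [None] * num_physical_blocks      # simulated physical blocks
        self.erase_counts = [0] * num_physical_blocks  # crude wear indicator
        self.free_blocks = set(range(num_physical_blocks))
        self.lbn_to_pbn = {}                           # logical -> physical block

    def write_block(self, lbn, data):
        # Wear-leveling: always program the least-worn free (erased) block.
        target = min(self.free_blocks, key=lambda b: self.erase_counts[b])
        self.free_blocks.remove(target)
        self.flash[target] = data

        old = self.lbn_to_pbn.get(lbn)
        if old is not None:
            # The previous physical block is now stale: erase and recycle it.
            self.flash[old] = None
            self.erase_counts[old] += 1
            self.free_blocks.add(old)
        self.lbn_to_pbn[lbn] = target

    def read_block(self, lbn):
        return self.flash[self.lbn_to_pbn[lbn]]

ftl = BlockFTL(num_physical_blocks=8)
ftl.write_block(0, b"v1")
ftl.write_block(0, b"v2")   # same LBN lands on a different physical block
print(ftl.lbn_to_pbn[0], ftl.read_block(0))
```

Note how overwriting the same LBN never reprograms the same physical block in place, which is exactly the property wear-leveling needs.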

Log Flash Blocks

While the FTL effectively manages wear-leveling, it introduces write amplification. To rewrite any page within a logical block, the FTL must program a new physical block, relocating the still-valid pages from the original block. This relocation is performed entirely by the SSD controller.

To address the performance impact of block-level erasure during page overwrites, SSD controllers utilize log flash blocks. When a page within a block is modified, the updated page is written to a log flash block. Periodically, the contents of the log flash block are committed to their corresponding physical blocks.

The mapping between log flash blocks and physical flash blocks employs different associativity schemes:

  • Block Associative: A log flash block is dedicated to a specific physical flash block.
  • Fully Associative: A log flash block can be used for any physical flash block.
  • Set Associative: A log flash block is dedicated to a set of physical flash blocks.

Fully associative or set associative schemes are typically employed to optimize performance under random write workloads.
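
Here is a rough sketch of how a fully associative log-block pool might absorb page overwrites; the structure, numbers, and merge trigger are my own simplified assumptions, not the controller behavior described in the paper.

```python
# Simplified fully-associative log-block pool: an overwritten page from ANY
# data block may land in the shared pool (numbers and merge policy are mine).
PAGES_PER_BLOCK = 4

class LogBlockPool:
    def __init__(self, num_log_blocks):
        self.capacity = num_log_blocks * PAGES_PER_BLOCK
        self.entries = {}                 # (block_no, page_no) -> new page data

    def overwrite_page(self, block_no, page_no, data):
        if len(self.entries) >= self.capacity:
            self.merge()                  # pool is full: fold updates back
        self.entries[(block_no, page_no)] = data

    def merge(self):
        # A real controller would rewrite each affected data block (valid pages
        # plus logged updates) and then erase the log blocks; here we just clear.
        print(f"merging {len(self.entries)} logged pages into their data blocks")
        self.entries.clear()

pool = LogBlockPool(num_log_blocks=1)
for i in range(5):                        # the 5th overwrite triggers a merge
    pool.overwrite_page(block_no=i, page_no=0, data=f"update-{i}")
```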

SATA v/s PCIe

SSDs utilize different hardware interfaces for data communication, notably SATA and PCIe.

SATA (Serial Advanced Technology Attachment) is a legacy interface, originally developed for hard disk drives, that was subsequently adopted for SSDs to maintain backward compatibility with existing operating systems.

PCIe (Peripheral Component Interconnect Express) is a more modern interface that provides higher data transfer rates by using multiple parallel data lanes (typically four for NVMe SSDs), unlike SATA's single lane.

Virtual Memory

Virtual memory provides an abstraction of physical memory resources, allowing programs to operate as if they have access to a much larger, contiguous memory space. 

The operating system, in conjunction with hardware like the Memory Management Unit, handles the translation between virtual addresses (used by programs) and physical addresses (actual memory locations). This system allows for the sharing of memory between processes, improves security by isolating memory spaces, and enables programs to utilize more memory than is physically installed, typically through techniques like paging or segmentation.

Memory v/s Disk Pages

Virtual memory uses "pages" to divide both physical memory (RAM) and disk storage. Virtual memory pages are of two types:

  • Memory Pages:
    • These pages hold actively used data and program instructions. They provide fast access for the CPU. 
    • Primary location is RAM. However, when RAM is full, less frequently used memory pages may be paged out (written) to disk.
    • If a program tries to access a page that has been paged out, a page fault occurs. The operating system then retrieves the page from disk and loads it back into RAM.
  • Disk Pages:
    • These pages store data that has been committed to the disk.
    • Primary location is disk. However, disk pages can be loaded into RAM if free RAM space is available.
RAM is the primary location for actively used, uncommitted data, while disk is the primary location for committed data. Memory pages are written to disk only when RAM is full, and this is done to free RAM, not to commit data. Disk pages are loaded into RAM when free RAM is available, which improves performance.


Page Cache

The portion of RAM that stores the disk pages is called a page cache or a disk cache. It significantly speeds up access to frequently used data on disk.

SSD Performance

Random Reads: Maximum random read throughput is achieved with 64 KB read sizes. Throughput decreases for read sizes smaller than 64 KB.

Sequential Writes: Optimal linear write performance is observed with write sizes of 64 KB or greater. Write performance degrades significantly for write sizes of 8 KB or less, as such writes approach or fall below the flash page size, necessitating read-modify-write operations.

Random Writes: Random write performance exhibits a sharp decline for write sizes below 4 MB. This degradation is primarily attributed to write amplification, which arises from the need to locate and manage free blocks for smaller, scattered writes. A block is typically around 4 MB, so writes smaller than that scatter across blocks and trigger many block-level operations.

Linear writes are more efficient because they write data sequentially to a block. Once a block is erased, it can be filled completely with linear write data, minimizing overwrite operations. Random writes, conversely, can result in numerous small writes distributed across multiple blocks. This leads to substantial write amplification, as even a small write to a block may trigger extensive block-level operations.
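
As a back-of-the-envelope illustration of the write amplification involved, assume (hypothetically) a 4 MB erase block whose remaining pages all hold valid data when a single 4 KB random write arrives:

```python
# Worst-case write amplification for one small random write into a full block.
BLOCK_SIZE = 4 * 1024 * 1024          # assumed 4 MB erase block
host_write = 4 * 1024                 # a single 4 KB random write
relocated = BLOCK_SIZE - host_write   # valid pages copied to the new block

write_amplification = (host_write + relocated) / host_write
print(write_amplification)            # 1024.0 in this worst case
```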

LFS at Rescue!

Log-Structured File Systems (LFS) improve write performance by treating the disk as a sequential log. This conversion of random writes to sequential writes is particularly advantageous for SSDs. Although this paper diverges from a strict LFS implementation, it adapts certain LFS concepts to better suit SSD workloads.

The following sections require familiarity with LFS, so I would recommend reading about it in my previous post.

Flash-Friendly File System

F2FS (Flash-Friendly File System) is based on the principles of LFS, with specific adaptations for flash memory.

Disk Organization

F2FS partitions the disk into two distinct areas: a sequential write area and a random write area.

Sequential Write Area (Main Area)

This area is dedicated to log-structured writes of data, including file blocks and inodes. All writes are converted into sequential log entries. Segments within this area are categorized into:
  • Node Segments: Store node blocks (inodes along with direct and indirect node blocks).
  • Data Segments: Store file/directory data blocks.

Random Write Area (Metadata Area)

This area stores critical file system metadata at fixed logical addresses:
  • Superblock: Contains fundamental partitioning information.
  • Checkpoint: Similar to LFS checkpoints, replicated twice, and stores the addresses of the Node Address Table (NAT) and Segment Information Table (SIT).
  • Segment Information Table (SIT): Analogous to the segment usage table in LFS, it tracks the number of valid blocks and a bitmap of valid blocks within the main area, facilitating segment cleaning.
  • Node Address Table (NAT): Maps node IDs (inode numbers and the IDs of other node blocks) to their current on-disk addresses in the main area, playing the role of LFS's inode map.
  • Segment Summary Area (SSA): Records the owner information (parent node and offset) of every block in the main area; in effect, it maps data blocks back to their inodes.
The metadata residing in the random write area has fixed logical addresses. This can lead to potential wear-leveling issues on the physical blocks where metadata is stored. However, the FTL effectively mitigates this by translating these fixed logical writes into writes to new physical blocks, ensuring wear-leveling. 

To further reduce repetitive writes, the NAT and SIT incorporate journaling mechanisms.

Also note that, the NAT, SIT, and Checkpoint data are written to two distinct locations on the disk, providing redundancy. During checkpoint creation, the updated version is committed to disk. The SSA, written to a single location, can be reconstructed from other available information.
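
Putting the two areas together, the layout described above can be modeled roughly as follows; the field names and Python structures are mine, chosen only to mirror the description.

```python
# Rough in-memory model of the layout described above; all names are mine.
from dataclasses import dataclass, field

@dataclass
class MetadataArea:                      # random-write area, fixed logical addresses
    superblock: dict = field(default_factory=dict)
    checkpoint_packs: list = field(default_factory=lambda: [{}, {}])  # two copies
    sit: dict = field(default_factory=dict)   # segment no -> (valid count, bitmap)
    nat: dict = field(default_factory=dict)   # node id -> on-disk block address
    ssa: dict = field(default_factory=dict)   # block address -> (owner node, offset)

@dataclass
class MainArea:                          # sequential-write (log) area
    node_segments: list = field(default_factory=list)
    data_segments: list = field(default_factory=list)

@dataclass
class F2FSVolume:
    metadata: MetadataArea = field(default_factory=MetadataArea)
    main: MainArea = field(default_factory=MainArea)

vol = F2FSVolume()
vol.metadata.nat[42] = 500               # inode 42 currently lives at block 500
```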

Reading a File

Reading a file is similar to that in LFS, except that the inode number is looked up in the NAT instead of the inode map (a short sketch follows these steps):
  1. The current directory's inode number is looked up in the NAT to obtain its on-disk address.
  2. The directory's inode is retrieved to find the location of its directory blocks.
  3. The directory blocks are read to locate the inode number of the target file or subdirectory.
  4. Steps 1-3 are repeated until the desired file's inode number is found.
  5. The file's inode number is looked up in the NAT to get its on-disk address.
  6. The file's inode is retrieved to determine the location of its data blocks.
  7. The file's data blocks are read.
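
Purely as an illustration of these steps, here is a toy sketch. The nat dictionary and the read_block helper are stand-ins I invented; they are not F2FS interfaces.

```python
# Toy path resolution via a NAT-like table. read_block is a stand-in that
# returns whatever object lives at a given disk address.
def resolve_path(path, root_ino, nat, read_block):
    ino = root_ino
    for name in path.strip("/").split("/"):
        inode = read_block(nat[ino])            # NAT: inode number -> disk address
        entries = read_dir_entries(inode, read_block)
        ino = entries[name]                     # child's inode number
    return ino

def read_file(ino, nat, read_block):
    inode = read_block(nat[ino])                # look the file up in the NAT
    return [read_block(addr) for addr in inode["data_block_addrs"]]

def read_dir_entries(dir_inode, read_block):
    # Flatten all dentry blocks of the directory into {name: inode number}.
    entries = {}
    for addr in dir_inode["data_block_addrs"]:
        entries.update(read_block(addr))
    return entries

disk = {
    1: {"data_block_addrs": [10]},          # root inode
    10: {"etc": 2},                         # root's dentry block: name -> ino
    2: {"data_block_addrs": [20]},          # "etc" inode
    20: {"hosts": 3},                       # etc's dentry block
    3: {"data_block_addrs": [30]},          # "hosts" inode
    30: b"127.0.0.1 localhost\n",           # file data block
}
nat = {1: 1, 2: 2, 3: 3}                    # node id -> disk address (toy values)
ino = resolve_path("etc/hosts", root_ino=1, nat=nat, read_block=disk.__getitem__)
print(read_file(ino, nat, read_block=disk.__getitem__))
```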

Writing to a File

Writing to a file is also similar to that in LFS, with the notable exception that NAT and SIT updates are journaled and buffered until a checkpoint is performed (see the sketch after this list):
  1. The file's data is written to allocated file blocks (possibly new ones) and buffered.
  2. The file's inode is updated to reflect the updated block locations within the log. This updated inode is also buffered.
  3. If the file is new:
    1. The directory block is updated to include an entry for the new file and buffered.
    2. The directory's inode is updated to reflect the new directory block location and buffered.
  4. A journal entry is added to update the NAT and SIT and buffered.
  5. All buffered updates from steps 1-3 are then written to segments. Data blocks go to data segments, while node blocks go to node segments.
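
The same flow in a toy sketch, with plain lists standing in for the page cache and the NAT/SIT journals (again, my own names, not F2FS code):

```python
# Toy sketch of the buffered write path above (lists stand in for the page
# cache and the NAT/SIT journals; none of this is F2FS's real interface).
dirty_data_blocks = []     # will be flushed to data segments
dirty_node_blocks = []     # will be flushed to node segments
nat_sit_journal = []       # stays in memory until the next checkpoint

def buffer_file_write(ino, inode, new_blocks):
    # Steps 1-2: buffer the data blocks and the updated inode.
    dirty_data_blocks.extend(new_blocks)
    inode["data_block_addrs"] = [b["planned_addr"] for b in new_blocks]
    dirty_node_blocks.append(inode)
    # Step 4: journal the NAT/SIT updates rather than rewriting those tables.
    nat_sit_journal.append(("nat", ino, "new on-disk address of the inode"))
    nat_sit_journal.append(("sit", "valid-block deltas of touched segments"))

def flush_to_segments():
    # Step 5: data blocks go to data segments, node blocks to node segments.
    # The NAT/SIT journal is deliberately NOT flushed here (checkpoint only).
    print(f"data segment <- {len(dirty_data_blocks)} blocks, "
          f"node segment <- {len(dirty_node_blocks)} blocks")
    dirty_data_blocks.clear()
    dirty_node_blocks.clear()

buffer_file_write(ino=42,
                  inode={"ino": 42},
                  new_blocks=[{"planned_addr": 500, "data": b"hello"}])
flush_to_segments()
```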

fsync

F2FS utilizes the page cache: file writes initially land in in-memory disk pages. However, these in-memory changes are not persistently committed to disk until fsync is called on the file.

Even after fsync, only the file's data and inode are written to their respective data and node segments. Crucially, the NAT and SIT updates, which are maintained as journal entries, remain buffered. These vital metadata updates are only committed to disk during checkpoint operations.

Inodes

An inode contains:

  • File attributes and metadata, including the file name, inode number, size, and modification time (mtime).
  • Direct pointers to the file's data blocks or, for small files, inline data.
  • Indirect pointers, which point to blocks containing further pointers to the file's data blocks.

Given a disk page size of 4 KB, small inodes are inefficient. To maximize space utilization, data for files smaller than 3692 bytes is inlined within the inode.
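
A toy inode mirroring this description might look as follows; the field names and the inline-data check are illustrative only.

```python
# Toy inode layout mirroring the description above; field names and the inline
# threshold check are illustrative.
from dataclasses import dataclass, field
from typing import List, Optional

INLINE_DATA_LIMIT = 3692                 # bytes, as quoted above

@dataclass
class Inode:
    ino: int
    name: str
    size: int
    mtime: float
    inline_data: Optional[bytes] = None                  # used for very small files
    direct_ptrs: List[int] = field(default_factory=list)       # data block addresses
    indirect_node_ids: List[int] = field(default_factory=list)  # resolved via NAT

def store_small_file(inode: Inode, data: bytes) -> None:
    if len(data) < INLINE_DATA_LIMIT:
        inode.inline_data = data         # no separate data block is needed
    else:
        raise ValueError("too large to inline; allocate data blocks instead")

cfg = Inode(ino=7, name="app.cfg", size=120, mtime=0.0)
store_small_file(cfg, b"debug=false\n")
```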

Problem: Wandering Tree Problem in LFS

The inode map in LFS only tracks the location of the inode itself, not its associated indirect blocks. The addresses of these indirect blocks are solely maintained within the inode.

This design introduces a critical challenge: when a new entry is added to an indirect block, the block must be written to the log, resulting in a change of its on-disk address. Consequently, the inode, which stores the indirect block's address, must also be updated and written to the log. Again, writing the inode to the log causes its own address to change, necessitating an update to the inode map. This, in turn, requires the inode map to be written to the log, creating a cascading update problem.

Solution: Inode Numbers for Indirect Blocks

In F2FS, indirect pointer blocks are also assigned inode numbers, requiring translation to physical addresses via the NAT.

The use of inode numbers for indirect blocks, registered within the NAT, addresses the wandering tree problem associated with LFS. When an entry is added to an indirect block, only the NAT needs to be updated; the inode itself remains unchanged.
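
A tiny sketch of why the cascade stops (the addresses and IDs are made up): the inode stores only the indirect block's node ID, and the NAT alone maps that ID to a disk address.

```python
# The inode stores the indirect block's node ID; only the NAT maps that ID to
# a disk address, so moving the indirect block touches the NAT alone.
nat = {101: 1000,      # the inode's own node ID -> on-disk address
       202: 2000}      # the indirect node block's ID -> on-disk address

inode = {"node_id": 101, "indirect_node_ids": [202]}   # IDs, not addresses

def rewrite_indirect_block(node_id, new_addr):
    nat[node_id] = new_addr   # the indirect block moved in the log
    # The inode is untouched: it still refers to node ID 202.

rewrite_indirect_block(202, 3000)
print(inode, nat)   # inode unchanged; only NAT entry 202 moved
```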

Directory Structure

A 4 KB directory block (also known as a dentry block) is organized as follows:

  • Bitmap (27 bytes, covering 214 slots): Indicates which dentry slots in the block are valid.
  • Dentries (214 slots x 11 bytes per slot): Each slot contains:
    • Hash value
    • Inode number
    • Name length
    • Type (file or directory)
  • File name slots (214 slots x 8 bytes per slot): Each slot holds up to 8 bytes of a file name; longer names span multiple consecutive slots.

These directory blocks are structured as a multi-level hash table, which exhibits O(N) worst-case complexity. In contrast, file systems like XFS employ B-tree structures, offering a more efficient O(log N) worst-case complexity.
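
For illustration, a toy dentry block following this layout might be modeled as below; the hash function and slot-allocation loop are my own stand-ins, not F2FS's actual on-disk code.

```python
# Toy dentry-block model following the layout above; the hash function and the
# slot-allocation loop are stand-ins, not F2FS's on-disk code.
NR_DENTRY_SLOTS = 214
SLOT_NAME_LEN = 8            # bytes of file name per slot

def name_hash(name: str) -> int:
    return hash(name) & 0xFFFFFFFF    # placeholder; F2FS uses its own hash

class DentryBlock:
    def __init__(self):
        self.valid = [False] * NR_DENTRY_SLOTS       # the validity bitmap
        self.dentries = [None] * NR_DENTRY_SLOTS     # (hash, ino, name_len, type)
        self.names = [b""] * NR_DENTRY_SLOTS         # 8-byte name slots

    def add(self, name: str, ino: int, ftype: str):
        encoded = name.encode()
        needed = (len(encoded) + SLOT_NAME_LEN - 1) // SLOT_NAME_LEN
        # Long names occupy a run of consecutive slots.
        for start in range(NR_DENTRY_SLOTS - needed + 1):
            if not any(self.valid[start:start + needed]):
                self.dentries[start] = (name_hash(name), ino, len(encoded), ftype)
                for i in range(needed):
                    self.valid[start + i] = True
                    self.names[start + i] = encoded[i * SLOT_NAME_LEN:
                                                    (i + 1) * SLOT_NAME_LEN]
                return
        raise RuntimeError("dentry block is full")

blk = DentryBlock()
blk.add("a-rather-long-file-name.txt", ino=42, ftype="file")   # spans 4 name slots
```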

Segment v/s Sections

Data is organized into segments, each a 2 MB contiguous block group, serving as the primary management unit.

A section is composed of segments and is a unit for garbage collection and cleaning. It functions similarly to segments in LFS. In essence, a one-to-one mapping between segments and sections implements an LFS-like structure.

Logging

Multi-head Logging

SSDs, unlike HDDs, eliminate the mechanical head, enabling parallel writes. Consequently, multiple segments can be written concurrently, with sequential writes occurring within each segment.

F2FS enhances data management by classifying segments into three categories: hot, warm, and cold, expanding upon LFS's two-category approach.

This classification impacts node and data blocks as follows:

  • Node Blocks:
    • Direct node blocks for directories are classified as hot.
    • Direct node blocks for files are classified as warm.
    • Indirect blocks are classified as cold.
  • Data Blocks:
    • Directory entry blocks are classified as hot.
    • Newly created data blocks are classified as warm, analogous to the hot category in LFS.
    • Data blocks moved during cleaning are classified as cold, analogous to the cold category in LFS.

The log, along with its corresponding segments, is partitioned into six distinct sections, each managing the log for one of the above categories. To maximize parallelism, F2FS maintains six open segments concurrently, one from each category (multi-head logging).

F2FS maps each of these six partitions to separate zones. While the paper lacks specific details, these zones likely correspond to distinct planes within the SSD. This zoning allows the log flash block to maintain set associativity, ensuring that its blocks map to a single zone.
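
A small sketch of the six-way classification and the per-class open segments (the block dictionaries and their flags are invented for illustration):

```python
# Sketch of the six-way hot/warm/cold classification described above.
def classify(block):
    if block["kind"] == "node":
        if block.get("indirect"):
            return ("node", "cold")
        return ("node", "hot" if block.get("for_directory") else "warm")
    # data block
    if block.get("moved_by_cleaning"):
        return ("data", "cold")
    return ("data", "hot" if block.get("is_dentry_block") else "warm")

# One open segment (log head) per class; the six logs can be appended to in
# parallel by the device.
open_segments = {(kind, temp): []
                 for kind in ("node", "data")
                 for temp in ("hot", "warm", "cold")}

def append_block(block):
    open_segments[classify(block)].append(block)

append_block({"kind": "data", "is_dentry_block": True})   # -> ("data", "hot")
append_block({"kind": "node", "for_directory": False})    # -> ("node", "warm")
```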

Adaptive Logging

When the file system exceeds 95% capacity, F2FS switches to threaded logging, writing directly into the invalid blocks of existing dirty segments instead of waiting for cleaning to free whole sections. Since all blocks are uniform in size, external fragmentation is avoided.

Threaded logging issues random writes that overwrite logical blocks; the FTL absorbs these overwrites by remapping them to fresh physical blocks. In effect, F2FS trades some write performance for avoiding the heavy cost of cleaning near full capacity, a trade-off the authors deemed beneficial.

Cleaning

Section cleaning in F2FS is divided into foreground and background processes, each following these steps (a victim-selection sketch follows the list):

  • Victim Selection:
    • Foreground cleaning employs a greedy strategy, selecting the section with the fewest valid blocks. The number of valid blocks is obtained from the SIT.
    • Background cleaning uses a cost-benefit curve, choosing sections based on the number of valid blocks and their age. This information also comes from the SIT.
  • Valid Block Identification and Movement:
    • The SIT is used to locate all valid blocks.
    • The SSA is then utilized to identify valid node blocks associated with the data blocks. The SSA ensures that when a data block is cleaned, the corresponding node blocks are updated to reflect the data block's new location.
    • The updated locations of the node blocks are buffered into the NAT and SIT journals.
    • Lazy Migration: Flushing valid blocks doesn't immediately write to new sections. Instead, they are buffered into the page cache, organized into the disk pages of the new section, and marked as dirty. The operating system then flushes the entire new section to disk.
  • Mark Victim as Pre-free:
    • The victim section is not immediately freed. It is marked as pre-free and remains unavailable until a checkpoint is written.
    • The checkpoint ensures that all updates to the NAT and SIT are committed (in order to honor fsync). Without this, the NAT and SIT might still reference outdated locations. Therefore, the section is only freed after the checkpoint completes.
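
The two victim-selection policies can be sketched as follows; the section records mimic SIT information, and the cost-benefit score uses the classic LFS-style formula rather than F2FS's exact implementation.

```python
# Sketch of the two victim-selection policies described above,
# using SIT-style per-section information (valid block count, age).
BLOCKS_PER_SECTION = 512

def greedy_victim(sections):
    # Foreground cleaning: fewest valid blocks = cheapest to migrate right now.
    return min(sections, key=lambda s: s["valid_blocks"])

def cost_benefit_victim(sections):
    # Background cleaning: prefer sections that are both mostly invalid and old
    # (classic LFS-style cost-benefit score; the exact formula is illustrative).
    def score(s):
        u = s["valid_blocks"] / BLOCKS_PER_SECTION      # utilization
        return (1 - u) * s["age"] / (1 + u)
    return max(sections, key=score)

sections = [
    {"id": 0, "valid_blocks": 500, "age": 10},
    {"id": 1, "valid_blocks": 40,  "age": 3},
    {"id": 2, "valid_blocks": 120, "age": 90},
]
print(greedy_victim(sections)["id"], cost_benefit_victim(sections)["id"])
```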

Checkpointing and Recovery

It's crucial to reiterate that fsync only guarantees the persistence of data blocks and node blocks. Updates to the NAT and SIT are recorded as journal entries within a buffer, not written directly to disk.

A checkpoint operation then performs the following:

  • All dirty node and dentry blocks residing in the page cache are flushed to disk. I assume that all dirty file data blocks are also flushed at this stage.
  • Metadata modifications are temporarily suspended to ensure consistency.
  • The NAT and SIT are written to dedicated, non-active regions on the disk. SSA is also written.
  • The Checkpoint Pack (CP) region is written, containing:
    • NAT and SIT bitmaps, along with any buffered journal entries.
    • The location of the SSA blocks.
    • Orphaned inodes: Inodes for files that were deleted while still open (and therefore have no directory entry), requiring later cleanup. This can occur, for example, when a file is opened by multiple processes and deleted by one before the others close it.
The validity of a checkpoint is determined by examining its header and footer regions. These regions contain information that indicates if the checkpoint was completed successfully. 
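
A toy sketch of this checkpoint sequence, with a plain dictionary standing in for the on-disk state and the two checkpoint pack locations (structure and field names are mine):

```python
# Toy sketch of the checkpoint sequence above; the state dictionary and its
# fields are my own stand-ins for the on-disk structures.
state = {
    "dirty_pages": ["node blk", "dentry blk", "data blk"],
    "nat": {42: 500}, "sit": {}, "ssa": {},
    "journal": [("nat", 42, 500)],       # buffered NAT/SIT journal entries
    "orphans": [],
    "cp_packs": [None, None],            # two checkpoint pack locations
    "active_pack": 0,
    "version": 4,
}

def write_checkpoint(state):
    state["dirty_pages"].clear()                 # flush dirty node/dentry/data pages
    inactive = 1 - state["active_pack"]          # write into the non-active copy
    state["cp_packs"][inactive] = {
        "version": state["version"] + 1,
        "nat": dict(state["nat"]),               # NAT/SIT written to their
        "sit": dict(state["sit"]),               # non-active on-disk regions
        "ssa_location": "address of SSA blocks",
        "journal": list(state["journal"]),       # any remaining journal entries
        "orphan_inodes": list(state["orphans"]),
        "valid": True,                           # consistent header/footer marker
    }
    state["active_pack"] = inactive
    state["version"] += 1
    state["journal"].clear()

write_checkpoint(state)
print(state["version"], state["active_pack"])    # 5 1
```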

Roll-back Recovery

Roll-back recovery proceeds as follows:

  • The most recent valid checkpoint region is located.
  • The active NAT and SIT are retrieved from this checkpoint.
  • Any journal entries are applied to the retrieved NAT and SIT.
  • Orphaned inodes and their associated data blocks are deleted.
  • A new checkpoint is created to finalize the recovery process.

Roll-forward Recovery

Roll-back recovery alone is insufficient for complete data integrity. While it establishes a consistent checkpoint state, it fails to account for writes committed via fsync that occurred after that checkpoint. This is because the journaled and buffered updates to the NAT and SIT are only committed during checkpoints, leaving fsync writes potentially unrecorded in the restored metadata.

Therefore, a roll-forward recovery process is required to recover these post-checkpoint fsync writes. The objective is to identify all node blocks written after the established checkpoint, as the corresponding data blocks can be recovered from them.

To facilitate this identification, each node block contains a special marker, typically the checkpoint number. By scanning the log for node blocks with a marker value greater than the checkpoint number, the system can identify and recover the necessary post-checkpoint writes.
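
A minimal sketch of that scan, assuming each logged node block carries its checkpoint version (the data layout here is invented for illustration):

```python
# Sketch of roll-forward recovery: scan node blocks written after the last
# checkpoint (identified by their checkpoint-version marker) and reapply them.
def roll_forward(log, nat, last_checkpoint_version):
    for addr, node_block in log:                    # walk the node log in order
        if node_block["cp_version"] > last_checkpoint_version:
            # This node was fsync'ed after the checkpoint: point the NAT at it
            # so its data blocks become reachable again.
            nat[node_block["node_id"]] = addr

nat = {7: 100}
log = [
    (100, {"node_id": 7, "cp_version": 4}),
    (180, {"node_id": 7, "cp_version": 5}),   # fsync'ed after checkpoint 4
]
roll_forward(log, nat, last_checkpoint_version=4)
print(nat)   # {7: 180}
```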

Evaluation

This study compares the performance of four file systems: F2FS, EXT4 (journaling), Btrfs (copy-on-write), and NILFS2 (an LFS implementation), across mobile and server workloads.

Btrfs, a modern Linux file system, utilizes a Copy-on-Write (COW) mechanism, enhancing data integrity through snapshot capabilities. 

TRIM, a command that lets the operating system tell the SSD which logical blocks are no longer in use so the FTL can reclaim them, plays a role in the evaluation.

Mobile Systems:

  • iozone:
    • F2FS outperformed EXT4 by 3.1x due to its ability to convert random writes to sequential writes.
    • F2FS surpassed NILFS2 by avoiding synchronous NAT and SIT writes.
    • Btrfs also outperformed EXT4 and NILFS2, leveraging sequential writes during COW, but it did not match F2FS's performance.
    • Read and sequential write performance was consistent across all file systems.
  • SQLite:
    • F2FS excelled due to its log-structured nature, ideal for SQLite's sequential WAL log writes.
    • Btrfs exhibited the poorest performance due to significant copy overhead.

Server Systems (SATA and PCIe):

  • SATA:
    • Sequential read/write performance was consistent (e.g., videoserver).
    • EXT4's numerous small TRIM commands hindered performance, while F2FS's section-based cleanup was more efficient (fileserver).
    • Log-structured file systems (F2FS, NILFS2) benefited varmail, which issues frequent sync operations.
  • PCIe:
    • Performance mirrored SATA results, except in fileserver, where concurrent buffered writes normalized performance across file systems.

TRIM Behavior (Figure 5 Analysis):

  • Figure 5 was hard to read due to its small plots and magnified sections. It showed writes as blue crosses, reads as red circles, and discards as gray squares.
  • F2FS and NILFS2 demonstrated increasing logical block addresses with each write.
  • EXT4 and Btrfs exhibited higher TRIM command frequencies.

Multi-Head Logging (Figure 6 Analysis):

  • With 6 heads, F2FS achieved a bimodal distribution of segment utilization, which improves cleaning efficiency (note that Figure 6 plots a cumulative distribution).
  • F2FS requires at least 2 heads (one for data and one for nodes), so a 1-head configuration was not evaluated.

Adaptive Logging:

  • F2FS demonstrated superior random write performance at 94% device fill.
  • F2FS's adaptive logging helped maintain performance parity with EXT4 at 100% device fill.

Paper Review

One important thing that I learned from this paper is that there is always a tussle between the hardware and software communities. The hardware community keeps innovating, and the software is forced to adapt to the current hardware trends (they kind of own the supply chain, after all). I am pretty sure that several future works on computer systems will be based on the ongoing hardware innovations. Hopefully, the quantum era hits soon and provides a chance for us computer systems engineers to build software for them.

Samsung owns both the hardware and software sides of its business. Its engineers already understand the nitty-gritty details of SSDs, so the company was well-positioned to come up with a great file system for its storage.

The paper provides a good introduction to SSDs, and I would recommend reading it. It is pretty easy to read through the contents.
