
Paper Insights #8 - The Google File System

Having explored a range of file systems in previous discussions, we now turn our attention to their distributed counterparts. Specifically, we will examine the Google File System (GFS). This paper was written by Sanjay Ghemawat and his colleagues. Sanjay Ghemawat is a Google Fellow and a highly respected engineer. His contributions extend far beyond GFS, encompassing core libraries and systems such as MapReduce, making him an indispensable part of Google's technological foundation. It is difficult to envision the current state of Google's infrastructure without his significant contributions.

The paper, presented in SOSP '03, marked the introduction of a system that has since evolved into the critical infrastructure component known as Colossus. This system supports a vast array of Google's operations and is of immense importance to the company's overall functionality. It is crucial to acknowledge that the architecture of Colossus has undergone significant transformations since its initial conception, developing into a highly sophisticated and robust platform.

Despite its current complexity, the initial design of its predecessor, GFS, prioritized simplicity in its large-scale distributed architecture. This focus on simplification was deliberate, aimed at achieving performance and efficient operations. By making specific simplifying assumptions, the system was able to accelerate its processes and handle the demands of Google's growing data needs.

Paper Link

Extended Read: Colossus under the hood: a peek into Google’s scalable storage system

Recommended Read: Paper Insights - The Design and Implementation of a Log-Structured File System where I introduced several basic file system concepts.

Let's begin!

Network File System

Sun Microsystems pioneered the concept of network file systems with their influential paper - Design and Implementation of Sun Network Filesystem (NFS).

At its core, NFS enables access to files across a network. A server maintains the physical storage, and clients interact with it via Remote Procedure Calls (RPC).

From the client's perspective, network file system operations are designed to be seamless. In Linux environments, for instance, NFS implements the virtual file system (VFS) interface, allowing remote file systems to be mounted without the user necessarily being aware of their network-based nature.


NFS was widely adopted for file sharing in the past. However, with the emergence of distributed file systems like GFS and HDFS (Hadoop Distributed File System), its prevalence has diminished for large-scale workloads.

Network File System vs. Distributed File System

NFS centralizes file storage on a single server. Client requests are routed over the network to this server. This creates a shared environment. However, it is not a distributed one offering fault tolerance. Consequently, if the server fails, the entire file system becomes inaccessible.

In contrast, a distributed file system distributes the file system itself across multiple servers. Data is replicated and spread across these servers, providing inherent fault tolerance. If one server fails, the system continues to operate using the replicated data on other servers.

The Google File System

Disclaimer: In this article, the discussion centers on the original GFS and not its successor, Colossus. It's important to note that the original GFS paper refers to the metadata server as the master. In Colossus, this component has been redesigned and is now known as the curator, leveraging Bigtable for persistent storage. I acknowledge that the terminology used in the original paper may be considered insensitive and apologize for any discomfort it may cause.

GFS is built on several key principles. Before diving in, let's highlight two: immutability and simplicity.

Immutability

Echoing a concept found in log-structured file systems, GFS operates on an append-only model - data, once written, whether successfully or not, cannot be modified. Files accept only append-writes, and only while they are not frozen (made immutable).

Important Note: While the paper mentions file overwrites, GFS has evolved to strictly enforce append-only writes. This article likewise focuses only on appends, so all mutations discussed here correspond to append-writes.

Simplicity

GFS stands out as a remarkably simple large-scale distributed system, making it an ideal introductory topic in any distributed system course.

This simplicity stems from deliberate design choices, such as -

  • Consistency over Availability - GFS enforces a single master instance, sacrificing availability for consistency. Even then, GFS only offers eventual consistency, which greatly simplifies the design.
  • No Atomicity - GFS employs a straightforward replication strategy during append-writes - data is synchronously replicated across all replicas. If any (but not all) replica fails to commit the append, it results in an "inconsistent" (not fully committed) state of the corresponding byte range. This inconsistent state is permanent and needs to be handled by applications.
  • No Isolation - GFS readers may be able to read uncommitted data while some append-write is still in progress. GFS does not provide isolation guarantees, and therefore cannot be externally consistent.

These choices contrast with systems like ZooKeeper and Spanner, which do provide high availability, atomicity, and external consistency. However, the choices allowed for a much simpler implementation.

Important Note: It is worth noting that Colossus has evolved significantly. Unlike the original GFS, modern Colossus achieves high availability and supports several replication schemes.

File Structure

GFS organizes files into chunks and sub-chunks:

  • Chunks: Files are segmented into 64MB chunks. Each chunk is replicated across multiple chunk servers (described later) for fault tolerance. Chunk servers store these chunks within their local Linux file systems, potentially as individual files on disk.
  • Sub-Chunks: Each 64MB chunk is further divided into 64KB sub-chunks. Checksums are calculated and stored at the sub-chunk level. These checksums are crucial for detecting data corruption.
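
To make the layout concrete, here is a small sketch (a hypothetical helper, not GFS's actual API) that maps a byte offset within a file to its chunk and sub-chunk, using the 64 MB and 64 KB sizes above:

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks
SUB_CHUNK_SIZE = 64 * 1024      # 64 KB sub-chunks (checksum granularity)

def locate(offset: int) -> tuple[int, int, int]:
    """Map a byte offset within a file to (chunk index, offset within the chunk, sub-chunk index)."""
    chunk_index = offset // CHUNK_SIZE
    chunk_offset = offset % CHUNK_SIZE
    sub_chunk_index = chunk_offset // SUB_CHUNK_SIZE
    return chunk_index, chunk_offset, sub_chunk_index

# Byte 200,000,000 of a file lands in chunk 2, at sub-chunk 1003 within that chunk.
print(locate(200_000_000))      # (2, 65782272, 1003)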

Choice of Chunk Size

Chunk size, a configurable parameter, was set to 64 MB based on workload analysis. This decision reflects a trade-off:

  • Metadata Efficiency: Larger chunks reduce metadata operations during reads and mutations, and minimize master metadata storage requirements.
  • Hotspot Potential: Conversely, larger chunks concentrate data on fewer servers, increasing the risk of hotspots, particularly during tail reads.

Choice of Replication Factor

The replication factor significantly impacts performance. A low replication factor (e.g., 3) may limit read scalability under heavy workloads. Conversely, a high replication factor (5 or more) can slow down mutations and significantly impact tail latency.

Architecture

GFS employs a multi-tiered architecture.

Global Setup

GFS deployments consist of independent clusters distributed globally. These clusters operate autonomously, with no inter-cluster communication. Therefore, the core distributed aspects of GFS are primarily confined to the cluster level. The remainder of this article will focus on the internal workings of a single GFS cluster.

Cluster Setup

Like any file system, GFS manages both file data (chunks) and metadata. In traditional file systems, metadata is often stored in inodes. GFS adopts a similar concept but distributes it across two key components:

  1. Master: This server manages the file metadata, including file attributes and chunk location information.
  2. Chunk Servers: These servers store the raw file data, divided into fixed-size chunks.

The Master

The master manages the file metadata, ensuring the integrity and availability of the file system.

Namespace Management

The master stores the entire directory structure. Internally, GFS represents the namespace as a flat key-value store and not as a traditional tree. For example, if there is a file /a/b/file.txt, then the keys would be /a, /a/b, and /a/b/file.txt.

Values associated with these keys include:
  • File/directory metadata (attributes).
  • Read/write locks. Acquiring locks on a file requires read locking all parent directories.
For files, the value also contains information about its chunks. For each chunk, it has:
  • The chunk sequence number.
  • The latest version number (described later) for the chunk.
  • Addresses of the chunk servers holding the latest version of the chunk.
Bonus: In Colossus, Bigtable is used as the backing store for the key-value store (which itself uses Colossus).
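
As a rough illustration (hypothetical field names, not the paper's actual schema), the flat namespace for /a/b/file.txt might look like this:

# Hypothetical shape of the master's flat namespace: every full path is a key.
namespace = {
    "/a":            {"type": "dir",  "attrs": {}},
    "/a/b":          {"type": "dir",  "attrs": {}},
    "/a/b/file.txt": {"type": "file", "attrs": {},
                      "chunks": [
                          # sequence number, latest version, replicas holding that version
                          {"seq": 0, "version": 7,
                           "replicas": ["cs-11:7000", "cs-42:7000", "cs-87:7000"]},
                          {"seq": 1, "version": 3,
                           "replicas": ["cs-05:7000", "cs-19:7000", "cs-63:7000"]},
                      ]},
}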

Replication and Logging

Master state is replicated for reliability using a distributed log, which is periodically flushed to disk. Only one active master handles mutations and all background tasks, and it is therefore responsible for appending all metadata modifications to the log. Shadow masters serve read requests by replaying the logs that the active master generates.

The distributed log also provides a mechanism for global serialization, for example, when there are multiple modifications to the same file.

Example: Suppose there are two requests: one to create a new file /a/b/file.txt and another to rename it to /a/b/file2.txt. The distributed log would look something like:

create /a/b/file.txt
rename /a/b/file.txt -> /a/b/file2.txt

While executing the first instruction, the master holds read locks on /a and /a/b and a write lock on /a/b/file.txt, releasing them once the operation completes. Locks allow the next instruction to execute concurrently with the previous one as long as their lock sets do not conflict. In this example, however, the locks for the second mutation can only be acquired after the first mutation completes.
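
Below is a small sketch of how such a lock set could be computed (hypothetical helper; it assumes a rename write-locks both the old and the new name):

def ancestors(path):
    """Ancestor directory keys for a path, e.g. /a/b/file.txt -> ['/a', '/a/b']."""
    parts = path.strip("/").split("/")
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]

def lock_plan(*targets):
    """Read locks on every ancestor directory, write locks on the target paths themselves."""
    plan = {}
    for target in targets:
        for p in ancestors(target):
            plan.setdefault(p, "read")
        plan[target] = "write"
    return plan

print(lock_plan("/a/b/file.txt"))                       # create
print(lock_plan("/a/b/file.txt", "/a/b/file2.txt"))     # rename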

Note: Chunk information is not materialized to the logs. It is dynamically rebuilt upon master startup by polling the chunk servers and through heartbeat messages (described later).

Checkpointing

In addition to the distributed log, the master's state is periodically snapshotted (serialized B-tree data structure) and stored on disk. During recovery, snapshots are loaded, and subsequent log entries are applied.

Recommended Read: Paper Insights - ZooKeeper: Wait-free coordination for Internet-scale systems where I discuss how applying updates on snapshots works.

Failure Recovery

Upon primary master failure, an external monitoring system initiates a new primary master. The new primary recovers by reading the logs, catching up to the latest state, and then polling chunk servers to rebuild its chunk location information.

GFS guarantees that there is one and only one active master at a time, sacrificing availability during the changeover.

Bonus: Colossus replaces the GFS master with Curator, a sharded system offering higher availability.

The Chunk Servers

Each chunk has a designated primary chunk server. All other replicas for the chunk are secondaries. The primary orders all mutations to that chunk, ensuring consistency across secondary replicas.

The master maintains a dynamic view of chunk server availability through periodic heartbeat messages. Additionally, upon startup, the master server reconstructs chunk location information by polling all chunk servers. 

Chunk Placement Policy

Data centers often organize machines into racks, each with a shared network switch (Top-of-the-Rack switch). Failure of a rack switch can render all chunk servers within that rack unavailable.

To mitigate rack-level failures, GFS strategically places replicas of the same chunk on chunk servers residing in different racks. This ensures higher fault tolerance against correlated failures, such as network switch failures.
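
A minimal sketch of a rack-aware placement policy (hypothetical; the real policy also weighs disk utilization and recent creation activity):

import random
from collections import defaultdict

def place_replicas(servers, replication=3):
    """Pick `replication` chunk servers for a new chunk, spreading them across racks.
    `servers` maps chunk server name -> rack name."""
    by_rack = defaultdict(list)
    for server, rack in servers.items():
        by_rack[rack].append(server)
    racks = list(by_rack)
    random.shuffle(racks)
    chosen = []
    # Round-robin over racks so that replicas land on distinct racks whenever possible.
    while len(chosen) < replication and any(by_rack.values()):
        for rack in racks:
            if by_rack[rack] and len(chosen) < replication:
                chosen.append(by_rack[rack].pop())
    return chosen

servers = {"cs-1": "rack-a", "cs-2": "rack-a", "cs-3": "rack-b",
           "cs-4": "rack-b", "cs-5": "rack-c"}
print(place_replicas(servers))   # e.g. ['cs-2', 'cs-4', 'cs-5'] - one per rack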

Chunk Version Number

For each chunk, the chunk servers track the chunk version number. The master increments a chunk's version number whenever it grants a new lease on the chunk, and the up-to-date replicas record the new version.

The heartbeat messages also contain the version number of all the chunks on a chunk server. This version number is used by the master to distinguish between stale and up-to-date replicas for each chunk.
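
A small sketch of how the master might classify replicas from heartbeat-reported versions (hypothetical helper names):

def classify_replicas(master_version, reported):
    """Split a chunk's replicas into up-to-date and stale using the version number.
    `reported` maps chunk server -> version it announced in its last heartbeat."""
    result = {"up_to_date": [], "stale": []}
    for server, version in reported.items():
        key = "up_to_date" if version >= master_version else "stale"
        result[key].append(server)
    return result

# The master believes version 8; cs-3 was offline during the last version bump.
print(classify_replicas(8, {"cs-1": 8, "cs-2": 8, "cs-3": 7}))
# {'up_to_date': ['cs-1', 'cs-2'], 'stale': ['cs-3']}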

Consistency Model

This and the next section are the core of the GFS design discussion and so it is very important to understand them. The rest of the system's design builds upon these foundations.

Metadata Consistency

Metadata operations are atomic. This atomicity is ensured by the single master server, which handles all metadata operations sequentially.

Chunk Consistency

A range of bytes in a chunk can be:
  • Consistent: All replicas of the chunk contain the same data for the byte range.
  • Inconsistent: Different replicas of the chunk contain different data for the byte range.
If a mutation to a chunk succeeds, the byte range achieves a consistent state across all replicas.

However, a failed mutation, for instance, if the data was appended to a subset of replicas (at least one), can result in an inconsistent byte range within the chunk. Such mutations are not fully committed from the client's perspective. Note that once an append-write to a byte range fails on any (but not all) replica, that byte range is permanently inconsistent. The client's next retry will append to the subsequent byte range.

Readers may perform relaxed reads from a single replica, where uncommitted data might be visible. GFS does not guarantee external consistency, acknowledging the possibility of such inconsistencies. Clients are expected to handle such inconsistencies.

GFS provides a specific, albeit limited, guarantee regarding mutations: at least once atomic-append. This guarantee means that if a client repeatedly attempts to append data to a file, at least one of those append operations will eventually succeed and be applied atomically.

Leases and Mutation Order

As previously discussed, each chunk has a designated primary replica, which is granted a lease by the GFS master. This primary server is the central coordinator for all mutations to the chunk, responsible for establishing a total order over these operations.

The lease has a default timeout of 60 seconds. To maintain its lease, the primary can request extensions, which can be piggybacked into the regular heartbeat messages exchanged with the master.

To enforce a total order, the primary assigns a serial number to each mutation. The serial number is a unique, monotonically increasing sequence number generated by the primary. It provides a consistent ordering across all mutations applied to the chunk, guaranteeing that all replicas converge on the same final state.
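
The following sketch (hypothetical classes, not GFS code) shows the essence of serial-number ordering: the primary stamps mutations, and every replica applies them in stamp order, so all replicas end up identical:

import itertools

class Primary:
    """The primary stamps each mutation with a serial number (hypothetical sketch)."""
    def __init__(self):
        self._serial = itertools.count(1)

    def order(self, mutation):
        return next(self._serial), mutation

class Secondary:
    """Applies mutations strictly in serial-number order, buffering out-of-order arrivals."""
    def __init__(self):
        self.next_expected = 1
        self.pending = {}
        self.applied = []

    def receive(self, serial, mutation):
        self.pending[serial] = mutation
        while self.next_expected in self.pending:
            self.applied.append(self.pending.pop(self.next_expected))
            self.next_expected += 1

primary, secondary = Primary(), Secondary()
m1, m2 = primary.order("append A"), primary.order("append B")
secondary.receive(*m2)       # arrives out of order; buffered until its turn
secondary.receive(*m1)
print(secondary.applied)     # ['append A', 'append B'] - the same order on every replica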

Fault Tolerance for Chunk Servers

While we've established the master's single-instance availability through external monitoring, the availability of chunk servers presents a different set of challenges. Chunk servers can experience various failures, including -

  • Crashes, leading to redundancy loss for all the chunks stored on the server.
  • Data corruption for specific chunks, for example, when disks fail.
These incidents render affected chunks temporarily unavailable for mutations, causing operations to fail continuously until recovery. The master server responds to these scenarios with specific actions for each affected chunk.

Primary Chunk Server Failure

When a primary chunk server fails, the master must select a new primary. A new primary is elected as follows:

  • Upon primary failure, the master queries remaining replicas for their chunk version numbers.
  • Among the replicas with the latest version (up-to-date replicas), the master selects a new primary.
  • The chunk version number is then incremented, and this updated version number is passed to all up-to-date replicas.

The chunk becomes available for append-writes only after all the steps above are complete.
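
A compact sketch of these steps (hypothetical helper; the real master also updates its metadata and informs the replicas):

def elect_new_primary(reported_versions):
    """Sketch of the master's steps after a primary chunk server fails: keep only
    replicas at the latest version, pick one as the new primary, bump the version."""
    latest = max(reported_versions.values())
    up_to_date = sorted(s for s, v in reported_versions.items() if v == latest)
    return {"primary": up_to_date[0],
            "up_to_date_replicas": up_to_date,
            "new_version": latest + 1}

# The old primary cs-1 has failed; cs-3 missed the last version bump and is stale.
print(elect_new_primary({"cs-2": 8, "cs-3": 7}))
# {'primary': 'cs-2', 'up_to_date_replicas': ['cs-2'], 'new_version': 9}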

Secondary Chunk Server Failure

When a secondary chunk server fails, the master assigns the affected chunks to a new chunk server. This triggers a re-replication process (described next), where the new server fetches the current chunk state from other replicas.

After the chunk state is fetched by the new chunk server, the chunk becomes available again for mutations.

Re-replication and Rebalancing

Re-replication is the process of restoring lost replicas and is crucial for maintaining data redundancy. When a new chunk server is assigned responsibility for previously unowned chunks, it retrieves the latest chunk data from existing replicas.

Re-replication is subject to resource constraints:

  • Bandwidth limitations: There's a limit on the bandwidth a chunk server can dedicate to re-replication.
  • Prioritization: Re-replication prioritizes chunks based on importance:
    • Chunks actively requested by clients take precedence, to restore mutation availability quickly.
    • Chunks that have lost more replicas are prioritized over others.
Similar to replica restoration, chunks are also moved across chunk servers to balance load and optimize disk space.

Stale Replica Removal

Stale replicas arise when a chunk server misses mutations, for instance because it has been offline for an extended period. While the server is offline, the master may reassign the chunk to another chunk server, and subsequent mutations proceed without the original server.

Consequently, the returning server's replica will lack the current chunk version number. The master leverages this version number to identify and remove these outdated replicas.

Reading a File

Reading a file in GFS is straightforward:

  • The client first obtains chunk information, including the latest version number, from the master and caches it.
  • Then, using the replica locations, the client selects the closest chunk server for data retrieval. Closeness is determined by IP address proximity, leveraging the cluster's network topology to minimize latency. A minimal sketch of this flow follows below.
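
Here is a minimal, self-contained sketch of the read path (hypothetical in-memory stand-ins for the master and chunk servers):

CHUNK_SIZE = 64 * 1024 * 1024

# Hypothetical in-memory stand-ins for the master's metadata and the chunk servers.
chunk_table = {("/logs/events", 0): {"version": 5, "replicas": ["cs-1", "cs-2", "cs-3"]}}
chunk_data = {(r, "/logs/events", 0): b"hello world" for r in ["cs-1", "cs-2", "cs-3"]}

def read(path, offset, length):
    chunk_index = offset // CHUNK_SIZE
    info = chunk_table[(path, chunk_index)]          # step 1: ask the master (then cache)
    replica = info["replicas"][0]                    # step 2: pick the "closest" replica
    data = chunk_data[(replica, path, chunk_index)]  # fetch from that chunk server
    start = offset % CHUNK_SIZE
    return data[start:start + length]

print(read("/logs/events", 6, 5))   # b'world'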

Read Failure Scenarios

During read operations, two potential issues can arise: 

  • Stale Replicas: A replica might be stale, meaning it lacks the most recent mutations.
  • Inconsistent Byte Range: Alternatively, it might contain a mutation that is either partially committed or failed completely, resulting in an inconsistent byte range.

Stale replicas are relatively easy to handle. When the client refreshes metadata from the master, it will obtain the locations of up-to-date replicas and retrieve the data.

However, handling uncommitted mutations is more complex. Readers must employ application-level mechanisms like checksums to validate byte ranges and detect duplicates (which can occur if a writer retries a failed append-write). Data record formats, such as Record I/O, can automate these checks and handle these situations without requiring manual intervention from the user.
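
As an illustration of what such an application-level record format might look like (a hypothetical layout, not GFS's actual Record I/O), each record below carries an id, a length, and a checksum, letting the reader skip garbage and drop duplicates:

import struct
import zlib

HEADER = struct.Struct("<QII")   # record id, payload length, crc32(payload)

def encode_record(record_id, payload):
    return HEADER.pack(record_id, len(payload), zlib.crc32(payload)) + payload

def decode_records(raw):
    """Yield valid records, skipping inconsistent byte ranges and duplicate ids."""
    seen, pos = set(), 0
    while pos + HEADER.size <= len(raw):
        record_id, length, crc = HEADER.unpack_from(raw, pos)
        payload = raw[pos + HEADER.size: pos + HEADER.size + length]
        if len(payload) == length and zlib.crc32(payload) == crc:
            pos += HEADER.size + length
            if record_id not in seen:            # duplicate from a retried append-write
                seen.add(record_id)
                yield record_id, payload
        else:
            pos += 1                             # garbage: resynchronize byte by byte

good = encode_record(1, b"order-123")
junk = b"\xde\xad\xbe\xef"                       # bytes from a failed, inconsistent append
dup = good                                       # the writer retried the same append
print(list(decode_records(good + junk + dup + encode_record(2, b"order-456"))))
# [(1, b'order-123'), (2, b'order-456')]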

Writing a File

Writing to a file in GFS is a multi-stage operation. An append-write can target a single chunk or span multiple chunks, with the process repeating for each affected chunk.

Chunk Replica Identification (Steps 1 - 2)

The client first requests the primary and secondary replica locations for the target chunk from the master server. The master, upon receiving this request, checks for an existing primary. If none exists (for instance, when it is a new chunk), it grants a lease to one of the replicas before responding to the client.

Data Upload (Step 3)

The client then uploads the data to all chunk server replicas of the chunk. This step ensures all replicas have the data before the commit phase begins.

Notably, the client uploads the data only once. The data is replicated in a linear chain across the replicas to optimize bandwidth utilization. Each chunk server in the chain forwards the data to the next closest server based on network topology, likely determined by IP address prefixes, allowing for hop count inference.

Time to Upload

Let's analyze the time required to upload data in GFS, considering factors like bandwidth, latency, and replication. 

Assume B bytes are uploaded, T is the available bandwidth in the cluster fabric, L is the latency between any two chunk servers, and R is the replication factor.

Time to establish connections along the replication chain = R * L
Time to transfer the B bytes = B / T
Therefore, the total time is approximated by B/T + R*L.

It's important to note that this calculation is a simplification. It doesn't account for network congestion, which can significantly impact actual bandwidth utilization. In real-world scenarios, the effective bandwidth may be lower than the theoretical maximum.

For example, with a 100 Mbps bandwidth and a 1 ms latency, uploading 1 MB of data would ideally take 80 ms. However, network congestion and other factors could increase this time.
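
As a quick sanity check of the formula, here is the arithmetic for this example (assuming a replication factor of 3 and decimal megabytes):

# Plugging the numbers from the example into the formula above: total ~ B/T + R*L.
B = 8_000_000          # 1 MB of data, in bits
T = 100_000_000        # 100 Mbps of available bandwidth, in bits per second
R, L = 3, 0.001        # replication factor 3, 1 ms latency per hop
total = B / T + R * L
print(f"{total * 1000:.0f} ms")   # 83 ms: 80 ms of transfer plus 3 ms to set up the chain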

LRU Buffer Cache

The received data is temporarily stored in an LRU buffer cache on each replica, not yet committed to the chunk. These cached entries are eventually evicted after a timeout if the mutation is not finalized.

Commit (Step 4 - 7)

The client sends a commit request to the primary replica. The primary assigns a serial number to the mutation and commits the bytes. It is important for the primary to commit the bytes first because the primary must maintain the latest state of the chunk.

Then the primary forwards the request along with serial number to the secondary replicas. Each secondary replica independently applies the mutation to its copy of the chunk and sends a confirmation back to the primary. Once the primary receives acknowledgments from all secondaries, it sends a final acknowledgment back to the client.
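
The commit flow can be summarized with a small sketch (hypothetical classes; error handling and the LRU cache are omitted):

class SecondaryReplica:
    def __init__(self):
        self.chunk = bytearray()

    def apply(self, serial, data):
        self.chunk += data          # apply in the primary's serial order, then acknowledge
        return True

class PrimaryReplica:
    """Hypothetical commit flow (steps 4-7): the client asks the primary to commit data
    that every replica has already buffered during the upload phase (step 3)."""
    def __init__(self, secondaries):
        self.secondaries = secondaries
        self.serial = 0
        self.chunk = bytearray()

    def commit(self, data):
        self.serial += 1
        self.chunk += data                                       # primary applies first
        acks = [s.apply(self.serial, data) for s in self.secondaries]
        return "ok" if all(acks) else "failed"                   # client retries on failure

primary = PrimaryReplica([SecondaryReplica(), SecondaryReplica()])
print(primary.commit(b"record-1\n"))   # 'ok' once every secondary has acknowledged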

Write Failure Scenarios

A failed append-write may result in an inconsistent state for the byte range to which the append-write was supposed to be committed.

What if some secondary replica fails?

If a secondary replica fails during the data upload, the entire mutation can be retried by the client. All good, no inconsistency till this point.

If it fails after receiving a mutation request from the primary, the result is an inconsistent state for the byte range. Readers of that byte range may see the appended bytes (if they contact another secondary that appended the bytes successfully) or see garbage. Eventually, the failure will be detected by the master, the chunk will be assigned to a new chunk server, and re-replication will be performed. Only then will mutations start succeeding again.

What if the primary replica fails?

If the primary fails at any point before forwarding the request to secondaries, the entire process can be simply retried by the client. All good, no inconsistency till this point.

If the primary fails after sending the request to secondaries, there will be an inconsistent byte range. The next successful retry (after primary re-assignment) will append bytes to a new range.

What if the client is interacting with an old primary while a new primary has been assigned?

A clock drift between the primary and the master can create a scenario where the primary believes its lease is still valid while it has actually expired.

If a mutation is received by an old primary, the old primary will not be able to commit it on any of the secondary replicas because the chunk version number will already have been incremented. Eventually, the old primary's replica will be detected as stale and removed.

Miscellaneous Details

Client Library

While Linux local file systems adhere to the VFS interface, GFS does not. Instead, GFS provides a client library for applications to interact with the file system. The client library internally performs the necessary checks for correct usage.

File Snapshots

GFS file snapshots employ a copy-on-write mechanism similar to Linux's fork(2) system call. In fork(2), a child process initially has exactly the same state as its parent. Instead of fully copying all memory pages for the child, the parent's pages are marked write-protected. When either process modifies a shared page, a copy is created, and the modification occurs on the copy, effectively diverging the process states.

Similarly, a GFS file snapshot creates a new file that initially shares the existing file's underlying chunks. When the chunks are modified for any of the files, a copy-on-write approach is used.

Here's the detailed process:

  1. Lease Revocation: Upon receiving a snapshot request, the master revokes all leases on the target file. This prevents any further modifications, akin to write-protecting memory pages.
  2. Metadata Duplication: The master creates a new file entry, replicating all metadata and chunk information from the original file. For each shared chunk, a reference counter is incremented.
  3. Copy-on-Write: Any modification to a chunk with a reference count greater than one triggers a copy. The new chunk copy is then leased, its metadata is updated, and its location is provided to the client. To speed up the copy process, the new chunk is typically created on the same chunk server.
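
A compact sketch of the copy-on-write bookkeeping (hypothetical structure; the real master also handles leases and replica placement for the new chunk):

class ChunkStore:
    """Hypothetical copy-on-write bookkeeping: chunks are shared by reference count
    and copied only when a shared chunk is about to be mutated."""
    def __init__(self):
        self.refcount = {}
        self.files = {}

    def create(self, path, chunks):
        self.files[path] = list(chunks)
        for c in chunks:
            self.refcount[c] = self.refcount.get(c, 0) + 1

    def snapshot(self, src, dst):
        self.create(dst, self.files[src])       # share the chunks, bump their refcounts

    def prepare_write(self, path, index):
        chunk = self.files[path][index]
        if self.refcount[chunk] > 1:            # shared with a snapshot: copy before mutating
            self.refcount[chunk] -= 1
            new_chunk = chunk + "-copy"
            self.refcount[new_chunk] = 1
            self.files[path][index] = new_chunk
            return new_chunk
        return chunk

store = ChunkStore()
store.create("/data/file", ["c1", "c2"])
store.snapshot("/data/file", "/data/file.snap")
print(store.prepare_write("/data/file", 0))     # 'c1-copy': the snapshot keeps sharing 'c1'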

Garbage Collection

GFS employs a soft-delete mechanism for files. When a file is deleted, the master server doesn't immediately remove its data. Instead, it renames the file to a hidden name that includes the deletion timestamp, marking it as deleted. This allows for potential recovery of accidentally deleted files.

Subsequently, a background scan by the master server identifies and removes the metadata associated with these deleted files. Chunk servers, during their regular heartbeat communications, receive a list of unknown chunks from the master, prompting them to delete those chunks.
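
The heartbeat-driven cleanup can be illustrated with a one-line set difference (hypothetical helper):

def chunks_to_delete(master_known, server_reported):
    """Hypothetical heartbeat exchange: chunks a server reports but the master no longer
    knows about (their files were garbage-collected) can safely be deleted by the server."""
    return server_reported - master_known

print(chunks_to_delete({"c1", "c2"}, {"c1", "c2", "c9"}))   # {'c9'}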

While this approach might seem slow, it offers several advantages:

  • Eventual Consistency: This method ensures eventual consistency in a straightforward manner. Periodic scanning is a common technique in distributed systems for synchronizing state across components, ensuring that all states eventually converge. It avoids the complexity of tracking real-time diffs, which can be unreliable when the components being synchronized are separate, directly accessed systems.
  • Accidental Deletion Recovery: The inherent delay in garbage collection provides a window for rolling back accidental file deletions.

Checksum Verification

During append-writes, if the operation encounters a partially filled sub-chunk at the end of the file, the new data is appended after the existing content. The checksum for this partial sub-chunk is then incrementally updated to reflect the added data.


This mechanism also serves as a corruption detection tool; if the partial sub-chunk was previously corrupted, the updated checksum will not match, alerting the operation to the inconsistency.
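
As an illustration, a checksum such as CRC32 can be extended incrementally over appended bytes; the sketch below (not GFS's actual code) shows both the incremental update and how prior corruption surfaces on the next read:

import zlib

def incremental_append_crc(existing_crc, new_data):
    """Extend a running CRC32 over appended bytes without re-reading the old bytes.
    (zlib.crc32 accepts the previous checksum as its second argument.)"""
    return zlib.crc32(new_data, existing_crc)

def verify_on_read(data, stored_crc):
    return zlib.crc32(data) == stored_crc

old = b"existing partial sub-chunk"
crc = zlib.crc32(old)

# Append without touching the old bytes; only the new bytes are hashed.
crc = incremental_append_crc(crc, b" + appended record")
print(verify_on_read(old + b" + appended record", crc))                      # True

# If the old bytes were silently corrupted, the incrementally updated checksum will not
# match the on-disk data, so the corruption surfaces on the next read of the sub-chunk.
print(verify_on_read(b"eXisting partial sub-chunk + appended record", crc))  # False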

Paper Review

The GFS paper is a comprehensive exploration of key distributed systems concepts, and its impact is undeniable, boasting over 10,000 citations. This system stands as one of Google's most significant engineering accomplishments.

Despite its depth, the paper is simple and provides a fascinating insight into distributed file system design. For anyone passionate about distributed systems, I highly recommend reading it. It's a staple of distributed systems literature and a frequent topic of discussion in paper reading groups.
