Having explored a range of file systems in previous discussions, we now turn our attention to their distributed counterparts. Specifically, we will examine the Google File System (GFS). The paper was written by Sanjay Ghemawat and his colleagues. Ghemawat is a Google Fellow and a highly respected engineer; his contributions extend far beyond GFS, encompassing core libraries and systems such as MapReduce, making him an indispensable part of Google's technological foundation. It is difficult to envision the current state of Google's infrastructure without his contributions.
The paper, presented in SOSP '03, marked the introduction of a system that has since evolved into the critical infrastructure component known as Colossus. This system supports a vast array of Google's operations and is of immense importance to the company's overall functionality. It is crucial to acknowledge that the architecture of Colossus has undergone significant transformations since its initial conception, developing into a highly sophisticated and robust platform.
Despite its current complexity, the initial design of its predecessor, GFS, prioritized simplicity in its large-scale distributed architecture. This focus on simplification was deliberate, aimed at achieving performance and efficient operations. By making specific simplifying assumptions, the system was able to accelerate its processes and handle the demands of Google's growing data needs.
Extended Read: Colossus under the hood: a peek into Google’s scalable storage system
Recommended Read: Paper Insights - The Design and Implementation of a Log-Structured File System, where I introduced several basic file system concepts.
Let's begin!
Network File System
Network File System vs. Distributed File System
NFS centralizes file storage on a single server. Client requests are routed over the network to this server. This creates a shared environment. However, it is not a distributed one offering fault tolerance. Consequently, if the server fails, the entire file system becomes inaccessible.
In contrast, a distributed file system distributes the file system itself across multiple servers. Data is replicated and spread across these servers, providing inherent fault tolerance. If one server fails, the system continues to operate using the replicated data on other servers.
The Google File System
Disclaimer: In this article, the discussion centers on the original GFS and not its successor, Colossus. It's important to note that the original GFS paper refers to the metadata server as the master. In Colossus, this component has been redesigned and is now known as the curator, leveraging Bigtable for persistent storage. I acknowledge that the terminology used in the original paper may be considered insensitive and apologize for any discomfort it may cause.
GFS is built on several key principles. Before diving in, let's highlight two: immutability and simplicity.
Immutability
Echoing the same concept found in log-structured file systems, GFS operates on an append-only model - data, once written, whether successfully or not, cannot be modified. Files accept only append-writes. Additionally, a file can only accept append-writes if it is not frozen (immutable).
Important Note: While the paper mentions file overwrites, GFS has evolved to strictly enforce append-only writes. In this article, we will also focus only on appends, and so all mutations correspond to append-writes.
Simplicity
GFS stands out as a remarkably simple large-scale distributed system, making it an ideal introductory topic in any distributed system course.
This simplicity stems from deliberate design choices, such as -
- Consistency over Availability - GFS enforces a single master instance, sacrificing availability for consistency. Even then, GFS only offers eventual consistency, which greatly simplifies the design.
- No Atomicity - GFS employs a straightforward replication strategy during append-writes - data is synchronously replicated across all replicas. If any (but not all) replica fails to commit the append, it results in an "inconsistent" (not fully committed) state of the corresponding byte range. This inconsistent state is permanent and needs to be handled by applications.
- No Isolation - GFS readers may be able to read uncommitted data while some append-write is still in progress. GFS does not provide isolation guarantees, and therefore cannot be externally consistent.
These choices contrast with systems like ZooKeeper and Spanner, which do provide high availability, atomicity, and external consistency. However, the choices allowed for a much simpler implementation.
Important Note: It is worth noting that Colossus has evolved significantly. Unlike the original GFS, modern Colossus achieves high availability and supports several replication schemes.
File Structure
GFS organizes files into chunks and sub-chunks:
- Chunks: Files are segmented into 64MB chunks. Each chunk is replicated across multiple chunk servers (described later) for fault tolerance. Chunk servers store these chunks within their local Linux file systems, potentially as individual files on disk.
- Sub-Chunks: Each 64MB chunk is further divided into 64KB sub-chunks. Checksums are calculated and stored at the sub-chunk level. These checksums are crucial for detecting data corruption.
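To make the chunk/sub-chunk structure concrete, here is a minimal sketch of how a byte offset in a file maps to a chunk index, a chunk-internal offset, and a sub-chunk (checksum block) index. It assumes only the 64MB/64KB sizes described above; everything else is illustrative.

```python
# Minimal sketch: mapping a file byte offset to GFS chunk coordinates.
# Sizes follow the article (64 MB chunks, 64 KB checksummed sub-chunks).

CHUNK_SIZE = 64 * 1024 * 1024      # 64 MB
SUB_CHUNK_SIZE = 64 * 1024         # 64 KB

def locate(offset: int) -> tuple[int, int, int]:
    """Return (chunk_index, offset_in_chunk, sub_chunk_index) for a file offset."""
    chunk_index = offset // CHUNK_SIZE
    offset_in_chunk = offset % CHUNK_SIZE
    sub_chunk_index = offset_in_chunk // SUB_CHUNK_SIZE
    return chunk_index, offset_in_chunk, sub_chunk_index

if __name__ == "__main__":
    # Byte 200,000,000 lands in chunk 2, about 62.7 MB into it, sub-chunk 1003.
    print(locate(200_000_000))   # (2, 65782272, 1003)
```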
Choice of Chunk Size
Chunk size, a configurable parameter, was set to 64 MB based on workload analysis. This decision reflects a trade-off:
- Metadata Efficiency: Larger chunks reduce metadata operations during reads and mutations, and minimize master metadata storage requirements.
- Hotspot Potential: Conversely, larger chunks concentrate data on fewer servers, increasing the risk of hotspots, particularly during tail reads.
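A quick back-of-the-envelope calculation makes the metadata-efficiency argument concrete. The figure of roughly 64 bytes of master metadata per chunk comes from the paper; the petabyte of file data is just an illustrative workload.

```python
# Back-of-the-envelope: master metadata needed per petabyte of file data,
# as a function of chunk size, assuming ~64 bytes of metadata per chunk.

PB = 10**15
METADATA_PER_CHUNK = 64  # bytes (approximate figure from the paper)

def master_metadata_bytes(data_bytes: int, chunk_size: int) -> int:
    chunks = -(-data_bytes // chunk_size)  # ceiling division
    return chunks * METADATA_PER_CHUNK

for chunk_mb in (1, 64):
    size = chunk_mb * 1024 * 1024
    mb = master_metadata_bytes(PB, size) / 10**6
    print(f"{chunk_mb:>3} MB chunks -> ~{mb:,.0f} MB of master metadata per PB")
```

With 64 MB chunks, a petabyte of data needs under 1 GB of master metadata; with 1 MB chunks it would need roughly 64 times more, which would not fit comfortably in a single master's memory.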
Choice of Replication Factor

By default, each chunk is replicated on three chunk servers. The replication factor is configurable per file: files that need higher availability, or that are read by many clients simultaneously, can be replicated more widely at the cost of additional storage.
Architecture
GFS employs a multi-tiered architecture.
Global Setup
GFS deployments consist of independent clusters distributed globally. These clusters operate autonomously, with no inter-cluster communication. Therefore, the core distributed aspects of GFS are primarily confined to the cluster level. The remainder of this article will focus on the internal workings of a single GFS cluster.
Cluster Setup
Like any file system, GFS manages both file data (chunks) and metadata. In traditional file systems, metadata is often stored in inodes. GFS adopts a similar concept but distributes it across two key components:
- Master: This server manages the file metadata, including file attributes and chunk location information.
- Chunk Servers: These servers store the raw file data, divided into fixed-size chunks.
The Master
The master manages the file metadata, ensuring the integrity and availability of the file system.
Namespace Management
The master stores the entire directory structure. Internally, GFS represents the namespace as a flat key-value store and not as a traditional tree. For example, if there is a file /a/b/file.txt, then the keys would be /a, /a/b, and /a/b/file.txt.

Values associated with these keys include:
- File/directory metadata (attributes).
- Read/write locks. Acquiring locks on a file requires read locking all parent directories.
- The chunk sequence number.
- The latest version number (described later) for the chunk.
- Address of the chunk servers holding the latest version number of the chunk.
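The following is a simplified sketch, not GFS code, of this flat namespace representation and the locking rule mentioned above (read-locking every ancestor of the target path). The Node fields and the use of a plain RLock are assumptions made for illustration.

```python
# Simplified sketch of the master's flat namespace: every path maps to a
# metadata record, and an operation read-locks all ancestors of its target.
import threading
from dataclasses import dataclass, field

@dataclass
class Node:
    attributes: dict = field(default_factory=dict)   # file/directory metadata
    chunks: list = field(default_factory=list)       # chunk handles, in order
    lock: threading.RLock = field(default_factory=threading.RLock)  # stand-in for a read/write lock

class Namespace:
    def __init__(self):
        self.table: dict[str, Node] = {"/": Node()}

    @staticmethod
    def ancestors(path: str) -> list[str]:
        # "/a/b/file.txt" -> ["/", "/a", "/a/b"]
        parts = path.strip("/").split("/")
        return ["/"] + ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]

    def create_file(self, path: str) -> Node:
        # Lock every ancestor key; a real master would take read locks here
        # and a write lock on the leaf key itself.
        held = []
        for ancestor in self.ancestors(path):
            self.table[ancestor].lock.acquire()
            held.append(ancestor)
        try:
            return self.table.setdefault(path, Node())
        finally:
            for ancestor in reversed(held):
                self.table[ancestor].lock.release()

ns = Namespace()
ns.table["/a"] = Node()
ns.table["/a/b"] = Node()
print(ns.ancestors("/a/b/file.txt"))   # ['/', '/a', '/a/b']
ns.create_file("/a/b/file.txt")
print(sorted(ns.table))                # flat keys, no tree structure
```

Because there is no tree, there are no per-directory data structures to keep consistent; a lookup or lock acquisition is just a handful of key accesses.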
Replication and Logging
Master state is replicated for reliability using a distributed log, periodically flushed to disk. Only one active master handles mutations and all background tasks, and is therefore also responsible for logging all metadata modifications. Shadow masters serve read requests based on the logs that the active master generates.

Checkpointing

When the operation log grows beyond a certain size, the master checkpoints its full in-memory state to disk. Recovery then only needs to replay the log records appended after the latest checkpoint, which keeps restart times short.

Failure Recovery

If the active master fails, external monitoring infrastructure restarts the master process, possibly on a different machine. The new instance recovers by loading the latest checkpoint and replaying the subsequent log records. Clients address the master through a well-known name (a DNS alias), so a restart on another machine is transparent to them.
The Chunk Servers
Chunk Placement Policy
Data centers often organize machines into racks, each with a shared network switch (Top-of-the-Rack switch). Failure of a rack switch can render all chunk servers within that rack unavailable.

To mitigate rack-level failures, GFS strategically places replicas of the same chunk on chunk servers residing in different racks. This ensures higher fault tolerance against correlated failures, such as network switch failures.
Chunk Version Number
For each chunk, the chunk servers track the chunk version number. The master increments a chunk's version number whenever it grants a new lease on the chunk (that is, whenever a new round of mutations is about to begin), and the up-to-date replicas record the new version.
The heartbeat messages also contain the version number of all the chunks on a chunk server. This version number is used by the master to distinguish between stale and up-to-date replicas for each chunk.
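Here is a toy illustration of how the master can use heartbeat-reported version numbers to classify replicas; the function and field names are made up for this sketch.

```python
# Toy illustration: classifying replicas of a chunk as up-to-date or stale
# by comparing heartbeat-reported version numbers against the master's record.

def classify_replicas(master_version: int, heartbeat_versions: dict[str, int]):
    """heartbeat_versions maps chunk server address -> reported version."""
    up_to_date = [s for s, v in heartbeat_versions.items() if v == master_version]
    stale = [s for s, v in heartbeat_versions.items() if v < master_version]
    # A version newer than the master's record means the master failed after
    # granting a lease but before persisting it; it adopts the higher version.
    ahead = [s for s, v in heartbeat_versions.items() if v > master_version]
    return up_to_date, stale, ahead

print(classify_replicas(7, {"cs1": 7, "cs2": 6, "cs3": 7}))
# (['cs1', 'cs3'], ['cs2'], [])
```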
Consistency Model
Metadata Consistency
Metadata operations are atomic. This atomicity is ensured by the single master server, which handles all metadata operations sequentially.

Chunk Consistency

After a mutation, the affected byte range of a chunk can be in one of two states:
- Consistent: All replicas of the chunk contain the same data for the byte range.
- Inconsistent: Different replicas of the chunk contain different data for the byte range.
Leases and Mutation Order
As previously discussed, each chunk has a designated primary replica, which is granted a lease by the GFS master. This primary server is the central coordinator for all mutations to the chunk, responsible for establishing a total order of these operations.
The lease has a default timeout of 60 seconds. To maintain its lease, the primary can request extensions, which can be piggybacked into the regular heartbeat messages exchanged with the master.
To enforce a total order, the primary assigns a serial number to each mutation. This serial number is a unique sequence number generated by the primary, providing a consistent ordering across all mutations applied to the chunk and guaranteeing that all replicas converge on the same final state.
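The sketch below shows how a replica can apply mutations strictly in the primary-assigned serial order even if requests arrive out of order. The buffering scheme is illustrative, not something taken from the paper.

```python
# Sketch: a replica applies mutations in the primary-assigned serial order,
# buffering any that arrive early. The buffering is illustrative only.
import heapq

class Replica:
    def __init__(self):
        self.data = bytearray()
        self.next_serial = 1          # next serial number expected
        self.pending = []             # min-heap of (serial, payload)

    def receive(self, serial: int, payload: bytes):
        heapq.heappush(self.pending, (serial, payload))
        # Apply every mutation whose turn has come.
        while self.pending and self.pending[0][0] == self.next_serial:
            _, ready = heapq.heappop(self.pending)
            self.data.extend(ready)   # append-write
            self.next_serial += 1

r = Replica()
r.receive(2, b"world")   # arrives early, buffered
r.receive(1, b"hello ")  # unblocks serial 2
print(r.data.decode())   # "hello world"
```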
Fault Tolerance for Chunk Servers
While we've established the master's single-instance availability through external monitoring, the availability of chunk servers presents a different set of challenges. Chunk servers can experience various failures, including -
- Crashes, leading to redundancy loss for all the chunks they host.
- Data corruption for specific chunks, for example, when disks fail.
Primary Chunk Server Failure
When a primary chunk server fails, the master must select a new primary. A new primary is elected as follows:
- Upon primary failure, the master queries remaining replicas for their chunk version numbers.
- Among the replicas with the latest version (up-to-date replicas), the master selects a new primary.
- The chunk version number is then incremented, and this updated version number is passed to all up-to-date replicas.
The chunk becomes available for append-writes only after all the steps above are complete.
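The steps above can be expressed as a small sketch from the master's point of view. All names and structures here are hypothetical; the real master also waits for the old lease to expire before re-granting it.

```python
# Sketch of the master's new-primary selection, following the steps above.

def elect_new_primary(replica_versions: dict[str, int]) -> tuple[str, int, list[str]]:
    """replica_versions: chunk server -> reported chunk version number."""
    latest = max(replica_versions.values())
    up_to_date = [s for s, v in replica_versions.items() if v == latest]
    primary = up_to_date[0]                 # any up-to-date replica will do
    new_version = latest + 1                # bump the version with the new lease
    # The new version is pushed to every up-to-date replica; stale replicas
    # (version < latest) are excluded and later garbage collected.
    return primary, new_version, up_to_date

print(elect_new_primary({"cs1": 6, "cs2": 7, "cs3": 7}))
# ('cs2', 8, ['cs2', 'cs3'])
```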
Secondary Chunk Server Failure
When a secondary chunk server fails, the master assigns the affected chunks to a new chunk server. This triggers a re-replication process (described next), where the new server fetches the current chunk state from other replicas.
After the chunk state is fetched by the new chunk server, the chunk becomes available again for mutations.
Re-replication and Rebalancing
Re-replication is the process of restoring lost replicas and is crucial for maintaining data redundancy. When a new chunk server is assigned responsibility for previously unowned chunks, it retrieves the latest chunk data from existing replicas.
Re-replication is subject to resource constraints:
- Bandwidth limitations: There's a limit on the bandwidth a chunk server can dedicate to re-replication.
- Prioritization: Re-replication prioritizes chunks based on importance:
- Chunks actively requested by clients take precedence, to restore availability quickly.
- Chunks with higher redundancy loss are prioritized over others.

Rebalancing is a related background task: the master periodically examines the replica distribution and migrates replicas to even out disk utilization and load across chunk servers.
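The prioritization described in the list above could look roughly like the following; the scoring scheme and the default replication factor of 3 are assumptions for illustration.

```python
# Illustrative prioritization of chunks waiting for re-replication: chunks
# blocking live client requests come first, then chunks that have lost the
# most replicas. The scoring scheme here is made up for illustration.
from dataclasses import dataclass

REPLICATION_FACTOR = 3  # assumed default

@dataclass
class ChunkState:
    handle: str
    live_replicas: int
    blocking_client: bool  # a client is currently waiting on this chunk

def replication_queue(chunks: list[ChunkState]) -> list[str]:
    def priority(c: ChunkState):
        missing = REPLICATION_FACTOR - c.live_replicas
        return (c.blocking_client, missing)     # higher tuple = more urgent
    return [c.handle for c in sorted(chunks, key=priority, reverse=True)]

queue = replication_queue([
    ChunkState("c1", live_replicas=2, blocking_client=False),
    ChunkState("c2", live_replicas=1, blocking_client=False),
    ChunkState("c3", live_replicas=2, blocking_client=True),
])
print(queue)   # ['c3', 'c2', 'c1']
```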
Stale Replica Removal
Stale replicas arise when a chunk server misses mutations, for instance, because it has been offline for an extended period. This scenario occurs when the master reassigns the chunk to another chunk server and subsequent mutations happen while the original server is still offline.
Consequently, the returning server's replica will lack the current chunk version number. The master leverages this version number to identify and remove these outdated replicas.
Reading a File
Reading a file in GFS is straightforward:
- The client first obtains chunk information, including the latest version number, from the master and caches it.
- Then, using the replica locations, the client selects the closest chunk server for data retrieval. Closeness is determined by IP address proximity, leveraging the cluster's network topology to minimize latency.
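A client-side sketch of this read path follows. The rpc() helper, the method names, and the "shared IP prefix" closeness measure are placeholders, not the real GFS client API.

```python
# Sketch of the client-side read path: fetch and cache chunk locations from
# the master, then read from the closest replica.

CHUNK_SIZE = 64 * 1024 * 1024

def rpc(server: str, method: str, **kwargs):
    """Placeholder for a network call to the master or a chunk server."""
    raise NotImplementedError

chunk_cache: dict[tuple[str, int], dict] = {}   # (path, chunk_index) -> info

def closeness(client_ip: str, server_ip: str) -> int:
    # Crude proximity: number of leading IP octets in common.
    return sum(1 for x, y in zip(client_ip.split("."), server_ip.split(".")) if x == y)

def read(master: str, client_ip: str, path: str, offset: int, length: int) -> bytes:
    key = (path, offset // CHUNK_SIZE)
    if key not in chunk_cache:
        # Assumed to return the chunk handle, latest version and replica addresses.
        chunk_cache[key] = rpc(master, "GetChunkInfo", path=path, index=key[1])
    info = chunk_cache[key]
    replica = max(info["replicas"], key=lambda ip: closeness(client_ip, ip))
    return rpc(replica, "ReadChunk", handle=info["handle"],
               version=info["version"], offset=offset % CHUNK_SIZE, length=length)
```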
Read Failure Scenarios
During read operations, two potential issues can arise:
- Stale Replicas: A replica might be stale, meaning it lacks the most recent mutations.
- Inconsistent Byte Range: Alternatively, it might contain a mutation that is either partially committed or that failed completely, resulting in an inconsistent byte range.
Stale replicas are relatively easy to handle. When the client refreshes metadata from the master, it will obtain the locations of up-to-date replicas and retrieve the data.
However, handling uncommitted mutations is more complex. Readers must employ application-level mechanisms such as checksums to validate byte ranges and detect duplicates (which can occur if a writer retries a failed append-write). Data record formats, such as Record I/O, can automate these checks and handle these situations without requiring manual intervention from the user.
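To illustrate what such application-level validation looks like, here is a self-framing record format with a per-record checksum and a writer-chosen record id for deduplication. This framing is made up for illustration; it is not the actual RecordIO wire format.

```python
# Illustrative record framing for append-only GFS files: each record carries a
# length, a CRC and a record id. Readers skip corrupt byte ranges and drop
# duplicates caused by retried appends.
import struct, zlib

HEADER = struct.Struct("<IIQ")   # payload length, crc32, record id

def encode(record_id: int, payload: bytes) -> bytes:
    return HEADER.pack(len(payload), zlib.crc32(payload), record_id) + payload

def decode(buf: bytes) -> list[bytes]:
    seen, pos, records = set(), 0, []
    while pos + HEADER.size <= len(buf):
        length, crc, rid = HEADER.unpack_from(buf, pos)
        payload = buf[pos + HEADER.size : pos + HEADER.size + length]
        if len(payload) == length and zlib.crc32(payload) == crc:
            if rid not in seen:          # drop duplicates from retried appends
                seen.add(rid)
                records.append(payload)
            pos += HEADER.size + length
        else:
            pos += 1                     # corrupt/partial region: resynchronize
    return records

# A successful append, some garbage, and a retried duplicate append.
data = encode(1, b"alpha") + b"\x00\xff\x00" + encode(2, b"beta") + encode(2, b"beta")
print(decode(data))   # [b'alpha', b'beta']
```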
Writing a File
Writing to a file in GFS is a multi-stage operation. An append-write can target a single chunk or span multiple chunks, with the process repeating for each affected chunk.
Chunk Replica Identification (Steps 1 - 2)
The client first requests the primary and secondary replica locations for the target chunk from the master server. The master, upon receiving this request, checks for an existing primary. If none exists (for instance, when it is a new chunk), it grants a lease to one of the replicas before responding to the client.
Data Upload (Step 3)
The client then uploads the data to all chunk server replicas of the chunk. This step ensures all replicas have the data before the commit phase begins.
Notably, the client uploads the data only once. The data is replicated in a linear chain across the replicas to optimize bandwidth utilization: each chunk server in the chain forwards the data to the next closest server, where closeness is estimated from network topology (likely from IP address prefixes, which allow hop counts to be inferred).
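A small sketch of how such a push chain could be formed follows. The "shared leading IP octets" measure is a stand-in for real topology information, not how GFS actually computes distance.

```python
# Sketch of forming the linear push chain for data upload: each hop forwards
# to the closest server that has not yet received the data.

def shared_prefix(a: str, b: str) -> int:
    return sum(1 for x, y in zip(a.split("."), b.split(".")) if x == y)

def push_chain(client_ip: str, replicas: list[str]) -> list[str]:
    chain, current, remaining = [], client_ip, list(replicas)
    while remaining:
        nxt = max(remaining, key=lambda ip: shared_prefix(current, ip))
        chain.append(nxt)
        remaining.remove(nxt)
        current = nxt
    return chain

# Client in 10.1.2.x pushes to the nearest replica first, then onward.
print(push_chain("10.1.2.5", ["10.2.0.1", "10.1.2.9", "10.1.3.4"]))
# ['10.1.2.9', '10.1.3.4', '10.2.0.1']
```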
Time to Upload
Let's analyze the time required to upload data in GFS, considering factors like bandwidth, latency, and replication.
Assume B bytes are uploaded, T is the available bandwidth in the cluster fabric, L is the latency between any two chunk servers, and R is the replication factor.
Time to establish connections along the replication chain = R*L. Time to transfer the B bytes is B/T. Therefore, the total time is approximated by B/T + R*L.
It's important to note that this calculation is a simplification. It doesn't account for network congestion, which can significantly impact actual bandwidth utilization. In real-world scenarios, the effective bandwidth may be lower than the theoretical maximum.
For example, with a 100 Mbps bandwidth and a 1 ms latency, uploading 1 MB of data would ideally take 80 ms. However, network congestion and other factors could increase this time.
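The estimate above can be reproduced directly from the B/T + R*L formula; the replication factor of 3 is an assumption here, and the R*L term is what the 80 ms figure leaves out.

```python
# Reproducing the transfer-time estimate B/T + R*L from above.

def upload_time(bytes_: int, bandwidth_bps: float, latency_s: float, replicas: int) -> float:
    return bytes_ * 8 / bandwidth_bps + replicas * latency_s

t = upload_time(bytes_=1_000_000, bandwidth_bps=100e6, latency_s=1e-3, replicas=3)
print(f"{t * 1000:.0f} ms")   # ~83 ms: 80 ms of transfer + 3 ms of chained connection setup
```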
LRU Buffer Cache
The received data is temporarily stored in an LRU buffer cache on each replica, not yet committed to the chunk. These cached entries are eventually evicted after a timeout if the mutation is not finalized.
Commit (Step 4 - 7)
The client sends a commit request to the primary replica. The primary assigns a serial number to the mutation and commits the bytes. It is important for the primary to commit the bytes first because the primary must always hold the latest state of the chunk.
Then the primary forwards the request, along with the serial number, to the secondary replicas. Each secondary replica independently applies the mutation to its copy of the chunk and sends a confirmation back to the primary. Once the primary receives acknowledgments from all secondaries, it sends a final acknowledgment back to the client.
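The commit step, seen from the primary's point of view, can be sketched as follows. The classes and calls are illustrative stand-ins, not the real chunk server interfaces.

```python
# Sketch of the commit step: the primary assigns a serial number, applies the
# mutation locally, forwards it to the secondaries, and acknowledges the
# client only if every secondary succeeds.

class SecondaryStub:
    def __init__(self):
        self.data = bytearray()
    def apply(self, serial: int, payload: bytes) -> bool:
        self.data.extend(payload)          # may fail in reality (returning False)
        return True

class Primary:
    def __init__(self, secondaries: list[SecondaryStub]):
        self.data = bytearray()
        self.secondaries = secondaries
        self.next_serial = 1

    def commit(self, payload: bytes) -> bool:
        serial, self.next_serial = self.next_serial, self.next_serial + 1
        self.data.extend(payload)          # primary applies first
        acks = [s.apply(serial, payload) for s in self.secondaries]
        # Any failed secondary leaves that byte range inconsistent; the client
        # is told the append failed and is expected to retry.
        return all(acks)

primary = Primary([SecondaryStub(), SecondaryStub()])
print(primary.commit(b"record-1"))   # True when every replica applied the append
```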
Write Failure Scenarios
What if some secondary replica fails?
If it fails after receiving a mutation request from the primary, the result is an inconsistent state for the byte range. Readers of that byte range may see the appended bytes (if they read from a secondary that applied the append successfully) or garbage. Eventually the failure will be detected by the master, the chunk will be assigned to a new chunk server, and re-replication will be performed. Only then will mutations start succeeding again.
What if the primary replica fails?

This is the scenario covered earlier under primary chunk server failure. Once the primary's lease expires, the master selects a new primary among the up-to-date replicas and increments the chunk version number. The client, after its request times out, refreshes the chunk metadata from the master and retries the mutation against the new primary.
What if the client is interacting with an old primary while a new primary has been assigned?
If a mutation is received by an old primary, it will not be able to commit it on any of the secondary replicas, as the chunk version number would have already been incremented. Eventually, the old primary's replica will be detected as stale and removed.
Miscellaneous Details
Client Library

GFS is not mounted through the operating system's VFS layer. Instead, applications link a client library that implements the file API and talks directly to the master for metadata and to the chunk servers for data. The library caches chunk metadata (locations and version numbers) but never caches file data.
File Snapshots
GFS file snapshots employ a copy-on-write mechanism similar to Linux's fork(2) system call. In fork(2), a child process initially has exactly the same state as the parent. Instead of fully copying all memory pages for the child, the parent's pages are marked write-protected. When either process modifies a shared page, a copy is created and the modification occurs on the copy, effectively diverging the process states.
Similarly, a GFS file snapshot creates a new file that initially shares the existing file's underlying chunks. When the chunks are modified for any of the files, a copy-on-write approach is used.
Here's the detailed process:
- Lease Revocation: Upon receiving a snapshot request, the master revokes all leases on the target file. This prevents any further modifications, akin to write-protecting memory pages.
- Metadata Duplication: The master creates a new file entry, replicating all metadata and chunk information from the original file. For each shared chunk, a reference counter is incremented.
- Copy-on-Write: Any modification to a chunk with a reference count greater than one triggers a copy. The new chunk copy is then leased, its metadata is updated, and its location is provided to the client. To speed up the copy process, the new chunk is typically created on the same chunk server.
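The reference-counting behavior described in the steps above can be modeled with a toy sketch; the data structures are illustrative and, unlike real GFS, no chunk bytes are actually copied here.

```python
# Toy model of snapshot copy-on-write: a snapshot copies the chunk list and
# bumps reference counts; the first mutation to a shared chunk triggers a copy.

class Master:
    def __init__(self):
        self.files: dict[str, list[str]] = {}   # path -> chunk handles
        self.refcount: dict[str, int] = {}      # chunk handle -> #files sharing it
        self._next = 0

    def new_chunk(self) -> str:
        self._next += 1
        handle = f"chunk-{self._next}"
        self.refcount[handle] = 1
        return handle

    def snapshot(self, src: str, dst: str):
        self.files[dst] = list(self.files[src])        # metadata duplication only
        for h in self.files[dst]:
            self.refcount[h] += 1

    def prepare_mutation(self, path: str, index: int) -> str:
        handle = self.files[path][index]
        if self.refcount[handle] > 1:                  # shared: copy on write
            self.refcount[handle] -= 1
            handle = self.new_chunk()                  # real GFS copies the bytes too
            self.files[path][index] = handle
        return handle

m = Master()
m.files["/logs/a"] = [m.new_chunk(), m.new_chunk()]
m.snapshot("/logs/a", "/logs/a.snap")
print(m.prepare_mutation("/logs/a", 0))   # 'chunk-3': a fresh handle for the mutated file
print(m.files["/logs/a.snap"][0])         # 'chunk-1': the snapshot keeps the original
```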
Garbage Collection
GFS employs a soft-delete mechanism for files. When a file is deleted, the master server doesn't immediately remove its data. Instead, it renames the file to a hidden name that records the deletion timestamp, marking it as deleted. This allows for potential recovery of accidentally deleted files.
Subsequently, a background scan by the master identifies and removes the metadata associated with these deleted files once a grace period has passed. Chunk servers report the chunks they hold in their regular heartbeat messages; the master replies with the chunks that no longer appear in its metadata, prompting the chunk servers to delete them.
While this approach might seem slow, it offers several advantages:
- Eventual Consistency: This method ensures eventual consistency in a straightforward manner. Periodic scanning is a common technique in distributed systems for synchronizing state across components, ensuring that all states eventually converge. It avoids the complexity of tracking real-time diffs, which can be unreliable when the components involved are external systems accessed directly.
- Accidental Deletion Recovery: The inherent delay in garbage collection provides a window for rolling back accidental file deletions.
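The soft-delete flow described above can be sketched as follows; the hidden-name encoding and the three-day grace period are illustrative assumptions.

```python
# Sketch of soft delete and garbage collection: deletion renames the file to a
# hidden name carrying the deletion time; a background scan later drops the
# metadata, and chunks referenced by no file become eligible for deletion on
# the chunk servers.
import time

GRACE_PERIOD = 3 * 24 * 3600   # seconds (assumed)

class MasterGC:
    def __init__(self):
        self.files: dict[str, list[str]] = {}   # path -> chunk handles

    def delete(self, path: str):
        hidden = f"{path}.deleted.{int(time.time())}"
        self.files[hidden] = self.files.pop(path)       # soft delete: just a rename

    def background_scan(self, now: float) -> set[str]:
        # Drop hidden files past the grace period, then compute the live chunk set.
        for path in list(self.files):
            if ".deleted." in path and now - int(path.rsplit(".", 1)[1]) > GRACE_PERIOD:
                del self.files[path]
        live = {h for handles in self.files.values() for h in handles}
        return live        # chunk servers may delete any chunk not in this set

gc = MasterGC()
gc.files["/tmp/old"] = ["chunk-1", "chunk-2"]
gc.files["/keep"] = ["chunk-3"]
gc.delete("/tmp/old")
print(gc.background_scan(now=time.time() + GRACE_PERIOD + 1))   # {'chunk-3'}
```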
Checksum Verification

Each chunk server verifies the checksum of every 64KB sub-chunk it serves before returning the data to a reader. On a mismatch, the chunk server returns an error to the requester and reports the corruption to the master; the master re-replicates the chunk from a healthy replica, after which the corrupted replica is deleted. Chunk servers also scan and verify rarely accessed chunks in the background during idle periods, so corruption in cold data does not go unnoticed.
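A minimal sketch of per-sub-chunk verification on the read path follows; the storage layout and error handling are illustrative, not the chunk server's actual implementation.

```python
# Sketch of sub-chunk checksum verification on read: each 64 KB block of a
# chunk has its own checksum, verified before data is returned.
import zlib

SUB_CHUNK_SIZE = 64 * 1024

class StoredChunk:
    def __init__(self, data: bytes):
        self.data = data
        self.checksums = [zlib.crc32(data[i:i + SUB_CHUNK_SIZE])
                          for i in range(0, len(data), SUB_CHUNK_SIZE)]

    def read(self, offset: int, length: int) -> bytes:
        first = offset // SUB_CHUNK_SIZE
        last = (offset + length - 1) // SUB_CHUNK_SIZE
        for i in range(first, last + 1):           # verify every covered block
            block = self.data[i * SUB_CHUNK_SIZE:(i + 1) * SUB_CHUNK_SIZE]
            if zlib.crc32(block) != self.checksums[i]:
                # A real chunk server reports the mismatch to the master, which
                # re-replicates from a healthy replica and discards this one.
                raise IOError(f"checksum mismatch in sub-chunk {i}")
        return self.data[offset:offset + length]

chunk = StoredChunk(b"x" * (3 * SUB_CHUNK_SIZE))
print(len(chunk.read(offset=70_000, length=100_000)))   # 100000
```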
Paper Review
The GFS paper is a comprehensive exploration of key distributed systems concepts, and its impact is undeniable, boasting over 10,000 citations. This system stands as one of Google's most significant engineering accomplishments.
Despite its depth, the paper is simple and provides a fascinating insight into distributed file system design. For anyone passionate about distributed systems, I highly recommend reading it. It's a staple of distributed systems literature and a frequent topic of discussion in paper reading groups.