Having explored a range of file systems in previous discussions, we now turn our attention to their distributed counterparts. Specifically, we will examine the Google File System (GFS). The paper was written by Sanjay Ghemawat and his colleagues. Ghemawat is a Google Fellow and a highly respected engineer; his contributions extend far beyond GFS, encompassing core libraries and systems such as MapReduce, making him an indispensable part of Google's technological foundation. It is difficult to envision the current state of Google's infrastructure without his contributions.
The paper, presented in SOSP '03, marked the introduction of a system that has since evolved into the critical infrastructure component known as Colossus. This system supports a vast array of Google's operations and is of immense importance to the company's overall functionality. It is crucial to acknowledge that the architecture of Colossus has undergone significant transformations since its initial conception, developing into a highly sophisticated and robust platform.
Despite its current complexity, the initial design of its predecessor, GFS, prioritized simplicity in its large-scale distributed architecture. This focus on simplification was deliberate, aimed at achieving performance and efficient operations. By making specific simplifying assumptions, the system was able to accelerate its processes and handle the demands of Google's growing data needs.
Extended Read: Colossus under the hood: a peek into Google’s scalable storage system
Recommended Read: Paper Insights - The Design and Implementation of a Log-Structured File System, where I introduced several basic file system concepts.
Let's begin!
Network File System
Network File System vs. Distributed File System
NFS centralizes file storage on a single server. Client requests are routed over the network to this server. This creates a shared environment. However, it is not a distributed one offering fault tolerance. Consequently, if the server fails, the entire file system becomes inaccessible.
In contrast, a distributed file system distributes the file system itself across multiple servers. Data is replicated and spread across these servers, providing inherent fault tolerance. If one server fails, the system continues to operate using the replicated data on other servers.
The Google File System
Disclaimer: In this article, the discussion centers on the original GFS and not its successor, Colossus. It's important to note that the original GFS paper refers to the metadata server as the master. In Colossus, this component has been redesigned and is now known as the curator, leveraging Bigtable for persistent storage. I acknowledge that the terminology used in the original paper may be considered insensitive and apologize for any discomfort it may cause.
GFS is built on several key principles. Before diving in, let's highlight two: immutability and simplicity.
Immutability
Echoing the same concept found in log-structured file systems, GFS operates on an append-only model - data, once written, whether successfully or not, cannot be modified. Files accept only append-writes. Additionally, a file can only accept append-writes if it is not frozen (immutable).
Important Note: While the paper mentions file overwrites, GFS has evolved to strictly enforce append-only writes. In this article, we will also focus only on appends, and so all mutations correspond to append-writes.
Simplicity
GFS stands out as a remarkably simple large-scale distributed system, making it an ideal introductory topic in any distributed system course.
This simplicity stems from deliberate design choices, such as -
- Consistency over Availability - GFS enforces a single master instance, sacrificing availability for consistency. Even then, GFS only offers eventual consistency, which greatly simplifies the design.
- No Atomicity - GFS employs a straightforward replication strategy during append-writes - data is synchronously replicated across all replicas. If any (but not all) replica fails to commit the append, it results in an "inconsistent" (not fully committed) state of the corresponding byte range. This inconsistent state is permanent and needs to be handled by applications.
- No Isolation - GFS readers may be able to read uncommitted data while some append-write is still in progress. GFS does not provide isolation guarantees, and therefore cannot be externally consistent.
These choices contrast with systems like ZooKeeper and Spanner, which do provide high availability, atomicity, and external consistency. However, the choices allowed for a much simpler implementation.
Important Note: It is worth noting that Colossus has evolved significantly. Unlike the original GFS, modern Colossus achieves high availability and supports several replication schemes.
File Structure
GFS organizes files into chunks and sub-chunks:
- Chunks: Files are segmented into 64MB chunks. Each chunk is replicated across multiple chunk servers (described later) for fault tolerance. Chunk servers store these chunks within their local Linux file systems, potentially as individual files on disk.
- Sub-Chunks: Each 64MB chunk is further divided into 64KB sub-chunks. Checksums are calculated and stored at the sub-chunk level. These checksums are crucial for detecting data corruption.
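To make the chunk/sub-chunk structure concrete, here is a minimal sketch of how a byte offset in a file maps to a chunk index, a chunk-internal offset, and a sub-chunk (checksum block) index. It assumes only the 64MB/64KB sizes described above; everything else is illustrative.

```python
# Minimal sketch: mapping a file byte offset to GFS chunk coordinates.
# Sizes follow the article (64 MB chunks, 64 KB checksummed sub-chunks).

CHUNK_SIZE = 64 * 1024 * 1024      # 64 MB
SUB_CHUNK_SIZE = 64 * 1024         # 64 KB

def locate(offset: int) -> tuple[int, int, int]:
    """Return (chunk_index, offset_in_chunk, sub_chunk_index) for a file offset."""
    chunk_index = offset // CHUNK_SIZE
    offset_in_chunk = offset % CHUNK_SIZE
    sub_chunk_index = offset_in_chunk // SUB_CHUNK_SIZE
    return chunk_index, offset_in_chunk, sub_chunk_index

if __name__ == "__main__":
    # Byte 200,000,000 lands in chunk 2, about 62.7 MB into it, sub-chunk 1003.
    print(locate(200_000_000))   # (2, 65782272, 1003)
```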
Choice of Chunk Size
Chunk size, a configurable parameter, was set to 64 MB based on workload analysis. This decision reflects a trade-off:
- Metadata Efficiency: Larger chunks reduce metadata operations during reads and mutations, and minimize master metadata storage requirements.
- Hotspot Potential: Conversely, larger chunks concentrate data on fewer servers, increasing the risk of hotspots, particularly during tail reads.
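A quick back-of-the-envelope calculation makes the metadata-efficiency argument concrete. The figure of roughly 64 bytes of master metadata per chunk comes from the paper; the petabyte of file data is just an illustrative workload.

```python
# Back-of-the-envelope: master metadata needed per petabyte of file data,
# as a function of chunk size, assuming ~64 bytes of metadata per chunk.

PB = 10**15
METADATA_PER_CHUNK = 64  # bytes (approximate figure from the paper)

def master_metadata_bytes(data_bytes: int, chunk_size: int) -> int:
    chunks = -(-data_bytes // chunk_size)  # ceiling division
    return chunks * METADATA_PER_CHUNK

for chunk_mb in (1, 64):
    size = chunk_mb * 1024 * 1024
    mb = master_metadata_bytes(PB, size) / 10**6
    print(f"{chunk_mb:>3} MB chunks -> ~{mb:,.0f} MB of master metadata per PB")
```

With 64 MB chunks, a petabyte of data needs under 1 GB of master metadata; with 1 MB chunks it would need roughly 64 times more, which would not fit comfortably in a single master's memory.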
Choice of Replication Factor

By default, each chunk is replicated on three chunk servers. The replication factor is configurable per file: files that need higher availability, or that are read by many clients simultaneously, can be replicated more widely at the cost of additional storage.
Architecture
GFS employs a multi-tiered architecture.
Global Setup
GFS deployments consist of independent clusters distributed globally. These clusters operate autonomously, with no inter-cluster communication. Therefore, the core distributed aspects of GFS are primarily confined to the cluster level. The remainder of this article will focus on the internal workings of a single GFS cluster.
Cluster Setup
Like any file system, GFS manages both file data (chunks) and metadata. In traditional file systems, metadata is often stored in inodes. GFS adopts a similar concept but distributes it across two key components:
- Master: This server manages the file metadata, including file attributes and chunk location information.
- Chunk Servers: These servers store the raw file data, divided into fixed-size chunks.
The Master
The master manages the file metadata, ensuring the integrity and availability of the file system.
Namespace Management
The master stores the entire directory structure. Internally, GFS represents the namespace as a flat key-value store and not as a traditional tree. For example, if there is a file /a/b/file.txt, then the keys would be /a, /a/b, and /a/b/file.txt.

Values associated with these keys include:
- File/directory metadata (attributes).
- Read/write locks. Acquiring locks on a file requires read locking all parent directories.
- The chunk sequence number.
- The latest version number (described later) for the chunk.
- Address of the chunk servers holding the latest version number of the chunk.
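The following is a simplified sketch, not GFS code, of this flat namespace representation and the locking rule mentioned above (read-locking every ancestor of the target path). The Node fields and the use of a plain RLock are assumptions made for illustration.

```python
# Simplified sketch of the master's flat namespace: every path maps to a
# metadata record, and an operation read-locks all ancestors of its target.
import threading
from dataclasses import dataclass, field

@dataclass
class Node:
    attributes: dict = field(default_factory=dict)   # file/directory metadata
    chunks: list = field(default_factory=list)       # chunk handles, in order
    lock: threading.RLock = field(default_factory=threading.RLock)  # stand-in for a read/write lock

class Namespace:
    def __init__(self):
        self.table: dict[str, Node] = {"/": Node()}

    @staticmethod
    def ancestors(path: str) -> list[str]:
        # "/a/b/file.txt" -> ["/", "/a", "/a/b"]
        parts = path.strip("/").split("/")
        return ["/"] + ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]

    def create_file(self, path: str) -> Node:
        # Lock every ancestor key; a real master would take read locks here
        # and a write lock on the leaf key itself.
        held = []
        for ancestor in self.ancestors(path):
            self.table[ancestor].lock.acquire()
            held.append(ancestor)
        try:
            return self.table.setdefault(path, Node())
        finally:
            for ancestor in reversed(held):
                self.table[ancestor].lock.release()

ns = Namespace()
ns.table["/a"] = Node()
ns.table["/a/b"] = Node()
print(ns.ancestors("/a/b/file.txt"))   # ['/', '/a', '/a/b']
ns.create_file("/a/b/file.txt")
print(sorted(ns.table))                # flat keys, no tree structure
```

Because there is no tree, there are no per-directory data structures to keep consistent; a lookup or lock acquisition is just a handful of key accesses.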
Replication and Logging
Master state is replicated for reliability using a distributed log, periodically flushed to disk. Only one active master handles mutations and all background tasks, and is therefore also responsible for logging all metadata modifications. Shadow masters serve read requests based on the logs that the active master generates.

Checkpointing

When the operation log grows beyond a certain size, the master checkpoints its full in-memory state to disk. Recovery then only needs to replay the log records appended after the latest checkpoint, which keeps restart times short.

Failure Recovery

If the active master fails, external monitoring infrastructure restarts the master process, possibly on a different machine. The new instance recovers by loading the latest checkpoint and replaying the subsequent log records. Clients address the master through a well-known name (a DNS alias), so a restart on another machine is transparent to them.
The Chunk Servers
Chunk Placement Policy
Data centers often organize machines into racks, each with a shared network switch (Top-of-the-Rack switch). Failure of a rack switch can render all chunk servers within that rack unavailable.

To mitigate rack-level failures, GFS strategically places replicas of the same chunk on chunk servers residing in different racks. This ensures higher fault tolerance against correlated failures, such as network switch failures.
Chunk Version Number
For each chunk, the chunk servers track the chunk version number. The master increments a chunk's version number whenever it grants a new lease on the chunk (that is, whenever a new round of mutations is about to begin), and the up-to-date replicas record the new version.
The heartbeat messages also contain the version number of all the chunks on a chunk server. This version number is used by the master to distinguish between stale and up-to-date replicas for each chunk.
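Here is a toy illustration of how the master can use heartbeat-reported version numbers to classify replicas; the function and field names are made up for this sketch.

```python
# Toy illustration: classifying replicas of a chunk as up-to-date or stale
# by comparing heartbeat-reported version numbers against the master's record.

def classify_replicas(master_version: int, heartbeat_versions: dict[str, int]):
    """heartbeat_versions maps chunk server address -> reported version."""
    up_to_date = [s for s, v in heartbeat_versions.items() if v == master_version]
    stale = [s for s, v in heartbeat_versions.items() if v < master_version]
    # A version newer than the master's record means the master failed after
    # granting a lease but before persisting it; it adopts the higher version.
    ahead = [s for s, v in heartbeat_versions.items() if v > master_version]
    return up_to_date, stale, ahead

print(classify_replicas(7, {"cs1": 7, "cs2": 6, "cs3": 7}))
# (['cs1', 'cs3'], ['cs2'], [])
```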
Consistency Model
Metadata Consistency
Metadata operations are atomic. This atomicity is ensured by the single master server, which handles all metadata operations sequentially.

Chunk Consistency

After a mutation, the affected byte range of a chunk can be in one of two states:
- Consistent: All replicas of the chunk contain the same data for the byte range.
- Inconsistent: Different replicas of the chunk contain different data for the byte range.
Leases and Mutation Order
As previously discussed, each chunk has a designated primary replica, which is granted a lease by the GFS master. This primary server is the central coordinator for all mutations to the chunk, responsible for establishing a total order of these operations.
The lease has a default timeout of 60 seconds. To maintain its lease, the primary can request extensions, which can be piggybacked into the regular heartbeat messages exchanged with the master.
To enforce a total order, the primary assigns a serial number to each mutation. This serial number is a unique sequence number generated by the primary, providing a consistent ordering across all mutations applied to the chunk and guaranteeing that all replicas converge on the same final state.
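The sketch below shows how a replica can apply mutations strictly in the primary-assigned serial order even if requests arrive out of order. The buffering scheme is illustrative, not something taken from the paper.

```python
# Sketch: a replica applies mutations in the primary-assigned serial order,
# buffering any that arrive early. The buffering is illustrative only.
import heapq

class Replica:
    def __init__(self):
        self.data = bytearray()
        self.next_serial = 1          # next serial number expected
        self.pending = []             # min-heap of (serial, payload)

    def receive(self, serial: int, payload: bytes):
        heapq.heappush(self.pending, (serial, payload))
        # Apply every mutation whose turn has come.
        while self.pending and self.pending[0][0] == self.next_serial:
            _, ready = heapq.heappop(self.pending)
            self.data.extend(ready)   # append-write
            self.next_serial += 1

r = Replica()
r.receive(2, b"world")   # arrives early, buffered
r.receive(1, b"hello ")  # unblocks serial 2
print(r.data.decode())   # "hello world"
```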
Fault Tolerance for Chunk Servers
While we've established the master's single-instance availability through external monitoring, the availability of chunk servers presents a different set of challenges. Chunk servers can experience various failures, including -
- Crashes, leading to redundancy loss for all the chunks they host.
- Data corruption for specific chunks, for example, when disks fail.
Primary Chunk Server Failure
When a primary chunk server fails, the master must select a new primary. A new primary is elected as follows:
- Upon primary failure, the master queries remaining replicas for their chunk version numbers.
- Among the replicas with the latest version (up-to-date replicas), the master selects a new primary.
- The chunk version number is then incremented, and this updated version number is passed to all up-to-date replicas.
The chunk becomes available for append-writes only after all the steps above are complete.
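The steps above can be expressed as a small sketch from the master's point of view. All names and structures here are hypothetical; the real master also waits for the old lease to expire before re-granting it.

```python
# Sketch of the master's new-primary selection, following the steps above.

def elect_new_primary(replica_versions: dict[str, int]) -> tuple[str, int, list[str]]:
    """replica_versions: chunk server -> reported chunk version number."""
    latest = max(replica_versions.values())
    up_to_date = [s for s, v in replica_versions.items() if v == latest]
    primary = up_to_date[0]                 # any up-to-date replica will do
    new_version = latest + 1                # bump the version with the new lease
    # The new version is pushed to every up-to-date replica; stale replicas
    # (version < latest) are excluded and later garbage collected.
    return primary, new_version, up_to_date

print(elect_new_primary({"cs1": 6, "cs2": 7, "cs3": 7}))
# ('cs2', 8, ['cs2', 'cs3'])
```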
Secondary Chunk Server Failure
When a secondary chunk server fails, the master assigns the affected chunks to a new chunk server. This triggers a re-replication process (described next), where the new server fetches the current chunk state from other replicas.
After the chunk state is fetched by the new chunk server, the chunk becomes available again for mutations.
Re-replication and Rebalancing
Re-replication is the process of restoring lost replicas and is crucial for maintaining data redundancy. When a new chunk server is assigned responsibility for previously unowned chunks, it retrieves the latest chunk data from existing replicas.
Re-replication is subject to resource constraints:
- Bandwidth limitations: There's a limit on the bandwidth a chunk server can dedicate to re-replication.
- Prioritization: Re-replication prioritizes chunks based on importance:
- Chunks actively requested by clients take precedence, to restore availability quickly.
- Chunks with higher redundancy loss are prioritized over others.

Rebalancing is a related background task: the master periodically examines the replica distribution and migrates replicas to even out disk utilization and load across chunk servers.
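The prioritization described in the list above could look roughly like the following; the scoring scheme and the default replication factor of 3 are assumptions for illustration.

```python
# Illustrative prioritization of chunks waiting for re-replication: chunks
# blocking live client requests come first, then chunks that have lost the
# most replicas. The scoring scheme here is made up for illustration.
from dataclasses import dataclass

REPLICATION_FACTOR = 3  # assumed default

@dataclass
class ChunkState:
    handle: str
    live_replicas: int
    blocking_client: bool  # a client is currently waiting on this chunk

def replication_queue(chunks: list[ChunkState]) -> list[str]:
    def priority(c: ChunkState):
        missing = REPLICATION_FACTOR - c.live_replicas
        return (c.blocking_client, missing)     # higher tuple = more urgent
    return [c.handle for c in sorted(chunks, key=priority, reverse=True)]

queue = replication_queue([
    ChunkState("c1", live_replicas=2, blocking_client=False),
    ChunkState("c2", live_replicas=1, blocking_client=False),
    ChunkState("c3", live_replicas=2, blocking_client=True),
])
print(queue)   # ['c3', 'c2', 'c1']
```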
Stale Replica Removal
Stale replicas arise when a chunk server misses mutations, for instance, because it has been offline for an extended period. This scenario occurs when the master reassigns the chunk to another chunk server and subsequent mutations happen while the original server is still offline.
Consequently, the returning server's replica will lack the current chunk version number. The master leverages this version number to identify and remove these outdated replicas.
Reading a File
Reading a file in GFS is straightforward:
- The client first obtains chunk information, including the latest version number, from the master and caches it.
- Then, using the replica locations, the client selects the closest chunk server for data retrieval. Closeness is determined by IP address proximity, leveraging the cluster's network topology to minimize latency.
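A client-side sketch of this read path follows. The rpc() helper, the method names, and the "shared IP prefix" closeness measure are placeholders, not the real GFS client API.

```python
# Sketch of the client-side read path: fetch and cache chunk locations from
# the master, then read from the closest replica.

CHUNK_SIZE = 64 * 1024 * 1024

def rpc(server: str, method: str, **kwargs):
    """Placeholder for a network call to the master or a chunk server."""
    raise NotImplementedError

chunk_cache: dict[tuple[str, int], dict] = {}   # (path, chunk_index) -> info

def closeness(client_ip: str, server_ip: str) -> int:
    # Crude proximity: number of leading IP octets in common.
    return sum(1 for x, y in zip(client_ip.split("."), server_ip.split(".")) if x == y)

def read(master: str, client_ip: str, path: str, offset: int, length: int) -> bytes:
    key = (path, offset // CHUNK_SIZE)
    if key not in chunk_cache:
        # Assumed to return the chunk handle, latest version and replica addresses.
        chunk_cache[key] = rpc(master, "GetChunkInfo", path=path, index=key[1])
    info = chunk_cache[key]
    replica = max(info["replicas"], key=lambda ip: closeness(client_ip, ip))
    return rpc(replica, "ReadChunk", handle=info["handle"],
               version=info["version"], offset=offset % CHUNK_SIZE, length=length)
```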
Read Failure Scenarios
During read operations, two potential issues can arise:
- Stale Replicas: A replica might be stale, meaning it lacks the most recent mutations.
- Inconsistent Byte Range: Alternatively, it might contain a mutation that is either partially committed or that failed completely, resulting in an inconsistent byte range.
Stale replicas are relatively easy to handle. When the client refreshes metadata from the master, it will obtain the locations of up-to-date replicas and retrieve the data.
However, handling uncommitted mutations is more complex. Readers must employ application-level mechanisms such as checksums to validate byte ranges and detect duplicates (which can occur if a writer retries a failed append-write). Data record formats, such as Record I/O, can automate these checks and handle these situations without requiring manual intervention from the user.
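To illustrate what such application-level validation looks like, here is a self-framing record format with a per-record checksum and a writer-chosen record id for deduplication. This framing is made up for illustration; it is not the actual RecordIO wire format.

```python
# Illustrative record framing for append-only GFS files: each record carries a
# length, a CRC and a record id. Readers skip corrupt byte ranges and drop
# duplicates caused by retried appends.
import struct, zlib

HEADER = struct.Struct("<IIQ")   # payload length, crc32, record id

def encode(record_id: int, payload: bytes) -> bytes:
    return HEADER.pack(len(payload), zlib.crc32(payload), record_id) + payload

def decode(buf: bytes) -> list[bytes]:
    seen, pos, records = set(), 0, []
    while pos + HEADER.size <= len(buf):
        length, crc, rid = HEADER.unpack_from(buf, pos)
        payload = buf[pos + HEADER.size : pos + HEADER.size + length]
        if len(payload) == length and zlib.crc32(payload) == crc:
            if rid not in seen:          # drop duplicates from retried appends
                seen.add(rid)
                records.append(payload)
            pos += HEADER.size + length
        else:
            pos += 1                     # corrupt/partial region: resynchronize
    return records

# A successful append, some garbage, and a retried duplicate append.
data = encode(1, b"alpha") + b"\x00\xff\x00" + encode(2, b"beta") + encode(2, b"beta")
print(decode(data))   # [b'alpha', b'beta']
```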
Writing a File
Writing to a file in GFS is a multi-stage operation. An append-write can target a single chunk or span multiple chunks, with the process repeating for each affected chunk.
Chunk Replica Identification (Steps 1 - 2)
The client first requests the primary and secondary replica locations for the target chunk from the master server. The master, upon receiving this request, checks for an existing primary. If none exists (for instance, when it is a new chunk), it grants a lease to one of the replicas before responding to the client.
Data Upload (Step 3)
The client then uploads the data to all chunk server replicas of the chunk. This step ensures all replicas have the data before the commit phase begins.
Notably, the client uploads the data only once. The data is replicated in a linear chain across the replicas to optimize bandwidth utilization: each chunk server in the chain forwards the data to the next closest server, where closeness is estimated from network topology (likely from IP address prefixes, which allow hop counts to be inferred).
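A small sketch of how such a push chain could be formed follows. The "shared leading IP octets" measure is a stand-in for real topology information, not how GFS actually computes distance.

```python
# Sketch of forming the linear push chain for data upload: each hop forwards
# to the closest server that has not yet received the data.

def shared_prefix(a: str, b: str) -> int:
    return sum(1 for x, y in zip(a.split("."), b.split(".")) if x == y)

def push_chain(client_ip: str, replicas: list[str]) -> list[str]:
    chain, current, remaining = [], client_ip, list(replicas)
    while remaining:
        nxt = max(remaining, key=lambda ip: shared_prefix(current, ip))
        chain.append(nxt)
        remaining.remove(nxt)
        current = nxt
    return chain

# Client in 10.1.2.x pushes to the nearest replica first, then onward.
print(push_chain("10.1.2.5", ["10.2.0.1", "10.1.2.9", "10.1.3.4"]))
# ['10.1.2.9', '10.1.3.4', '10.2.0.1']
```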
Time to Upload
Let's analyze the time required to upload data in GFS, considering factors like bandwidth, latency, and replication.
Assume B bytes are uploaded, T is the available bandwidth in the cluster fabric, L is the latency between any two chunk servers, and R is the replication factor.
Time to establish connections along the replication chain = R*L. Time to transfer the B bytes is B/T. Therefore, the total time is approximated by B/T + R*L.
It's important to note that this calculation is a simplification. It doesn't account for network congestion, which can significantly impact actual bandwidth utilization. In real-world scenarios, the effective bandwidth may be lower than the theoretical maximum.
For example, with a 100 Mbps bandwidth and a 1 ms latency, uploading 1 MB of data would ideally take 80 ms. However, network congestion and other factors could increase this time.
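The estimate above can be reproduced directly from the B/T + R*L formula; the replication factor of 3 is an assumption here, and the R*L term is what the 80 ms figure leaves out.

```python
# Reproducing the transfer-time estimate B/T + R*L from above.

def upload_time(bytes_: int, bandwidth_bps: float, latency_s: float, replicas: int) -> float:
    return bytes_ * 8 / bandwidth_bps + replicas * latency_s

t = upload_time(bytes_=1_000_000, bandwidth_bps=100e6, latency_s=1e-3, replicas=3)
print(f"{t * 1000:.0f} ms")   # ~83 ms: 80 ms of transfer + 3 ms of chained connection setup
```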
LRU Buffer Cache
The received data is temporarily stored in an LRU buffer cache on each replica, not yet committed to the chunk. These cached entries are eventually evicted after a timeout if the mutation is not finalized.
Commit (Step 4 - 7)
The client sends a commit request to the primary replica. The primary assigns a serial number to the mutation and commits the bytes. It is important for the primary to commit the bytes first because the primary must always hold the latest state of the chunk.
Then the primary forwards the request, along with the serial number, to the secondary replicas. Each secondary replica independently applies the mutation to its copy of the chunk and sends a confirmation back to the primary. Once the primary receives acknowledgments from all secondaries, it sends a final acknowledgment back to the client.
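The commit step, seen from the primary's point of view, can be sketched as follows. The classes and calls are illustrative stand-ins, not the real chunk server interfaces.

```python
# Sketch of the commit step: the primary assigns a serial number, applies the
# mutation locally, forwards it to the secondaries, and acknowledges the
# client only if every secondary succeeds.

class SecondaryStub:
    def __init__(self):
        self.data = bytearray()
    def apply(self, serial: int, payload: bytes) -> bool:
        self.data.extend(payload)          # may fail in reality (returning False)
        return True

class Primary:
    def __init__(self, secondaries: list[SecondaryStub]):
        self.data = bytearray()
        self.secondaries = secondaries
        self.next_serial = 1

    def commit(self, payload: bytes) -> bool:
        serial, self.next_serial = self.next_serial, self.next_serial + 1
        self.data.extend(payload)          # primary applies first
        acks = [s.apply(serial, payload) for s in self.secondaries]
        # Any failed secondary leaves that byte range inconsistent; the client
        # is told the append failed and is expected to retry.
        return all(acks)

primary = Primary([SecondaryStub(), SecondaryStub()])
print(primary.commit(b"record-1"))   # True when every replica applied the append
```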
Write Failure Scenarios
What if some secondary replica fails?
If it fails after receiving a mutation request from the primary, the result is an inconsistent state for the byte range. Readers of that byte range may see the appended bytes (if they read from a secondary that applied the append successfully) or garbage. Eventually the failure will be detected by the master, the chunk will be assigned to a new chunk server, and re-replication will be performed. Only then will mutations start succeeding again.
What if the primary replica fails?

This is the scenario covered earlier under primary chunk server failure. Once the primary's lease expires, the master selects a new primary among the up-to-date replicas and increments the chunk version number. The client, after its request times out, refreshes the chunk metadata from the master and retries the mutation against the new primary.
What if the client is interacting with an old primary while a new primary has been assigned?
If a mutation is received by an old primary, it will not be able to commit it on any of the secondary replicas, as the chunk version number would have already been incremented. Eventually, the old primary's replica will be detected as stale and removed.
Miscellaneous Details
Client Library

GFS is not mounted through the operating system's VFS layer. Instead, applications link a client library that implements the file API and talks directly to the master for metadata and to the chunk servers for data. The library caches chunk metadata (locations and version numbers) but never caches file data.
File Snapshots
GFS file snapshots employ a copy-on-write mechanism similar to Linux's fork(2) system call. In fork(2), a child process initially has exactly the same state as the parent. Instead of fully copying all memory pages for the child, the parent's pages are marked write-protected. When either process modifies a shared page, a copy is created and the modification occurs on the copy, effectively diverging the process states.
Similarly, a GFS file snapshot creates a new file that initially shares the existing file's underlying chunks. When the chunks are modified for any of the files, a copy-on-write approach is used.
Here's the detailed process:
- Lease Revocation: Upon receiving a snapshot request, the master revokes all leases on the target file. This prevents any further modifications, akin to write-protecting memory pages.
- Metadata Duplication: The master creates a new file entry, replicating all metadata and chunk information from the original file. For each shared chunk, a reference counter is incremented.
- Copy-on-Write: Any modification to a chunk with a reference count greater than one triggers a copy. The new chunk copy is then leased, its metadata is updated, and its location is provided to the client. To speed up the copy process, the new chunk is typically created on the same chunk server.
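The reference-counting behavior described in the steps above can be modeled with a toy sketch; the data structures are illustrative and, unlike real GFS, no chunk bytes are actually copied here.

```python
# Toy model of snapshot copy-on-write: a snapshot copies the chunk list and
# bumps reference counts; the first mutation to a shared chunk triggers a copy.

class Master:
    def __init__(self):
        self.files: dict[str, list[str]] = {}   # path -> chunk handles
        self.refcount: dict[str, int] = {}      # chunk handle -> #files sharing it
        self._next = 0

    def new_chunk(self) -> str:
        self._next += 1
        handle = f"chunk-{self._next}"
        self.refcount[handle] = 1
        return handle

    def snapshot(self, src: str, dst: str):
        self.files[dst] = list(self.files[src])        # metadata duplication only
        for h in self.files[dst]:
            self.refcount[h] += 1

    def prepare_mutation(self, path: str, index: int) -> str:
        handle = self.files[path][index]
        if self.refcount[handle] > 1:                  # shared: copy on write
            self.refcount[handle] -= 1
            handle = self.new_chunk()                  # real GFS copies the bytes too
            self.files[path][index] = handle
        return handle

m = Master()
m.files["/logs/a"] = [m.new_chunk(), m.new_chunk()]
m.snapshot("/logs/a", "/logs/a.snap")
print(m.prepare_mutation("/logs/a", 0))   # 'chunk-3': a fresh handle for the mutated file
print(m.files["/logs/a.snap"][0])         # 'chunk-1': the snapshot keeps the original
```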
Garbage Collection
GFS employs a soft-delete mechanism for files. When a file is deleted, the master server doesn't immediately remove its data. Instead, it renames the file to a hidden name that records the deletion timestamp, marking it as deleted. This allows for potential recovery of accidentally deleted files.
Subsequently, a background scan by the master identifies and removes the metadata associated with these deleted files once a grace period has passed. Chunk servers report the chunks they hold in their regular heartbeat messages; the master replies with the chunks that no longer appear in its metadata, prompting the chunk servers to delete them.
While this approach might seem slow, it offers several advantages:
- Eventual Consistency: This method ensures eventual consistency in a straightforward manner. Periodic scanning is a common technique in distributed systems for synchronizing state across components, ensuring that all states eventually converge. It avoids the complexity of tracking real-time diffs, which can be unreliable when the components involved are external systems accessed directly.
- Accidental Deletion Recovery: The inherent delay in garbage collection provides a window for rolling back accidental file deletions.
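The soft-delete flow described above can be sketched as follows; the hidden-name encoding and the three-day grace period are illustrative assumptions.

```python
# Sketch of soft delete and garbage collection: deletion renames the file to a
# hidden name carrying the deletion time; a background scan later drops the
# metadata, and chunks referenced by no file become eligible for deletion on
# the chunk servers.
import time

GRACE_PERIOD = 3 * 24 * 3600   # seconds (assumed)

class MasterGC:
    def __init__(self):
        self.files: dict[str, list[str]] = {}   # path -> chunk handles

    def delete(self, path: str):
        hidden = f"{path}.deleted.{int(time.time())}"
        self.files[hidden] = self.files.pop(path)       # soft delete: just a rename

    def background_scan(self, now: float) -> set[str]:
        # Drop hidden files past the grace period, then compute the live chunk set.
        for path in list(self.files):
            if ".deleted." in path and now - int(path.rsplit(".", 1)[1]) > GRACE_PERIOD:
                del self.files[path]
        live = {h for handles in self.files.values() for h in handles}
        return live        # chunk servers may delete any chunk not in this set

gc = MasterGC()
gc.files["/tmp/old"] = ["chunk-1", "chunk-2"]
gc.files["/keep"] = ["chunk-3"]
gc.delete("/tmp/old")
print(gc.background_scan(now=time.time() + GRACE_PERIOD + 1))   # {'chunk-3'}
```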
Checksum Verification

Each chunk server verifies the checksum of every 64KB sub-chunk it serves before returning the data to a reader. On a mismatch, the chunk server returns an error to the requester and reports the corruption to the master; the master re-replicates the chunk from a healthy replica, after which the corrupted replica is deleted. Chunk servers also scan and verify rarely accessed chunks in the background during idle periods, so corruption in cold data does not go unnoticed.
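A minimal sketch of per-sub-chunk verification on the read path follows; the storage layout and error handling are illustrative, not the chunk server's actual implementation.

```python
# Sketch of sub-chunk checksum verification on read: each 64 KB block of a
# chunk has its own checksum, verified before data is returned.
import zlib

SUB_CHUNK_SIZE = 64 * 1024

class StoredChunk:
    def __init__(self, data: bytes):
        self.data = data
        self.checksums = [zlib.crc32(data[i:i + SUB_CHUNK_SIZE])
                          for i in range(0, len(data), SUB_CHUNK_SIZE)]

    def read(self, offset: int, length: int) -> bytes:
        first = offset // SUB_CHUNK_SIZE
        last = (offset + length - 1) // SUB_CHUNK_SIZE
        for i in range(first, last + 1):           # verify every covered block
            block = self.data[i * SUB_CHUNK_SIZE:(i + 1) * SUB_CHUNK_SIZE]
            if zlib.crc32(block) != self.checksums[i]:
                # A real chunk server reports the mismatch to the master, which
                # re-replicates from a healthy replica and discards this one.
                raise IOError(f"checksum mismatch in sub-chunk {i}")
        return self.data[offset:offset + length]

chunk = StoredChunk(b"x" * (3 * SUB_CHUNK_SIZE))
print(len(chunk.read(offset=70_000, length=100_000)))   # 100000
```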
Paper Review
The GFS paper is a comprehensive exploration of key distributed systems concepts, and its impact is undeniable, boasting over 10,000 citations. This system stands as one of Google's most significant engineering accomplishments.
Despite its depth, the paper is simple and provides a fascinating insight into distributed file system design. For anyone passionate about distributed systems, I highly recommend reading it. It's a staple of distributed systems literature and a frequent topic of discussion in paper reading groups.