After going through a bunch of file systems in the previous papers, it is time to go through a distributed version of them. The Google File System paper was authored by Sanjay Ghemawat et al. Sanjay is one of the earliest and most well-known scientists at Google. As of 2025, he serves as a Google Fellow and is one of the most renowned engineers at Google. It is hard to imagine Google without Sanjay: he has been a core contributor to several core libraries and systems at Google, including MapReduce, the Google File System, and more.
The paper was written back in 2003 and presented at the prestigious SOSP. GFS has since evolved into its successor, Colossus, and I will use the new name throughout the article. Colossus is one of the most important systems at Google, supporting much of the rest of Google's infrastructure. Needless to say, the architecture has greatly evolved over the years and has become very sophisticated.
At the same time, it is a fairly simple large-scale distributed system that makes a lot of simplifying assumptions to keep things fast.
Paper Link
There aren't many basic concepts to cover for this article. If you would like to understand file systems in general, refer to my previous posts.
Immutability at Work
We previously touched upon immutability when discussing the log-structured file system. Colossus is built on similar principles. In particular, Colossus as a file system only supports:
- Append-only files: Once bytes have been written, they cannot be modified. Bytes can only be appended to non-frozen files.
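To make the append-only contract concrete, here is a minimal, purely illustrative Python sketch (the class and method names are my own, not anything from Colossus): existing bytes can never be overwritten, and appends are rejected once a file is frozen.

```python
class AppendOnlyFile:
    """Toy model of an append-only file contract (illustrative only)."""

    def __init__(self):
        self._data = bytearray()
        self._frozen = False

    def append(self, payload: bytes) -> int:
        """Append bytes to the tail of the file; returns the offset they landed at."""
        if self._frozen:
            raise PermissionError("file is frozen; no further appends allowed")
        offset = len(self._data)
        self._data.extend(payload)
        return offset

    def read(self, offset: int, length: int) -> bytes:
        """Already-written bytes are stable, so reads never race with overwrites."""
        return bytes(self._data[offset:offset + length])

    def freeze(self) -> None:
        """Once frozen, the file becomes fully immutable."""
        self._frozen = True


f = AppendOnlyFile()
off = f.append(b"log entry 1\n")   # ok: appends always go to the tail
f.freeze()
# f.append(b"more")                # would raise: frozen files reject appends
print(f.read(off, 12))             # existing bytes never change
```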
Network File System
Network File System vs. a Distributed File System
Single Store File System vs. a Distributed File System
Key points to note:
- Colossus is an append-only file system.
- Global setup.
- Per cluster setup.
- For any given piece of metadata, there is at most one curator at a time. The curator layer (unlike the GFS master) is sharded. As a result, there is some loss of availability in order to maintain consistency.
- By accepting this availability loss (and minimizing it as much as possible), the system is simplified a lot. Other systems, such as ZooKeeper and Spanner, take on far more complexity to avoid it.
- If any chunk server fails, it doesn't matter: the write simply fails. Each chunk server may temporarily be out of sync, but the primary decides the write position, and the master eventually brings everyone back in sync.
- If the primary chunk server fails, the master eventually detects it and assigns a new primary for its chunks. The new primary will be the replica that has committed the most recent updates for the chunk. The global mutation order for each chunk is the lease-grant order (which picks the primary) followed by the serial numbers assigned by that primary. The replica that is ahead of all others becomes the new primary.
What if a primary becomes unreachable from the master but keeps acting as the primary? It doesn't matter. Clients always fetch the current primary information from the master, and there is one and only one master at a time. Thus, a client will always contact the latest primary known to the master.
What if a client is interacting with an old primary?
It doesn't matter. Any new primary is either up to date with the latest known point, so it knows about all updates sent through the old primary, or it has missed some latest update, in which case the write will fail because the new primary won't recognize it.
Remember, all replicas need to commit before a write is reported successful. So it doesn't matter if, by mistake, a replica is selected that is not caught up to the latest offset known to the last primary (which may still be alive). Any partial writes that happened may be visible to readers (an inconsistency), but the write itself would have failed, so this is only a partial, not total, compromise on consistency (see the sketch below).
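A minimal sketch of this commit rule, using a toy Primary/Replica model of my own (these names and classes are not from the paper): the primary assigns a serial number to each mutation, and the write is acknowledged only if every replica applies it.

```python
from dataclasses import dataclass, field

@dataclass
class Replica:
    """Toy replica: applies mutations strictly in serial-number order."""
    applied: list = field(default_factory=list)
    alive: bool = True

    def apply(self, serial: int, data: bytes) -> bool:
        if not self.alive or serial != len(self.applied):
            return False               # dead or out-of-order -> reject
        self.applied.append(data)
        return True

@dataclass
class Primary:
    """Toy primary: holds the lease and decides the mutation order for its chunk."""
    replicas: list
    next_serial: int = 0

    def write(self, data: bytes) -> bool:
        serial = self.next_serial
        acks = [r.apply(serial, data) for r in self.replicas]
        if all(acks):
            self.next_serial += 1
            return True                # success only if *every* replica committed
        return False                   # any failure -> the whole write fails

replicas = [Replica(), Replica(), Replica()]
primary = Primary(replicas)
assert primary.write(b"record-1")        # all replicas ack -> success
replicas[2].alive = False
assert not primary.write(b"record-2")    # one replica down -> write fails
# Note: replicas[0] and replicas[1] now hold a partial "record-2" that a reader
# might see, even though the writer was told the write failed.
```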
Colossus
Note that the architecture of Colossus has evolved over the years; in this article, I will describe its latest architecture.
In any file system, we need to store the file chunks (the raw bytes of the files) and the file metadata (file attributes, chunk locations, etc.). In a non-distributed file system, the file metadata is stored in data structures called inodes.
Colossus takes a similar approach, but in a distributed manner. There are two main components in Colossus:
1. Chunk servers - storing the file chunks; also known as D servers (D = disk).
2. Curators - storing the file metadata. Under the hood, curators store this metadata in Bigtable.
Note: those familiar with Bigtable's design may notice a circularity here: Bigtable itself stores its data in SSTable format on GFS/Colossus.
Client Library
All local file systems on Linux are expected to implement the VFS interface. That is not the case with Colossus. Instead, Colossus provides a client library that applications link against to interact with the file system. The File object returned by the library is what applications use to read and append data.
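A hypothetical sketch of what such a client library could look like; the class and method names below are my invention for illustration, not the real Colossus API.

```python
class ColossusClient:
    """Hypothetical client-library facade; not the real Colossus API."""

    def __init__(self, curator_address: str):
        self.curator_address = curator_address   # metadata lookups would go to a curator

    def open(self, path: str, mode: str = "r") -> "ColossusFile":
        # In reality this would contact a curator to resolve metadata;
        # here we just hand back a stub File object.
        return ColossusFile(self, path, mode)


class ColossusFile:
    """Stub for the File object returned by the client library."""

    def __init__(self, client: ColossusClient, path: str, mode: str):
        self.client, self.path, self.mode = client, path, mode

    def read(self, offset: int, length: int) -> bytes:
        # Would: ask the curator for chunk locations, then fetch from D servers.
        raise NotImplementedError("illustrative stub")

    def append(self, data: bytes) -> int:
        # Would: route the record through the chunk's current primary.
        raise NotImplementedError("illustrative stub")


# Typical usage an application might see:
client = ColossusClient("curator.example:1234")
f = client.open("/namespace-a/logs/events", mode="a")
```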
File Structure
The files are divided into chunks. Each chunk has a fixed size - 64 MB. The chunks are replicated and each replica of a chunk is stored on a different chunk server for fault tolerance.
The D servers themselves store the chunks on a regular Linux file system; each chunk in Colossus is likely just a file on that underlying Linux file system.
Choice of a replication factor?
Tradeoff: a larger replication factor helps manage read hotspots, but every write then needs to be replicated to that many more places.
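A back-of-the-envelope way to see the tradeoff; the throughput number below is purely an assumption for illustration, not from the paper.

```python
# Illustrative numbers only.
CHUNK_SIZE_MB = 64
READ_THROUGHPUT_PER_REPLICA_MBPS = 100   # assumed per-replica serving capacity

for r in (3, 5, 9):
    read_capacity = r * READ_THROUGHPUT_PER_REPLICA_MBPS  # hotspots spread over r replicas
    write_traffic = r * CHUNK_SIZE_MB                      # every chunk write lands r times
    print(f"R={r}: ~{read_capacity} MB/s aggregate reads, {write_traffic} MB written per chunk")
```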
File Metadata
The file metadata is stored on the curator. The curator stores the file names (and their metadata) in Bigtable.
Briefly, Bigtable is a NoSQL store holding key-value pairs. The path of each file on Colossus is the key, and the value is its metadata.
The file metadata consists of the following:
1. Namespace - Colossus is divided into different namespaces. This field determines the namespace to which a file belongs.
2. Chunks - The chunks corresponding to the file and the addresses of the replicas storing those chunks. As the chunks are replicated, each chunk is stored on multiple D servers.
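A sketch of how the metadata row for a single file might look if you imagine the Bigtable row keyed by the file path. The field names are my guess at the general shape, not the actual schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ChunkInfo:
    chunk_id: str            # a unique handle for the chunk
    replicas: List[str]      # addresses of the D servers holding this chunk

@dataclass
class FileMetadata:
    namespace: str           # which Colossus namespace the file belongs to
    chunks: List[ChunkInfo]  # ordered list of chunks making up the file

# Conceptually, one Bigtable row: key = full file path, value = metadata.
metadata_table = {
    "/namespace-a/logs/events-00001": FileMetadata(
        namespace="namespace-a",
        chunks=[
            ChunkInfo("chunk-0001", ["d-server-17", "d-server-42", "d-server-91"]),
            ChunkInfo("chunk-0002", ["d-server-03", "d-server-42", "d-server-88"]),
        ],
    ),
}
```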
Reading a File
Let's take a step-by-step look at what happens when we read a file:
- The curator is looked up to find the file and the relevant chunk indexes. A chunk index can be computed from the byte range the client is trying to read (offset divided by the chunk size).
- The chunks are fetched from the D servers.
Note that to further optimize the read path, the client may ask for multiple chunks in the same request.
Since chunk locations are found from a different server altogether (involving a network call), it makes sense to have a larger chunk size, which reduces the number of calls that need to be made to the curators.
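Putting the read path together, here is a rough sketch assuming the 64 MB chunk size and the toy metadata layout above. The RPC stubs (`lookup_chunk_locations`, `fetch_from_d_server`) are hypothetical stand-ins for the curator and D-server calls.

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB fixed chunk size

# --- hypothetical RPC stubs (stand-ins for curator / D-server calls) ---
def lookup_chunk_locations(path, indexes):
    """Pretend curator call: returns replica addresses per chunk index."""
    return {i: [f"d-server-{i % 3}"] for i in indexes}

def fetch_from_d_server(replica, path, index):
    """Pretend D-server call: returns that chunk's bytes (zeros here)."""
    return bytes(CHUNK_SIZE)

# --- client-side read path ---
def chunk_indexes(offset, length):
    """Which chunk indexes does the byte range [offset, offset + length) span?"""
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    return range(first, last + 1)

def read(path, offset, length):
    indexes = list(chunk_indexes(offset, length))
    # One curator request can resolve locations for all needed chunks at once.
    locations = lookup_chunk_locations(path, indexes)
    data = bytearray()
    for i in indexes:
        replica = locations[i][0]                    # pick any replica holding the chunk
        data += fetch_from_d_server(replica, path, i)
    start = offset - indexes[0] * CHUNK_SIZE
    return bytes(data[start:start + length])

print(len(read("/namespace-a/logs/events-00001", offset=100 * 1024 * 1024, length=10)))  # -> 10
```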
As simple as reading a file is, writing to one is a fairly involved process. Before introducing writes, we need to understand a few more mechanisms in detail.
Consistency Model
There are some guarantees provided by Colossus:
- All file metadata operations are atomic.
-
Data mutations
As described before, written data in Colossus is immutable. There are two mutation modes available - write and append.
Leases and Mutation Order
For each chunk, one of the replicas is designated as the primary. The primary for a chunk is granted a lease by the curator. All mutations to a chunk go through the primary, and the primary is responsible for determining a serial order for those mutations.
Leases are imperfect, since they depend on time bounds and distributed systems have no global notion of time.
The lease timeout for a chunk is set to 60 seconds. The primary can request extensions, and these requests and grants are piggybacked on heartbeat messages.
It is possible that the primary assumes its lease has not expired even when the lease has actually expired on the curator. This can happen when clocks drift apart.
The curator may then assign the lease to another replica.
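A rough sketch of the lease bookkeeping on the curator side, assuming only the 60-second timeout from the paper; the rest of the structure is my simplification, and real curators do far more than this. The point is that a new primary is never chosen while another replica may still believe it holds a live lease.

```python
import time

LEASE_SECONDS = 60   # lease timeout from the GFS paper

class CuratorLeaseTable:
    """Toy lease bookkeeping for chunks (illustrative only)."""

    def __init__(self):
        self._leases = {}   # chunk_id -> (primary_replica, expiry_timestamp)

    def grant_or_extend(self, chunk_id, replica, now=None):
        now = time.monotonic() if now is None else now
        current = self._leases.get(chunk_id)
        if current is not None:
            holder, expiry = current
            if holder != replica and now < expiry:
                return False   # someone else still holds a live lease
        # Either no lease, an expired lease, or an extension by the current holder.
        self._leases[chunk_id] = (replica, now + LEASE_SECONDS)
        return True

table = CuratorLeaseTable()
assert table.grant_or_extend("chunk-0001", "d-server-17", now=0.0)       # initial grant
assert table.grant_or_extend("chunk-0001", "d-server-17", now=30.0)      # extension via heartbeat
assert not table.grant_or_extend("chunk-0001", "d-server-42", now=45.0)  # old lease still live
assert table.grant_or_extend("chunk-0001", "d-server-42", now=120.0)     # expired; reassign primary
```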
Evaluation
I will skip the evaluation section of this paper, since the paper is quite old and the numbers no longer hold true. Colossus has evolved greatly since then, and I doubt any of the numbers discussed in the paper are still meaningful.
Paper Review
The paper is quite dense and discusses a lot of ideas. It is one of the landmark papers in distributed systems, with over 10k citations. The system has been one of Google's most marvelous achievements. "Colossus" literally refers to "an enormous statue that was supposed to have guarded the ancient Greek island and city of Rhodes". That's what Colossus is for Google.
Despite its density, the paper is pretty easy to follow and very interesting if you care about file systems. I would highly recommend it to all distributed systems enthusiasts. Needless to say, this paper has been widely discussed in almost every distributed systems paper reading group.