
Paper Insights #8 - Google File System

After going through a number of file systems in the previous papers, it is time to look at a distributed one. The Google File System paper was authored by Sanjay Ghemawat et al. Sanjay is one of Google's earliest and best-known scientists. As of 2025, he serves as a Google Fellow and is one of the most renowned engineers at Google. It is hard to imagine Google without Sanjay; he has been a core contributor to several core libraries and systems at Google, including MapReduce and the Google File System.

The paper was written back in 2003 and presented at the prestigious SOSP. Since then, GFS has been renamed to Colossus, and I will use the new name throughout the article. GFS is one of the most important systems at Google, supporting all of Google's other infrastructure; its importance is hard to overstate. Needless to say, the architecture has evolved greatly over the years and has become quite sophisticated.


At the same time, it is a fairly simple large-scale distributed system that makes many simplifying assumptions to keep things fast.

Paper Link

There aren't many basic concepts to cover for this article. If you would like to understand file systems in general, refer to my previous posts.

Immutability at Work

We previously touched upon immutability when discussing log-structured file systems. Colossus is built on similar principles. In particular, Colossus as a file system only supports:

- Append-only files: Once bytes have been written, they cannot be modified. Bytes can only be appended to non-frozen files.
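To make the append-only contract concrete, here is a minimal sketch in Python. The class and method names are mine, not Colossus's (its client API is internal); the point is just that bytes get an offset once, can always be read, and can never be rewritten:

```python
class AppendOnlyFile:
    """Toy model of an append-only file: bytes can be appended until the
    file is frozen, but existing bytes can never be modified in place."""

    def __init__(self):
        self._data = bytearray()
        self._frozen = False

    def append(self, payload: bytes) -> int:
        """Append bytes and return the offset where they landed."""
        if self._frozen:
            raise PermissionError("cannot append to a frozen file")
        offset = len(self._data)
        self._data.extend(payload)
        return offset

    def freeze(self) -> None:
        """Seal the file; no further appends are allowed."""
        self._frozen = True

    def read(self, offset: int, length: int) -> bytes:
        """Reads are always allowed, even on frozen files."""
        return bytes(self._data[offset:offset + length])


f = AppendOnlyFile()
off = f.append(b"hello ")     # off == 0
f.append(b"world")
f.freeze()
print(f.read(off, 11))        # b'hello world'
```

Note there is no `write(offset, data)` method at all: immutability is enforced by the shape of the API, not by runtime checks on overlapping ranges.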

Network File System

Network File System vs. a Distributed File System

Single Store File System vs. a Distributed File System


Need to write the following:

- Mention that it is an append only file system.
- Global setup.
- Per cluster setup.
- For any given shard of metadata, there is at most one curator at a time. The curator (unlike the GFS master) is sharded. As a result, there is some availability loss in order to maintain consistency.
- By accepting this availability loss (and minimizing it as much as possible), the system is greatly simplified. Alternatives built on systems like ZooKeeper or Spanner could avoid the loss, but would complicate the system considerably.

- If any chunk server fails, it doesn't matter - the write simply fails. Each chunk server may temporarily be out of sync, but the primary decides the write position, and eventually the master brings everyone back in sync.

- If the primary chunk server fails, the master eventually detects it and assigns a new primary for its chunks. The new primary is the replica that has committed the most recent updates for the chunk. The global mutation order for each chunk is: lease grant order (for the primary) + serial numbers assigned by the primary. The replica that is ahead of all others becomes the new primary.
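The global order described above can be modeled as a lexicographic comparison on (lease grant number, serial number) pairs: the replica whose last committed pair is highest is the most up to date. A hedged sketch (names and numbers are mine, not GFS's):

```python
# Each mutation is stamped (lease_epoch, serial): lease_epoch increases
# every time a new lease is granted, and serial is assigned by the primary
# within its lease. Python compares tuples lexicographically, which matches
# the global mutation order: lease grant order first, then serial number.

def most_up_to_date(replicas: dict[str, tuple[int, int]]) -> str:
    """Pick the replica whose last committed mutation stamp is highest."""
    return max(replicas, key=lambda name: replicas[name])

replicas = {
    "chunkserver-a": (4, 17),   # lease epoch 4, serial 17
    "chunkserver-b": (4, 19),   # saw two more mutations in epoch 4
    "chunkserver-c": (3, 42),   # stale: stuck in an older lease epoch
}
print(most_up_to_date(replicas))   # chunkserver-b
```

Note that chunkserver-c has the largest serial number, but it loses to any replica from a newer lease epoch - serial numbers are only comparable within one lease.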

What if a primary becomes unreachable from the master but keeps behaving as the primary? It doesn't matter. Clients always fetch the current primary information from the master, and there is one and only one master at a time. Thus, a client will always contact the latest primary known to the master.

What if a client is interacting with an old primary?

It doesn't matter. Any new primary is either up to date with the latest known point, so it knows about all updates sent through the old primary; or it has not heard about some latest update, in which case the write fails because the new primary won't recognize it.

Remember that all replicas need to commit before a write is successful. So it doesn't matter if, by mistake, a replica is selected that is not caught up to the latest head known to the last primary (which may still be alive). Any partial writes that happened may be visible to readers (an inconsistency in the file's regions), but the write itself would have reported failure, so acknowledged writes are not compromised.
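The "all replicas must commit" rule from the notes above can be sketched as follows. This is a deliberate simplification (the real protocol also separates the data flow from the control flow, and retries via the master); all names here are invented:

```python
class ChunkReplica:
    """Toy chunk replica that may be unreachable."""

    def __init__(self, name: str, alive: bool = True):
        self.name = name
        self.alive = alive
        self.mutations: list[tuple[int, bytes]] = []

    def apply(self, serial: int, data: bytes) -> bool:
        if not self.alive:
            return False
        self.mutations.append((serial, data))
        return True


def replicated_write(primary: ChunkReplica,
                     secondaries: list[ChunkReplica],
                     data: bytes, serial: int) -> bool:
    """A write succeeds only if *every* replica commits it. If any chunk
    server is down, the write as a whole simply fails; replicas may be
    left temporarily out of sync until the master re-syncs them."""
    acks = [primary.apply(serial, data)]
    acks += [s.apply(serial, data) for s in secondaries]
    return all(acks)


p = ChunkReplica("primary")
s1, s2 = ChunkReplica("s1"), ChunkReplica("s2", alive=False)
ok = replicated_write(p, [s1, s2], b"record", serial=1)
print(ok)   # False: s2 was down, so the write fails as a whole
```

Notice that `p` and `s1` did apply the mutation even though the write failed - exactly the "partial writes may be visible to readers" situation described above.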



Colossus

Note that the architecture of Colossus has evolved over the years; in this article, I will describe the latest architecture.

In any file system, we need to store the file chunks (the raw bytes of the files) and the file metadata (file attributes, chunk locations, etc.). In a non-distributed file system, the file metadata is stored in data structures called inodes.

Colossus takes a similar approach, but in a distributed manner. There are two main components in Colossus:

1. Chunk servers - storing the file chunks; also known as D servers (D = disk).

2. Curators - storing the file metadata. Under the hood, curators store this metadata in Bigtable.

Note: those familiar with Bigtable's design may notice a circularity here - Bigtable itself stores its NoSQL data in SSTable format on GFS.

Client Library

All local file systems on Linux are expected to implement the VFS interface. That is not the case with Colossus. Instead, Colossus provides a client library that applications link against in order to interact with the file system.

File Structure

Files are divided into chunks. Each chunk has a fixed size of 64 MB. Chunks are replicated, and each replica of a chunk is stored on a different chunk server for fault tolerance.
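With a fixed 64 MB chunk size, mapping a byte offset to a chunk index is simple integer arithmetic. A small sketch (the helper names are mine):

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB, fixed in GFS

def chunk_index(offset: int) -> int:
    """Index of the chunk containing the byte at `offset`."""
    return offset // CHUNK_SIZE

def chunk_range(start: int, end: int) -> range:
    """Indices of all chunks overlapping the byte range [start, end)."""
    return range(chunk_index(start), chunk_index(end - 1) + 1)

print(chunk_index(0))                            # 0
print(chunk_index(64 * 1024 * 1024))             # 1: first byte of chunk 1
print(list(chunk_range(0, 200 * 1024 * 1024)))   # [0, 1, 2, 3]
```

A 200 MB read therefore touches exactly four chunks, and the client knows which ones before contacting any server holding data.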

The D servers themselves store the chunks on a regular Linux file system; presumably, each chunk in Colossus is itself a file on that underlying Linux file system.

Choice of replication factor?

Tradeoff: a larger replication factor helps manage read hotspots, but every write then needs to be replicated to that many more places.

File Metadata

The file metadata is stored on the curator. The curator stores it in Bigtable.

Briefly, Bigtable is a NoSQL store holding key-value pairs. The path of each file on Colossus is the key, and the value is its metadata.

The file metadata consists of the following:

1. Namespace - Colossus is divided into different namespaces. This field determines the namespace to which a file belongs.
2. Chunks - The chunks corresponding to the file and the addresses of the replicas storing those chunks. As the chunks are replicated, each chunk will belong to multiple D servers.
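The two fields above could be modeled roughly like this. The schema is illustrative only - Google's actual Bigtable row layout for curators is not public - but it captures the path-keyed lookup described above:

```python
from dataclasses import dataclass

@dataclass
class ChunkInfo:
    handle: str            # unique id of the chunk
    replicas: list[str]    # addresses of the D servers holding a copy

@dataclass
class FileMetadata:
    namespace: str         # which Colossus namespace the file belongs to
    chunks: list[ChunkInfo]

# The curator conceptually stores rows keyed by file path:
curator_table = {
    "/ns1/logs/app.log": FileMetadata(
        namespace="ns1",
        chunks=[
            ChunkInfo("chunk-0001", ["d-3", "d-7", "d-12"]),
            ChunkInfo("chunk-0002", ["d-1", "d-7", "d-9"]),
        ],
    ),
}

meta = curator_table["/ns1/logs/app.log"]
print(meta.chunks[0].replicas)   # ['d-3', 'd-7', 'd-12']
```

Because each chunk is replicated (here with factor 3), every `ChunkInfo` lists multiple D servers, matching point 2 above.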

Reading a File

Let's take a step-by-step look at what happens when we read a file:

- The curator is looked up to find the file and the chunk indices. The chunk indices can be computed from the byte range that the client is trying to read.
- The chunks are fetched from the D servers.

Note that to further optimize the read path, the client may ask for multiple chunks in the same request.

Since chunk locations are fetched from a different server altogether (involving a network call), it makes sense to have a larger chunk size, which reduces the number of calls made to the curators.
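Putting the two steps together, the read path can be sketched as below. Everything here is invented scaffolding (the real client library API is internal), and the chunk size is shrunk to 8 bytes so the example is easy to follow:

```python
CHUNK = 8  # toy chunk size; Colossus uses 64 MB

class Meta:
    def __init__(self, chunks): self.chunks = chunks

class Chunk:
    def __init__(self, handle, replicas):
        self.handle, self.replicas = handle, replicas

class Curator:
    def __init__(self, table): self.table = table
    def lookup(self, path): return self.table[path]

class DServer:
    def __init__(self, store): self.store = store
    def read_chunk(self, handle): return self.store[handle]

def read(path, offset, length, curator, d_servers):
    meta = curator.lookup(path)                 # step 1: one metadata RPC
    first, last = offset // CHUNK, (offset + length - 1) // CHUNK
    data = bytearray()
    for idx in range(first, last + 1):          # step 2: fetch chunks
        c = meta.chunks[idx]                    # (could be batched)
        data += d_servers[c.replicas[0]].read_chunk(c.handle)
    start = offset - first * CHUNK
    return bytes(data[start:start + length])

# Toy cluster: one file made of two 8-byte chunks on two D servers.
d = {"d1": DServer({"c0": b"01234567"}), "d2": DServer({"c1": b"89abcdef"})}
cur = Curator({"/f": Meta([Chunk("c0", ["d1"]), Chunk("c1", ["d2"])])})
print(read("/f", 6, 4, cur, d))   # b'6789' - the read spans both chunks
```

The key property is that file data never flows through the curator: after the single metadata lookup, the client talks to the D servers directly.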

As simple as reading a file is, writing to one is a fairly involved process. Before introducing writes, we need to understand some mechanisms in more detail.

Consistency Model

There are some guarantees provided by Colossus:

- All file metadata operations are atomic.

Data mutations

As described before, data in Colossus is immutable once written. The paper describes two mutation modes: write, at an application-specified offset, and record append, where the system chooses the offset and returns it to the client.
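A minimal sketch of the difference between the two modes (invented API; the GFS paper's write mode allows an application-specified offset, while record append lets the primary pick the offset atomically - which is why modern, append-only Colossus favors the latter):

```python
class ToyFile:
    def __init__(self):
        self.data = bytearray()

    def write(self, offset: int, payload: bytes) -> None:
        """Write mode: the application chooses the offset. Concurrent
        writers to the same region can interleave unpredictably."""
        end = offset + len(payload)
        if end > len(self.data):
            self.data.extend(b"\x00" * (end - len(self.data)))
        self.data[offset:end] = payload

    def record_append(self, payload: bytes) -> int:
        """Record append: the system chooses the offset atomically and
        returns it, so concurrent appenders never clobber each other."""
        offset = len(self.data)
        self.data.extend(payload)
        return offset


f = ToyFile()
a = f.record_append(b"rec1")    # a == 0: system picked the offset
b = f.record_append(b"rec2")    # b == 4: next record lands right after
print(a, b, bytes(f.data))      # 0 4 b'rec1rec2'
```

With record append, producers don't need to coordinate on offsets at all - they just remember the offset the system handed back, which is ideal for multi-writer logs.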

Leases and Mutation Order

For each chunk, one of its replicas is designated the primary. The primary for a chunk is granted a lease by the curator. All mutations to a chunk go through the primary, and the primary is responsible for determining the order in which mutations are applied.

Leases are inherently imperfect: they depend on time bounds, while distributed systems have no shared notion of time.

The lease timeout for a chunk is set to 60 seconds. The primary can request extensions, and these requests and grants are piggybacked on the heartbeat messages exchanged with the curator.

It is possible that the primary assumes its lease has not expired even when it has actually expired at the curator. This can happen when clocks drift apart.

The curator may then assign the lease to another replica.
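One standard way to reason about this safely - my sketch, not the paper's exact mechanism - is for the curator to wait out the lease duration padded by a worst-case clock drift bound before granting the lease elsewhere, so a drifting primary cannot still believe it holds the lease:

```python
LEASE_SECONDS = 60.0   # lease duration used by GFS
MAX_DRIFT = 1.5        # assumed bound on clock drift (illustrative value)

def curator_may_regrant(granted_at: float, now: float) -> bool:
    """The curator treats the old lease as dead only after the lease
    duration plus the worst-case drift has elapsed on its own clock."""
    return now - granted_at >= LEASE_SECONDS + MAX_DRIFT

print(curator_may_regrant(0.0, 60.5))   # False: old primary might still act
print(curator_may_regrant(0.0, 62.0))   # True: safe to grant a new lease
```

The cost of this safety margin is availability: for a short window after a primary dies, no replica holds the lease and mutations to that chunk stall.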

Evaluation

I will skip the evaluation section of this paper; the paper is quite old, and the numbers no longer hold. Colossus has evolved greatly, and I doubt that any of the numbers discussed in the paper are still meaningful.


Paper Review

The paper is quite dense and discusses a lot of ideas. It is one of the landmark papers in distributed systems, with over 10k citations. The system is one of Google's most marvelous achievements. "Colossus" literally means "an enormous statue that was supposed to have guarded the ancient Greek island and city of Rhodes". That is what Colossus is for Google.

The paper is pretty easy to follow and very interesting if you care about file systems. I would highly recommend it to all distributed systems enthusiasts. Needless to say, it has been widely discussed in almost every distributed systems paper reading group.
