
Paper Insights #17 - The Eternal Tussle: Exploring the Role of Centralization in IPFS

I'd like to delve into a technology I have research experience with: the InterPlanetary File System (IPFS). While the original IPFS white paper - IPFS - Content Addressed, Versioned, P2P File System - was influential, it is not peer-reviewed. Therefore, I'll focus on a related paper presented at the NSDI '24 conference, a prestigious venue in distributed systems research. This paper was also discussed in Stanford's CS244b class.

Paper Link

For background, I recommend reading Dynamo: Amazon's Highly Available Key-value Store, where I discuss consistent hashing and its techniques.

Now, let's explore some fundamental concepts of IPFS.

Decentralized Web

Traditional websites are like 1-800 numbers: the website owner alone bears the costs of hosting and computation.

In contrast, the decentralized web leverages a distributed network of nodes. The computational burden, and therefore the associated costs, can be shared among numerous participants, regardless of the website's traffic load.

The decentralized web is popularly called Web3.

Decentralized File System

A decentralized file system represents each file block by its cryptographic hash. This hash uniquely identifies the block's content. Subsequently, these hashes are mapped to locations on a decentralized network of servers.

InterPlanetary File System (IPFS)

IPFS is a decentralized file system that:

  1. Divides files into blocks.
  2. Generates a unique Content Identifier (CID) for each block using cryptographic hashing.
  3. Stores the mapping of CIDs to their locations on a Distributed Hash Table (DHT).
Note: IPFS also supports directories. The directory contents (e.g., list of files) are hashed and stored with a unique CID on IPFS.
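To make steps 1 and 2 concrete, here is a minimal Python sketch that splits a file into fixed-size blocks and derives a toy content identifier from each block's SHA-256 hash. Real IPFS CIDs additionally encode a multihash, a codec, and a version; the names (make_cid, chunk_file) and the fixed chunk size here are illustrative assumptions.

```python
import hashlib

BLOCK_SIZE = 256 * 1024  # 256 KiB, a common IPFS chunk size

def make_cid(block: bytes) -> str:
    # Toy CID: just the hex SHA-256 digest of the block.
    # Real CIDs wrap the digest in a multihash plus codec/version metadata.
    return hashlib.sha256(block).hexdigest()

def chunk_file(path: str):
    """Yield (cid, block) pairs for each fixed-size block of the file."""
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            yield make_cid(block), block

# Usage: build the CID -> block mapping for a local file.
# blocks = dict(chunk_file("example.bin"))
```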



In the diagram above, clients first retrieve the directory content from IPFS, which includes the CIDs of the desired files. They then fetch the complete files by looking up these CIDs. The resulting graph of directories and blocks is called a Merkle Directed Acyclic Graph (Merkle DAG).

IPFS DHT Entries

Two key entries within the IPFS DHT are:

  • Provider Records: Map CIDs to the PeerIDs of nodes that possess the corresponding blocks.
  • Peer Records: Map PeerIDs to their respective Multiaddresses. Multiaddresses enable flexible p2p communication across various protocols (e.g., IPv4, IPv6, HTTP).
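As a rough sketch of what these two record types carry (the field names are illustrative, not the actual IPFS wire format):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ProviderRecord:
    cid: str             # content identifier of a block
    peer_ids: List[str]  # PeerIDs of nodes known to hold that block

@dataclass
class PeerRecord:
    peer_id: str          # stable identity of a peer
    multiaddrs: List[str] # e.g. "/ip4/203.0.113.7/tcp/4001"

# The DHT conceptually stores two maps:
#   CID    -> ProviderRecord (who has the block)
#   PeerID -> PeerRecord     (how to reach that peer)
```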

IPFS Lookup

To retrieve a file, a client resolves its CIDs to the peers that hold the corresponding blocks and then fetches those blocks using the BitSwap protocol (both steps are described later).

Kademlia DHT

This section provides a high-level overview of Kademlia, a DHT designed for decentralized p2p networks. See Dynamo: Amazon's Highly Available Key-value Store for a deeper dive into consistent hashing and different hashing algorithms.

Kademlia, developed by P. Maymounkov and D. Mazières in 2002, has gained prominence through its use in trackerless BitTorrent and currently serves as the DHT for IPFS.

Kademlia arranges virtual nodes as the leaves of a full binary tree. Note that the same physical machine may appear at multiple leaves of the tree in the form of virtual nodes.


In the example above, all leaf positions are occupied by nodes. However, in reality, not all leaf positions may have a node present.

A key x is stored on the node that minimizes the XOR distance to x. The XOR distance between two identifiers is their bitwise XOR, interpreted as an unsigned integer. Kademlia employs a replication factor of 20, ensuring data redundancy by storing each key on the 20 nodes closest to it in XOR distance.
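A minimal sketch of XOR distance and of choosing the k = 20 closest nodes to a key; the 3-bit IDs below are toy values rather than real 256-bit Kademlia identifiers.

```python
def xor_distance(a: int, b: int) -> int:
    # XOR distance: the bitwise XOR of the two IDs, read as an integer.
    return a ^ b

def k_closest(key: int, node_ids, k: int = 20):
    """Return the k node IDs closest to `key` under XOR distance."""
    return sorted(node_ids, key=lambda n: xor_distance(key, n))[:k]

# Example: among nodes 0b000..0b111, the nodes closest to key 0b101
# are 0b101 (distance 0), 0b100 (1), 0b111 (2), 0b110 (3).
nodes = list(range(8))
print(k_closest(0b101, nodes, k=4))  # [5, 4, 7, 6]
```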

XOR Properties

XOR was chosen as the distance metric between node IDs due to its desirable properties:
  • Identity: The XOR distance between a node and itself is zero.
  • Symmetry: The XOR distance between any two nodes (A and B) is the same regardless of the order (i.e., distance(A, B) = distance(B, A)).
  • Triangle Inequality: For any three nodes (A, B, and C), the XOR distance between A and B is less than or equal to the sum of the XOR distances between A and C, and C and B.

Routing Algorithm

Each node is assigned a unique identifier (e.g., a hash value).

Each node maintains a routing table whose entries are grouped by the length of the prefix they share with the node's own identifier b1b2b3...:

  • b1': nodes whose first bit differs from the current node's.
  • b1b2': nodes that share the first bit but differ in the second.
  • b1b2b3': nodes that share the first two bits but differ in the third, and so on.

When a node receives a request for a key, it checks whether it is responsible for that key. If not, it forwards the request to a known peer whose identifier is closer to the key in XOR distance (i.e., shares a longer common prefix with it).

The routing table has O(log N) entries. Each forwarding hop fixes at least one more prefix bit, so locating the responsible node takes O(log N) hops.
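The sketch below illustrates the idea with toy 8-bit identifiers: a node groups known peers by shared-prefix length and forwards a lookup to the known peer closest to the target in XOR distance. It is a simplification (one peer per bucket, no parallel queries), not the actual Kademlia implementation.

```python
def common_prefix_len(a: int, b: int, bits: int = 8) -> int:
    """Number of leading bits shared by two identifiers."""
    for i in range(bits):
        mask = 1 << (bits - 1 - i)
        if (a & mask) != (b & mask):
            return i
    return bits

def bucket_index(self_id: int, other_id: int, bits: int = 8) -> int:
    # Bucket i holds peers that share the first i bits with us and
    # differ in bit i + 1; there are O(log N) such buckets.
    return common_prefix_len(self_id, other_id, bits)

def next_hop(target: int, known_peers):
    """Forward the lookup to the known peer closest to the target in XOR distance."""
    return min(known_peers, key=lambda peer: peer ^ target)

# Each hop lands on a peer sharing at least one more prefix bit with the
# target, so a lookup terminates in O(log N) hops.
```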

XOR Trie

A binary trie is a trie data structure specifically designed to store binary strings.

Example: A binary trie containing the strings "001", "101", and "111" would have a structure where each node represents a bit (0 or 1), and the paths to the leaf nodes represent the stored strings.



The same tree can be represented in a more compact form; it then becomes an XOR trie.


XOR tries have the following key properties:
  • All binary strings inserted into the trie must be of the same length.
  • Both child branches of a node are either both empty (nil) or both non-empty.
  • If both child branches are non-empty, the node itself has no key associated with it.
  • If both child branches are leaf nodes, they both must have keys associated with them.
When inserting a new string into an XOR trie, the goal is often to minimize the increase in the tree's depth. In our example above, one strategy to achieve this is to insert a new node to the right of the leftmost node at depth 1.
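Below is a toy XOR trie over fixed-length bit strings that follows the splitting behavior described above; it is an illustrative sketch, not the actual go-libp2p trie implementation.

```python
class XorTrie:
    """Toy XOR trie: a node is either a leaf (holding at most one key)
    or an internal node with exactly two children."""

    def __init__(self):
        self.left = None    # branch for bit 0
        self.right = None   # branch for bit 1
        self.key = None     # set only on non-empty leaves

    def is_leaf(self):
        return self.left is None and self.right is None

    def _child(self, bit):
        return self.left if bit == "0" else self.right

    def insert(self, key, depth=0):
        if not self.is_leaf():
            self._child(key[depth]).insert(key, depth + 1)
            return
        if self.key is None:          # empty leaf: take the key
            self.key = key
            return
        if self.key == key:           # already present
            return
        old, self.key = self.key, None
        self.left, self.right = XorTrie(), XorTrie()   # split the leaf
        self._child(old[depth]).insert(old, depth + 1)
        self._child(key[depth]).insert(key, depth + 1)

trie = XorTrie()
for k in ("001", "101", "111"):
    trie.insert(k)
```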


Node Discovery

While IPFS is a decentralized system, it relies on a set of special nodes known as bootstrap nodes, which play a crucial role in network discovery:

  1. Bootstrap nodes maintain a list of other nodes within the IPFS network.
  2. When an IPFS client starts, it connects to one or more bootstrap nodes to obtain a list of potential peers.
  3. The client then establishes connections with a subset of these peers, typically 500 to 1000 nodes.

This initial peer connectivity ensures that requests can be effectively routed and forwarded within the decentralized network.

Content Retrieval

  1. Local and Connected Peer Lookup:

    • Upon receiving requests for CIDs, the node first attempts to locate each block locally.
    • If not found locally, it queries its connected peers using BitSwap (see step 3) to see if any of them possess the desired blocks. This is an opportunistic lookup.
  2. DHT Lookup (Kademlia):

    • If a block is not found among connected peers, the node initiates a DHT lookup using the Kademlia algorithm.
    • This DHT lookup helps locate nodes within the network that are likely to have the requested block.
  3. BitSwap:

    • Want-List: The requesting node sends a list of desired CIDs (the "want-list") to its connected peers (for opportunistic lookup) or to the peer owning the block (found after DHT lookup).
    • Block Sharing: Each node tracks the blocks its peers have requested. When a node receives a new block, it checks if any of its peers are seeking that specific block.
    • Have/Dont-Have Responses:
      • If a peer possesses a CID on the requesting node's want-list, it responds with a "have" message.
      • If a peer does not have the CID, it responds with a "dont-have" message.
    • Block Retrieval: Upon receiving a "have" response, the requesting node sends a "want-block" message to the peer, triggering the transfer of the requested block.
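A highly simplified, single-threaded sketch of this exchange; the real BitSwap protocol is asynchronous and message-based, and the class and function names here are made up for illustration.

```python
class Peer:
    def __init__(self, name, blocks=None):
        self.name = name
        self.blocks = dict(blocks or {})  # cid -> block bytes

    def on_want_have(self, cid):
        # Respond "have" if we hold the block, otherwise "dont-have".
        return "have" if cid in self.blocks else "dont-have"

    def on_want_block(self, cid):
        return self.blocks.get(cid)

def fetch(want_list, connected_peers):
    """Opportunistic BitSwap-style fetch from already-connected peers."""
    fetched = {}
    for cid in want_list:
        for peer in connected_peers:
            if peer.on_want_have(cid) == "have":
                fetched[cid] = peer.on_want_block(cid)
                break
        # If no connected peer has the CID, fall back to a DHT lookup here.
    return fetched

peers = [Peer("A", {"cid1": b"hello"}), Peer("B", {"cid2": b"world"})]
print(fetch(["cid1", "cid2", "cid3"], peers))
```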

Content Caching

In addition to retrieving content, IPFS nodes cache both file blocks and routing information locally. This caching mechanism significantly improves the performance of subsequent retrievals.

Challenges with Decentralized IPFS

The authors highlight several challenges in implementing a decentralized IPFS system:

1. Slow Data Publishing: The primary bottleneck in data publishing is the time-consuming process of replicating data records across 20 nodes in the DHT. The 95th percentile publishing time is a significant 66.73 seconds, primarily due to the overhead of DHT lookups to locate the 20 closest nodes for each CID. This slow publishing speed is unsuitable for real-time data streaming applications, although it may be acceptable for static data.

2. Slow Data Retrieval: Similar to publishing, data retrieval is also hindered by the time taken for DHT lookups. The average retrieval time is 4.42 seconds, which is considered too slow for many web3 applications that require rapid data access.

3. Complexity of Setup: Setting up and using IPFS currently requires a degree of technical expertise. Given that 58% of internet traffic originates from mobile devices and 92% of users access the web via smartphones, the complexity of IPFS setup presents a significant barrier to widespread adoption.

Towards Centralization

The authors propose a solution that introduces a degree of centralization to address the performance limitations of decentralized IPFS. It's crucial to emphasize that this centralization is limited in scope and focuses solely on performance improvements. The core decentralized nature of the IPFS network remains intact. Users can still bypass these centralized components if desired.

Centralized Components

  • InterPlanetary Network Indexers: These accelerate content publication and retrieval by optimizing data indexing and lookup processes.
  • Hydra Boosters: These specialized nodes improve performance by reliably hosting a large number of provider records, enhancing the efficiency of locating data within the network.
  • HTTP Gateways: These gateways provide an HTTP interface to IPFS, facilitating easier integration and adoption by simplifying access for developers and applications.

The authors also made extensive changes to the IPFS client code to integrate all of these centralized components.

InterPlanetary Network Indexers

InterPlanetary Network Indexers are high-performance key-value stores that efficiently index provider records. The key is the CID; the value is the set of physical addresses of peers that possess the content, along with the protocols to use. By offloading the storage of provider records from the DHT, Indexers significantly reduce network overhead and improve query performance.

Pebble (by Cockroach Labs), a high-performance key-value store, is used as the underlying data store.
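Conceptually, an Indexer behaves like the in-memory key-value store sketched below, mapping a CID to the provider records that can serve it; the real service persists records in Pebble and exposes a dedicated query API, so this is only an illustrative model.

```python
class Indexer:
    def __init__(self):
        self.records = {}  # cid -> list of provider records

    def put(self, cid, peer_addr, protocol):
        # Record that a peer can serve this CID over the given protocol.
        self.records.setdefault(cid, []).append(
            {"addr": peer_addr, "protocol": protocol})

    def query(self, cids):
        # Return provider records for every CID the indexer knows about.
        return {cid: self.records.get(cid, []) for cid in cids}

idx = Indexer()
idx.put("cidA", "/ip4/203.0.113.7/tcp/4001", "bitswap")
print(idx.query(["cidA", "cidB"]))
```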

Advertisement

To be visible on the network, a peer must advertise its available content to the Indexers. This is achieved through a chain of immutable advertisements – a data structure that records the publication or deletion of content by a given provider.

Providers maintain these advertisement chains locally. Information about these advertisements is disseminated to Indexers via gossip-based announcement messages. Indexers enable efficient bulk publishing of content availability information.
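Here is a rough sketch of an immutable advertisement chain, in which each advertisement links to its predecessor by hash so an Indexer can replay a provider's full publication history; the field names are illustrative assumptions, not the actual advertisement schema.

```python
import hashlib
import json

def make_advertisement(prev_hash, provider_id, cids, is_removal=False):
    """Create an immutable advertisement entry linked to its predecessor."""
    ad = {
        "prev": prev_hash,      # hash of the previous advertisement, or None
        "provider": provider_id,
        "cids": sorted(cids),
        "remove": is_removal,   # True if this entry deletes content
    }
    ad_bytes = json.dumps(ad, sort_keys=True).encode()
    return hashlib.sha256(ad_bytes).hexdigest(), ad

# Build a two-entry chain: publish two CIDs, then remove one of them.
h1, ad1 = make_advertisement(None, "peer-123", ["cidA", "cidB"])
h2, ad2 = make_advertisement(h1, "peer-123", ["cidA"], is_removal=True)
# The provider announces only the head hash (h2); an Indexer walks the
# `prev` links to replay the full publication history.
```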

Client Queries

Indexer identities are publicly available. Clients can query Indexers with one or more CIDs. Indexers respond with a list of provider records that match the query.

Hydra Boosters

Hydra introduces shortcuts within the IPFS routing space to improve performance. It consists of:

  • Hydra Head Nodes: These nodes are strategically placed within the DHT itself.
  • Shared Hydra Database: This database, implemented using Amazon DynamoDB, stores mappings between CIDs and peer addresses. Hydra Head nodes have access to this database.
The Hydra database is distributed across multiple servers within Amazon's infrastructure, enhancing performance. However, it's important to note that this database is not fully decentralized.

Picking Node IDs for Hydra Heads

Hydra Heads are strategically placed within the DHT to ensure efficient routing. To do this, the PeerIDs of Hydra Heads are tracked in an XOR trie. When adding a new node:

  • Two random node IDs are generated (power of two choices).
  • The node ID that keeps the depth of the trie more regular is selected for the Hydra Head.
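A rough sketch of this placement strategy under simplified assumptions: instead of measuring the effect on the XOR trie's depth, the toy code below buckets the ID space by its top few bits and keeps whichever of two random candidates lands in the emptier bucket.

```python
import random
from collections import Counter

PREFIX_BITS = 4  # toy: bucket the 32-bit ID space by its top 4 bits

def prefix(node_id: int) -> int:
    return node_id >> (32 - PREFIX_BITS)

def pick_hydra_head_id(existing_ids):
    """Power of two choices: prefer the candidate in the emptier region."""
    counts = Counter(prefix(n) for n in existing_ids)
    a, b = random.getrandbits(32), random.getrandbits(32)
    return a if counts[prefix(a)] <= counts[prefix(b)] else b

heads = []
for _ in range(100):
    heads.append(pick_hydra_head_id(heads))
# The resulting head IDs are spread more evenly across the ID space
# than purely random placement would give.
```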

Power of Two Choices

Consider load balancing requests across N symmetric servers.

  • Option 1 (Random): Distributing requests randomly across servers.

  • Option 2 (Two Choices): Selecting two random servers and routing the request to the server with the shorter queue.

  • Option 3 (Three Choices): Selecting three random servers and routing to the server with the lowest load.

Option 2 exhibits a significant performance improvement over Option 1. However, the improvement gained by moving from Option 2 to Option 3 is relatively minor.
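A quick balls-into-bins simulation (simplified: requests never complete) illustrating this claim; the parameters are arbitrary.

```python
import random

def max_load(n_servers: int, n_requests: int, choices: int) -> int:
    loads = [0] * n_servers
    for _ in range(n_requests):
        picks = random.sample(range(n_servers), choices)
        target = min(picks, key=lambda s: loads[s])  # shortest queue wins
        loads[target] += 1
    return max(loads)

random.seed(0)
for c in (1, 2, 3):
    print(f"{c} choice(s): max load = {max_load(1000, 100_000, c)}")
# The drop in maximum load from 1 choice to 2 is dramatic;
# the further drop from 2 choices to 3 is marginal.
```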

HTTP Gateway

HTTP Gateways provide a bridge between the IPFS network and the standard HTTP protocol. These gateways leverage NGINX, a high-performance web server, to cache frequently accessed IPFS content. NGINX employs a Least Recently Used (LRU) cache replacement policy.
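As a toy model of the gateway's caching behavior, here is a minimal LRU cache keyed by CID; NGINX's actual proxy cache is configured declaratively rather than implemented like this.

```python
from collections import OrderedDict

class LruCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # cid -> cached block/response

    def get(self, cid):
        if cid not in self.entries:
            return None                      # cache miss: fetch via IPFS
        self.entries.move_to_end(cid)        # mark as most recently used
        return self.entries[cid]

    def put(self, cid, content):
        self.entries[cid] = content
        self.entries.move_to_end(cid)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
```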

Evaluation

Indexer Performance

  • The indexer currently stores a massive 173 billion records, a 100x increase compared to the DHT.
  • This data is primarily contributed by 604 major publishers utilizing NFTs or dedicated storage services.
  • The indexer handles 2.5k lookup requests per second, while 5k requests per second still rely on DHT lookups.
  • Requires a significant capital investment ($10k) and monthly operating costs ($1k) per indexer.
  • The indexer cache has a 65.22% hit rate, significantly accelerating lookups.
  • Even for the 34.78% of requests that miss the cache, latency remains lower than direct DHT lookups.

Hydra Booster Performance

  • Hydra boosters contribute to improved lookup times, although the benefits are not uniform across all regions. Regions like eu-central, already exhibiting good performance, did not experience significant improvements.
  • The placement strategy effectively positions Hydra Heads, ensuring that 96.6% of peers have at least one Hydra Head within their 20-proximity.
  • Lookup speed is improved by -3.3% to 36.5% with Hydra Boosters. Performance at the tail of the distribution is still comparable to direct DHT lookups.
  • This proximity significantly reduces DHT hops, often to a single hop.

HTTP Gateway Performance

  • Cache hits, with 50.3% hit rate, achieve a 24ms p95 latency.
  • An additional 27.6% of requests are resolved by contacting the first node directly.
  • For the remaining requests, performance may degrade slightly due to the extra hop introduced by the gateway servers.

Problems

This centralized approach presents several critical issues:

  • Censorship Risk: Indexers can potentially facilitate censorship of content within the IPFS network.
  • Hydra Booster Issues:
    • The underlying Kademlia DHT is already vulnerable to malicious actors.
    • Hydra Boosters, while intended to improve performance, exacerbate this vulnerability by enabling malicious nodes to strategically position themselves close to several other nodes.
    • There are additional performance concerns.
    • In fact, Hydra Boosters have since been shut down by the IPFS project.
  • Gateway Security Weaknesses: HTTP gateways compromise the end-to-end cryptographic validation process, introducing security risks.
  • False and slow advertisements can significantly degrade the user experience.

Paper Review

This return to a more centralized model, with components like Hydra Boosters and Indexers, resembles the Napster era, where a central node exerted significant control over file sharing. This contradicts the core principle of decentralization. This inherent tension is aptly captured by the paper's title, "The Eternal Tussle". We seem to constantly oscillate between centralization and decentralization, searching for the optimal balance.

For those interested in learning more about IPFS, I would like to introduce my paper Distributed Tracing for InterPlanetary File System, which explains the internal workings of IPFS protocols :).


This landmark paper, presented at ODSI '12, has become one of Google's most significant contributions to distributed computing. It didn't solve the long-standing core problem of scalability of 2PC in distributed systems, rather, it introduced  TrueTime  that revolutionized system assumptions. Authored by J.C. Corbett , with contributions from pioneers like Jeff Dean and Sanjay Ghemawat , this paper effectively ended my exploration of distributed SQL databases. It represents the leading edge of the field. Paper Link I would highly recommend reading the following before jumping into this article: 1.  Practical Uses of Synchronized Clocks in Distributed Systems where I introduced why clock synchronization is necessary but not sufficient for external consistency. 2.  A New Presumed Commit Optimization for Two Phase Commit where I introduced two-phase commits (2PC) and how it is solved in a distributed system. 3.  Amazon Aurora: Design Considerations for High Th...