
Paper Insights #17 - The Eternal Tussle: Exploring the Role of Centralization in IPFS

I'd like to delve into a technology I have research experience with: the InterPlanetary File System (IPFS). While the original IPFS white paper - IPFS - Content Addressed, Versioned, P2P File System - was influential, it is not peer reviewed. Therefore, I'll focus on a related paper presented at the NSDI '24 conference, a prestigious venue in distributed systems research. This paper was also discussed in Stanford's CS244b class.

Paper Link

For background, I recommend reading Dynamo: Amazon's Highly Available Key-value Store, where I discuss consistent hashing and its techniques.

Now, let's explore some fundamental concepts of IPFS.

Decentralized Web

Traditional websites are like 1-800 (toll-free) numbers: the website owner alone bears the costs of hosting and computation.

In contrast, the decentralized web leverages a distributed network of nodes. The computational burden, and therefore the associated costs, can be shared among numerous participants, regardless of the website's traffic load.

The decentralized web is popularly known as Web3.

Decentralized File System

A decentralized file system represents each file block by its cryptographic hash. This hash uniquely identifies the block's content. Subsequently, these hashes are mapped to locations on a decentralized network of servers.

InterPlanetary File System (IPFS)

IPFS is a decentralized file system that (see the sketch after this list):

  1. Divides files into blocks.
  2. Generates a unique Content Identifier (CID) for each block using cryptographic hashing.
  3. Stores the mapping of CIDs to their locations on a Distributed Hash Table (DHT).
Note: IPFS also supports directories. The directory contents (e.g., list of files) are hashed and stored with a unique CID on IPFS.
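To make steps 1 and 2 concrete, here is a minimal sketch in Python. It assumes a fixed block size and uses a plain SHA-256 hex digest as a stand-in for a real CID (actual IPFS CIDs wrap the digest in a multihash/multibase encoding, and the default chunk size may differ):

```python
import hashlib

BLOCK_SIZE = 256 * 1024  # assumed chunk size (256 KiB)

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Divide a file's bytes into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def cid_for_block(block: bytes) -> str:
    """Stand-in CID: a SHA-256 digest of the block's content.
    Real IPFS CIDs encode the digest with multihash/multibase metadata."""
    return hashlib.sha256(block).hexdigest()

# Example: hash every block of a file and collect the CID -> block mapping.
data = b"example file contents" * 10_000
blocks = split_into_blocks(data)
cid_to_block = {cid_for_block(b): b for b in blocks}
print(len(blocks), "blocks,", len(cid_to_block), "unique CIDs")
```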



In the diagram above, clients initially retrieve the directory content from IPFS. This content includes the CIDs of the desired files. Clients then fetch the complete files by looking up these CIDs. This structure is also called a Merkle Directed Acyclic Graph (DAG).

IPFS DHT Entries

Two key entries within the IPFS DHT (sketched in code after the list) are:

  • Provider Records: Map CIDs to the PeerIDs of nodes that possess the corresponding blocks.
  • Peer Records: Map PeerIDs to their respective Multiaddresses. Multiaddresses enable flexible p2p communication across various protocols (e.g., IPv4, IPv6, HTTP).
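As an illustration only (not the actual wire format of these records), the two record types and how a lookup chains them together could be sketched like this:

```python
# Hypothetical, simplified record shapes; real records carry more metadata.
provider_records = {
    # CID -> PeerIDs of nodes that hold the block
    "cid-1": ["PeerA", "PeerB"],
}
peer_records = {
    # PeerID -> multiaddresses the peer is reachable on
    "PeerA": ["/ip4/203.0.113.7/tcp/4001", "/ip6/2001:db8::1/tcp/4001"],
    "PeerB": ["/dns4/peer-b.example/tcp/443/wss"],
}

def resolve(cid: str) -> list[str]:
    """Resolve a CID to concrete addresses: provider record first, then peer records."""
    addrs = []
    for peer_id in provider_records.get(cid, []):
        addrs.extend(peer_records.get(peer_id, []))
    return addrs

print(resolve("cid-1"))
```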

IPFS Lookup

To retrieve a file, a client resolves its CIDs to providers (via connected peers or the DHT) and transfers the blocks using the BitSwap protocol (described later).

Kademlia Hash

This section provides a high-level overview of Kademlia, a DHT designed for decentralized p2p networks. See Dynamo: Amazon's Highly Available Key-value Store for a deeper dive into consistent hashing and different hashing algorithms.

Kademlia, developed by P. Maymounkov and D. Mazières in 2002, has gained prominence through its use in trackerless BitTorrent and currently serves as the DHT for IPFS.

Kademlia arranges virtual nodes as the leaves of a full binary tree. Note that the same physical machine may appear at multiple leaves of the tree in the form of virtual nodes.


In the example above, all leaf positions are occupied by nodes. However, in reality, not all leaf positions may have a node present.

A key x is stored on the nodes that minimize the XOR distance to x. The XOR distance between two identifiers is their bitwise XOR, interpreted as an integer. Kademlia employs a replication factor of 20, ensuring data redundancy by storing each key on the 20 closest nodes by XOR distance.
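A minimal sketch of XOR distance and of picking the k closest nodes for a key; node IDs are shortened to small integers for readability, whereas real identifiers are much longer (e.g., 256-bit hashes in IPFS):

```python
def xor_distance(a: int, b: int) -> int:
    """XOR distance: the bitwise XOR of two identifiers, read as an integer."""
    return a ^ b

def k_closest(key: int, node_ids: list[int], k: int = 20) -> list[int]:
    """The k nodes closest to `key` in XOR distance; they hold the key's replicas."""
    return sorted(node_ids, key=lambda n: xor_distance(key, n))[:k]

nodes = [0b0001, 0b0100, 0b0111, 0b1010, 0b1101]
print(k_closest(0b0101, nodes, k=3))  # [4, 7, 1] with distances 1, 2, 4
```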

XOR Properties

XOR was chosen as the distance metric between node IDs due to its desirable properties:
  • Identity: The XOR distance between a node and itself is zero.
  • Symmetry: The XOR distance between any two nodes (A and B) is the same regardless of the order (i.e., distance(A, B) = distance(B, A)).
  • Triangle Inequality: For any three nodes (A, B, and C), the XOR distance between A and B is less than or equal to the sum of the XOR distances between A and C, and C and B.

Routing Algorithm

Each node is assigned a unique identifier (e.g., a hash value).

Each node maintains a routing table whose entries are grouped by how long a prefix they share with the node's own identifier b1b2b3...:

  • b1': entries for nodes whose first bit differs from the current node (b1' denotes the complement of b1).
  • b1b2': entries for nodes that share the first bit but differ in the second.
  • b1b2b3': entries for nodes that share the first two bits but differ in the third, and so on.

When a node receives a request for a specific hash, it checks whether it owns the hash. If not, it forwards the request to a known node that is closer to the target in XOR distance (i.e., one that shares a longer identifier prefix with the target).

The routing table has O(log N) entries. Each forwarding step extends the matched prefix by at least one bit, so finding the responsible node takes O(log N) hops, i.e., logarithmic lookup time.
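A minimal sketch of how a node could map a target identifier to a routing-table bucket, assuming buckets are indexed by the position of the first differing bit (equivalently, the length of the shared prefix):

```python
ID_BITS = 8  # toy identifier width; real IDs are much longer

def bucket_index(self_id: int, target_id: int, bits: int = ID_BITS) -> int:
    """Index of the first (most significant) bit where the two IDs differ.
    Bucket i holds contacts that share an i-bit prefix with us and differ at bit i."""
    d = self_id ^ target_id
    if d == 0:
        raise ValueError("target is our own ID")
    return bits - d.bit_length()  # number of leading bits shared

self_id = 0b1011_0010
print(bucket_index(self_id, 0b0011_0010))  # 0: differs in the very first bit
print(bucket_index(self_id, 0b1011_0110))  # 5: shares the first 5 bits
```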

XOR Trie

A binary trie is a trie data structure specifically designed to store binary strings.

Example: A binary trie containing the strings "001", "101", and "111" would have a structure where each node represents a bit (0 or 1), and the paths to the leaf nodes represent the stored strings.



The same tree can be represented in a more compact form; it then becomes an XOR trie.


XOR tries have the following key properties:
  • All binary strings inserted into the trie must be of the same length.
  • Both child branches of a node are either both empty (nil) or both non-empty.
  • If both child branches are non-empty, the node itself has no key associated with it.
  • If both child branches are leaf nodes, they both must have keys associated with them.
When inserting a new string into an XOR trie, the goal is often to minimize the increase in the tree's depth. In our example above, one strategy to achieve this is to insert a new node to the right of the leftmost node at depth 1.
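A minimal sketch of the uncompressed binary trie from the example above (the compact XOR-trie form merges single-child chains), with a helper that reports the depth at which a new key would land, which is the quantity the depth-minimizing insertion strategy cares about:

```python
class TrieNode:
    __slots__ = ("children",)
    def __init__(self):
        self.children = {}  # bit ('0' or '1') -> TrieNode

class BinaryTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key: str) -> None:
        """Insert a fixed-length bit string such as '101'."""
        node = self.root
        for bit in key:
            node = node.children.setdefault(bit, TrieNode())

    def insertion_depth(self, key: str) -> int:
        """Depth of the deepest existing node that shares a prefix with the key."""
        node, depth = self.root, 0
        for bit in key:
            if bit not in node.children:
                break
            node = node.children[bit]
            depth += 1
        return depth

trie = BinaryTrie()
for k in ("001", "101", "111"):
    trie.insert(k)
print(trie.insertion_depth("110"))  # shares prefix "11" -> depth 2
```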


Node Discovery

While IPFS is a decentralized system, it relies on a set of special nodes known as bootstrap nodes. Bootstrap nodes play a crucial role in network discovery:

  1. They maintain a list of other nodes within the IPFS network.
  2. When an IPFS client starts, it connects to one or more bootstrap nodes to obtain a list of potential peers.
  3. The client then establishes connections with a subset of these peers, typically ranging from 500 to 1000 nodes.

This initial peer connectivity ensures that requests can be effectively routed and forwarded within the decentralized network.

Content Retrieval

  1. Local and Connected Peer Lookup:

    • Upon receiving requests for CIDs, the node first attempts to locate each block locally.
    • If not found locally, it queries its connected peers using BitSwap (see step 3) to check whether any of them possess the desired blocks. This is an opportunistic lookup.
  2. DHT Lookup (Kademlia):

    • If a block is not found among connected peers, the node initiates a DHT lookup using the Kademlia algorithm.
    • This DHT lookup helps locate nodes within the network that are likely to have the requested block.
  3. BitSwap:

    • Want-List: The requesting node sends a list of desired CIDs (the "want-list") to its connected peers (for opportunistic lookup) or to the peer owning the block (found after DHT lookup).
    • Block Sharing: Each node tracks the blocks its peers have requested. When a node receives a new block, it checks if any of its peers are seeking that specific block.
    • Have / Dont-Have Responses:
      • If a peer possesses a CID on the requesting node's want-list, it responds with a "have" message.
      • If a peer does not have the CID, it responds with a "dont-have" message.
    • Block Retrieval: Upon receiving a "have" response, the requesting node sends a "want-block" message to the peer, triggering the transfer of the requested block (a minimal sketch of this exchange follows).
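A minimal sketch of this exchange using the message names above; it illustrates the flow only and is not the real libp2p/BitSwap wire protocol:

```python
class Peer:
    def __init__(self, name: str, blocks: dict[str, bytes]):
        self.name = name
        self.blocks = blocks                         # CID -> block bytes
        self.peer_wants: dict[str, set[str]] = {}    # peer name -> CIDs it asked for

    def on_want_have(self, requester: str, want_list: set[str]) -> dict[str, str]:
        """Record the requester's wants and answer have / dont-have per CID."""
        self.peer_wants.setdefault(requester, set()).update(want_list)
        return {cid: ("have" if cid in self.blocks else "dont-have") for cid in want_list}

    def on_want_block(self, cid: str) -> bytes | None:
        """Return the block itself in response to a want-block message."""
        return self.blocks.get(cid)

# Requesting node sends its want-list to a connected peer (opportunistic lookup).
peer = Peer("PeerA", {"cid-1": b"block one"})
responses = peer.on_want_have("Requester", {"cid-1", "cid-2"})
wanted = [cid for cid, r in responses.items() if r == "have"]
blocks = {cid: peer.on_want_block(cid) for cid in wanted}  # want-block -> block transfer
print(responses, blocks)
```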

Content Caching

In addition to retrieving content, IPFS nodes cache both file blocks and routing information locally. This caching mechanism significantly improves subsequent content retrieval performance.

Challenges with Decentralized IPFS

The authors highlight several challenges in implementing a decentralized IPFS system:

1. Slow Data Publishing: The primary bottleneck in data publishing is the time-consuming process of replicating data records across 20 nodes in the DHT. The 95th percentile publishing time is a significant 66.73 seconds, primarily due to the overhead of DHT lookups to locate the 20 closest nodes for each CID. This slow publishing speed is unsuitable for real-time data streaming applications, although it may be acceptable for static data.

2. Slow Data Retrieval: Similar to publishing, data retrieval is also hindered by the time taken for DHT lookups. The average retrieval time is 4.42 seconds, which is considered too slow for many web3 applications that require rapid data access.

3. Complexity of Setup: Setting up and using IPFS currently requires a degree of technical expertise. Given that 58% of internet traffic originates from mobile devices and 92% of users access the web via smartphones, the complexity of IPFS setup presents a significant barrier to widespread adoption.

Towards Centralization

The authors propose a solution that introduces a degree of centralization to address the performance limitations of decentralized IPFS. It's crucial to emphasize that this centralization is limited in scope and focuses solely on performance improvements. The core decentralized nature of the IPFS network remains intact. Users can still bypass these centralized components if desired.

Centralized Components

  • InterPlanetary Network Indexers: These accelerate content publication and retrieval by optimizing data indexing and lookup processes.
  • Hydra Boosters: These specialized nodes improve performance by reliably hosting a large number of provider records, enhancing the efficiency of locating data within the network.
  • HTTP Gateways: These gateways provide an HTTP interface to IPFS, facilitating easier integration and adoption by simplifying access for developers and applications.

The authors also made extensive modifications to the IPFS client code to integrate all these centralized components.

InterPlanetary Network Indexers

InterPlanetary Network Indexers are high-performance key-value stores that efficiently index provider records. The key is the CID; the value is the set of physical addresses of peers that possess the content, along with the protocols to be used. By offloading the storage of provider records from the DHT, Indexers significantly reduce network overhead and improve query performance.

Pebble (by Cockroach Labs), a high-performance key-value store, is used as the underlying data store.

Advertisement

To be visible on the network, a peer must advertise its available content to the Indexers. This is achieved through a chain of immutable advertisements – a data structure that records the publication or deletion of content by a given provider.

Providers maintain these advertisement chains locally. Information about these advertisements is disseminated to Indexers via gossip-based announcement messages. Indexers enable efficient bulk publishing of content availability information.
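A minimal sketch of such a hash-linked advertisement chain, assuming each advertisement records a batch of published (or removed) CIDs and the hash of the previous advertisement; real indexer advertisements are signed and carry additional metadata:

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass
class Advertisement:
    provider_id: str
    cids: list[str]
    is_removal: bool
    prev_hash: str | None  # link to the previous advertisement makes the chain immutable

    def digest(self) -> str:
        return hashlib.sha256(json.dumps(asdict(self), sort_keys=True).encode()).hexdigest()

# The provider appends advertisements locally; the latest head is announced (e.g., via gossip).
chain: list[Advertisement] = []

def publish(provider_id: str, cids: list[str], is_removal: bool = False) -> Advertisement:
    prev = chain[-1].digest() if chain else None
    ad = Advertisement(provider_id, cids, is_removal, prev)
    chain.append(ad)
    return ad

publish("Provider-1", ["cid-1", "cid-2"])
head = publish("Provider-1", ["cid-3"])
print("announce head:", head.digest())
```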

Client Queries

Indexer identities are publicly available. Clients can query Indexers with one or more CIDs. Indexers respond with a list of provider records that match the query.

Hydra Boosters

Hydra introduces shortcuts within the IPFS routing space to improve performance. It consists of:

  • Hydra Head Nodes: These nodes are strategically placed within the DHT itself.
  • Shared Hydra Database: This database, implemented using Amazon DynamoDB, stores mappings between CIDs and peer addresses. Hydra Head nodes have access to this database.
The Hydra database is distributed across multiple servers within Amazon's infrastructure, enhancing performance. However, it's important to note that this database is not fully decentralized.

Picking Node IDs for Hydra Heads

Hydra Heads are strategically placed within the DHT to ensure efficient routing. To do this, the PeerIDs of Hydra Heads are tracked in an XOR trie. When adding a new node (see the sketch after this list):

  • Two random node IDs are generated (power of two choices).
  • The node ID that keeps the depth of the trie more regular is selected for the Hydra Head.
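A minimal sketch of this placement, assuming that "keeping the trie regular" can be approximated by preferring the candidate whose longest shared prefix with existing head IDs is shorter (so it lands in a less crowded region of the ID space):

```python
import secrets

ID_BITS = 16  # toy width; real PeerIDs are much longer

def shared_prefix_len(a: int, b: int, bits: int = ID_BITS) -> int:
    d = a ^ b
    return bits if d == 0 else bits - d.bit_length()

def crowding(candidate: int, existing: list[int]) -> int:
    """Approximate depth at which the candidate would sit in the trie of existing head IDs."""
    return max((shared_prefix_len(candidate, e) for e in existing), default=0)

def pick_head_id(existing: list[int]) -> int:
    """Power of two choices: draw two random IDs, keep the less crowded one."""
    a = secrets.randbits(ID_BITS)
    b = secrets.randbits(ID_BITS)
    return a if crowding(a, existing) <= crowding(b, existing) else b

heads: list[int] = []
for _ in range(8):
    heads.append(pick_head_id(heads))
print([f"{h:0{ID_BITS}b}" for h in heads])
```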

Power of Two Choices

Consider load balancing requests across N symmetric servers.

  • Option 1 (Random): Distributing requests randomly across servers.

  • Option 2 (Two Choices): Selecting two random servers and routing the request to the server with the shorter queue.

  • Option 3 (Three Choices): Selecting three random servers and routing to the server with the lowest load.

Option 2 exhibits a significant performance improvement over Option 1. However, the improvement gained by moving from Option 2 to Option 3 is relatively minor.
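A small simulation contrasting the three options, assuming each request permanently occupies one queue slot; the maximum queue length under two choices is typically much closer to the mean than under purely random placement:

```python
import random

def simulate(num_servers: int, num_requests: int, choices: int) -> int:
    """Route requests and return the maximum queue length observed."""
    queues = [0] * num_servers
    for _ in range(num_requests):
        candidates = random.sample(range(num_servers), choices)
        target = min(candidates, key=lambda s: queues[s])
        queues[target] += 1
    return max(queues)

random.seed(0)
print("random (1 choice): max queue =", simulate(1000, 100_000, choices=1))
print("two choices:       max queue =", simulate(1000, 100_000, choices=2))
print("three choices:     max queue =", simulate(1000, 100_000, choices=3))
```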

HTTP Gateway

HTTP Gateways provide a bridge between the IPFS network and the standard HTTP protocol. These gateways leverage NGINX, a high-performance web server, to cache frequently accessed IPFS content. NGINX employs a Least Recently Used (LRU) cache replacement policy.

Evaluation

Indexer Performance

  • The indexer currently stores a massive 173 billion records, a 100x increase compared to the DHT.
  • This data is primarily contributed by 604 major publishers utilizing NFTs or dedicated storage services.
  • The indexer handles 2.5k lookup requests per second, while 5k requests still rely on DHT lookups.
  • Requires a significant capital investment ($10k) and monthly operating costs ($1k) per indexer.
  • The indexer cache has a 65.22% hit rate, significantly accelerating lookups.
  • Even for the 34.78% of requests that miss the cache, latency remains lower than direct DHT lookups.

Hydra Booster Performance

  • Hydra boosters contribute to improved lookup times, although the benefits are not uniform across all regions. Regions like eu-central, already exhibiting good performance, did not experience significant improvements.
  • The placement strategy effectively positions Hydra Heads, ensuring that 96.6% of peers have at least one Hydra Head within their 20-proximity.
  • Lookup speed is improved by -3.3% to 36.5% with Hydra Boosters. Performance at the tail of the distribution is still comparable to direct DHT lookups.
  • This proximity significantly reduces DHT hops, often to a single hop.

HTTP Gateway Performance

  • Cache hits, with 50.3% hit rate, achieve a 24ms p95 latency.
  • An additional 27.6% of requests are resolved by contacting the first node directly.
  • For the remaining requests, performance may degrade slightly due to the extra hop introduced by the gateway servers.

Problems

This centralized approach presents several critical issues:

  • Censorship Risk: Indexers can potentially facilitate censorship of content within the IPFS network.
  • Hydra Booster Issues:
    • The underlying Kademlia DHT is already vulnerable to malicious actors.
    • Hydra Boosters, while intended to improve performance, exacerbate this vulnerability by enabling malicious nodes to strategically position themselves close to several other nodes.
    • There are additional performance concerns.
    • Indeed, Hydra Boosters have since been decommissioned by IPFS.
  • Gateway Security Weaknesses: HTTP gateways compromise the end-to-end cryptographic validation process, introducing security risks.
  • False and slow advertisements can significantly degrade the user experience.

Paper Review

This return to a more centralized model, with components like Hydra Boosters and Indexers, resembles the Napster era, where a central node exerted significant control over file sharing. This contradicts the core principle of decentralization. This inherent tension is aptly captured by the paper's title, "The Eternal Tussle". We seem to constantly oscillate between centralization and decentralization, searching for the optimal balance.

For those interested in learning more about IPFS, I would like to introduce my paper Distributed Tracing for InterPlanetary File System, which explains the internal workings of IPFS protocols :).
