Skip to main content

Paper Insights #1 - Moving Beyond End-to-End Path Information to Optimize CDN Performance

This highly influential paper on Content Delivery Networks (CDNs) was authored by Rupa Krishnan  et. al, including Sushant Jain, who was listed fourth among the authors. Sushant was a valued colleague of mine at Google Ads Infrastructure, where he served as Senior Engineering Director for many years.

Paper Link

Before delving into the paper's concepts, which are generally straightforward to grasp, let's explore some relevant background information.

OASIS (2006)

OASIS, developed by M. Freedman, K. Lakshminarayanan, and my former Distributed Systems (CS244b) professor at Stanford, D. Mazieres, elegantly addresses the critical challenge for Internet: locating the service replica with the lowest latency for a given client.

Prior to OASIS

Clients naively pinged every service replica to determine the fastest one based on round-trip time (RTT). While highly accurate, this approach suffered from excessive probing and computationally expensive comparisons.

OASIS Architecture

OASIS introduced a two-tier architecture. OASIS maintains replicas that redirect a client to the nearest replica of any service. 

The OASIS replicas:

  1. Possess global membership information about all service replicas.
  2. Employ epidemic gossiping to detect and handle failures.
  3. Utilize consistent hashing for efficient replica placement.
The 2000s witnessed a surge in decentralized protocols, such as those employing consistent hashing. OASIS is an example of such an approach.

OASIS Algorithm

Step 1

The client's IP address is converted to an IP prefix. An IP prefix, or network prefix, is a group of IP addresses that identifies a network. IP prefixes help organize IP addresses and the devices connected to the Internet.

18.26.4.9 -> 18.0.0.0/8

18.0.0.0/8 is a CIDR notation representing a network address in routing. The /n suffix in CIDR indicates the number of bits in the subnet mask that are set to 1. In this specific case, /8 signifies a subnet mask of 255.0.0.0. To derive the network prefix from an IP address, we need to perform a bitwise AND operation between the IP address and the subnet mask.

Step 2

This IP prefix is mapped to a unique geographic proximity. There is a strong correlation between geographic proximity and RTT, the OASIS replica redirects the client to the geographically closest service replica.

If multiple replicas of service reside in the same geographic region, the client probes only those replicas to select the one with the lowest RTT, significantly reducing the probe space.

Integration and Impact

OASIS was seamlessly integrated into existing protocols like DNS and HTTP, demonstrating its practicality and potential for widespread adoption.

Google CDN Architecture

Google's CDN architecture shares similarities with OASIS, but with a key distinction: it utilizes the IP prefix of the DNS server initiating the request for redirection, rather than the client's IP address.

Goal

The authors contend that solely relying on RTT (a.k.a. end-to-end path information) for redirection, may not consistently deliver the best possible Quality of Service (QoS) to the client.

They delve into various factors that can contribute to suboptimal QoS, beyond simple RTT. The paper then explores methods for identifying and mitigating these contributing factors.

Before delving deeper into their proposed techniques, it's crucial to establish a common understanding of the key network terminologies employed throughout the paper.

Networking Architecture

Routing

Computers within a Local Area Network (LAN), such as your school network or a cluster of machines in a data center, communicate directly. These networks typically employ network switches to efficiently forward packets to their intended destinations based on their unique Media Access Control (MAC) addresses, which operate at the Ethernet layer.

However, when a packet needs to traverse beyond the confines of a local network, routing becomes essential. By examining the destination IP address of each packet, routers consult their routing tables to determine the optimal next hop in the network, sending the packet closer to its ultimate destination. These routing tables are dynamically calculated and maintained through sophisticated distributed algorithms.

Point of Presence (PoP)

A PoP is a designated entry point where end users can connect to an Internet Service Provider (ISP). Each ISP maintains a network of PoPs.

Within a single PoP, multiple routers may be present, allowing for various paths for data packets to travel from the end users to PoP.

Autonomous System (AS)

An AS is a collection of interconnected networks under the control of a single administrative entity. Each AS operates with its own distinct routing policies, determining how traffic flows within its boundaries. Common examples of entities operating ASes include:

  • ISPs
  • Large corporations (like Google)
  • Universities
  • Government agencies
Google's internal network, connecting its vast infrastructure (like the Borg cluster), functions as an AS with its own internal routing policies.

Border Gateway Routers (BGP Routers) are specialized routers reside at the edges of an AS, responsible for exchanging routing information with other ASes.

Traceroute

Traceroute is a widely used network diagnostic tool that reveals the sequence of routers traversed by packets traveling from a source to a destination. The router hops displayed by traceroute signifies the number of routing tables consulted along the path, providing insights into network topology.

It's important to note that the performance of Internet traffic can be impacted by the stability and condition of routers along the path, some of which may be older or less reliable.

Traceroute operates by sending a series of Internet Control Message Protocol (ICMP) packets with increasing Time-to-Live (TTL) values.


A fundamental characteristic of the Internet is that the path taken by packets from source A to destination B may differ significantly from the return path from B to A. For example:
  • A -> X -> Y -> B
  • B -> Y -> Z -> X -> A

Traceroute can only measure the RTT for each hop, not the one-way travel time. Thus, while traceroute effectively reveals the forward path, determining the reverse path requires more complex analysis.

iPlane

iPlane is a system designed to predict path properties between any two points on the Internet. iPlane uses traceroutes to all Internet prefixes from numerous vantage points (e.g., PlanetLab servers).

By analyzing the collected traceroute information, iPlane clusters routers into PoPs. Routers are grouped together if they exhibit the following characteristics:

  • Respond to traceroutes with the same source IP address.
  • Display similar RTT values across different vantage points.
This clustering technique enables iPlane to effectively model and predict Internet path characteristics, providing valuable insights into network topology and performance.

Latency Cause Detection Techniques

This paper explores several techniques for detecting latency issues in Google's CDN:

1. Redirection

Issue: Clients may experience increased latency when they are not redirected to the geographically closest CDN node. This can occur, for example, when the closest node is overloaded.

Detection: CDN nodes themselves can detect mismatched redirection by flagging IP prefixes that they are not responsible for serving.

2. Prefix Latency Inflation

Issue: Even when multiple IP prefixes are served by the same CDN node, some prefixes may exhibit significantly higher latency than others.

Detection: This is identified by performing traceroute analysis. For instance, if the route path is:

CDN -> R1 (1ms) -> R2 (2ms) -> R3 (100ms) -> Client

A significant delay between R2 and R3 (100ms) is considered anomalous, especially when the delay between CDN and R2 is only 2ms and the expected inter-hop delay is typically less than 40ms anywhere on earth. This suggests a circuitous path back from R3.

3. Queueing Delays:

Issue: Even within a single prefix, a significant difference between the median and minimum RTT observed by clients can indicate either different paths or queueing delays.

Observation: RTT inflation within a prefix is unaffected by:
  • Changes in routes throughout the day.
  • Changes in the PoP paths (determined using iPlane data).
  • Changes in the AS path.
The authors conclude that persistent RTT inflation is likely caused by queueing delays within the network.

Only (1) and (2) are addressed by the authors. In summary, (1) identifies anomalous prefixes facing redirection, and (2) identifies prefixes within a CDN node facing anomalously high RTT.

Mitigation

The paper explores several approaches to address the observed inflated RTT issue:

  • Direct Peering: Establishing direct peering connections between Google and the relevant ASes for the affected prefixes. This solution is exemplified in Case 1, where a new peering link was added between Google and PhilISP1.
  • Increased Peering Capacity: Enhancing the capacity of existing peering links between Google and ASes. This is demonstrated in Case 3, where a peering connection between Google and PhilISP2 existed but suffered from insufficient capacity.
  • Routing Configuration Optimization: Correcting routing configurations on border routers within Google's or the AS's network. Case 4 illustrates this approach, where Google's network administrators adjusted routing configurations to enable shorter reverse paths between Google and a JapanISP.
  • Traffic Engineering Techniques: Implementing various traffic engineering approaches to optimize network traffic flow.

Coral CDN 

Among the related works discussed in the paper, Coral CDN stands out. While I recall using Coral CDN in the 2000s, I never delved into its internal workings. Developed by the authors of OASIS, Coral CDN was a free, designed to mirror web content.

Coral CDN's primary use case was to mitigate "slashdotting", a phenomenon where a website, particularly smaller ones, experiences a sudden and overwhelming surge in traffic after being linked by a popular website.

The most distinctive feature of Coral CDN is its utilization of a Distributed Sloppy Hash Table (DSHT). Unlike traditional Distributed Hash Tables (DHTs) that rely on a single hash ring, DSHT employs multiple concentric Chord hash rings arranged hierarchically. DSHT can be adapted to work with other hashing protocols like Kademlia.

Coral CDN Interface

Coral provides a simple interface for higher-level applications:

  • put (key, val, ttl): Inserts a key-value mapping with a specified TTL for the entry.
  • get (key): Retrieves a subset of values associated with the given key.

Hierarchical Implementation

The Coral CDN implementation utilizes a three-tiered hierarchical structure:

  • Regional Layer: Connects nodes within the same geographical region.
  • Continental Layer: Connects nodes within a continent.
  • Global Layer: Connects nodes globally.

Nodes have the same Id across layers, but their assigned ranges vary due to variation in the hashing modulo. During the put operation, the value is replicated across all three layers, ensuring redundancy. Conversely, during the get operation, the system prioritizes retrieving the value from the local regional layer. 

Paper Review

The paper does introduce readers to design concepts related to CDNs. Additionally, fixing the issues discovered by these methods could have a very significant impact, potentially saving Google millions of dollars in network expenditure. However, the paper does not introduce any truly novel concepts. Rather, it essentially provides a strong approach to Internet analysis. I especially loved the part where they determined the queueing delay and circuitous reverse paths.

Comments

Popular posts from this blog

Paper Insights #18 - Practical Uses of Synchronized Clocks in Distributed Systems

This influential paper was authored by Barbara Liskov , a renowned computer scientist who pioneered the field of distributed systems. Paper Link The paper provides a valuable overview of several groundbreaking systems: At-most-once delivery (SCMP) : This system ensures that a message is delivered at most once, preventing duplicate messages. Authenticator Systems (Kerebos) : This system focuses on secure authentication and authorization within distributed environments. Cache consistency (Echo) : This system addresses the challenges of maintaining data consistency across distributed caches. Distributed Databases (Thor) : This system explores the design and implementation of distributed databases. Replicated File System (Harp) : This system investigates the principles of replicating files across multiple servers for improved availability and performance. While many of these concepts may seem outdated in the context of modern computing, studying them provides crucial insights in...

Paper Insights #16 - Cassandra - A Decentralized Structured Storage System

This research paper, authored by Avinash Lakshman (co-inventor of Amazon Dynamo) and Prashant Malik , originates from Facebook and dates back to 2008. Paper Link Cassandra, in its design, appears to be a synthesis of Amazon's Dynamo (2007) and Google's Bigtable (2006). It draws heavily upon the concepts of both systems. Notably, this paper was published during the rise of these influential databases. However, Cassandra also introduces novel ideas that warrant further investigation. Recommended Read: Dynamo: Amazon's Highly Available Key-value Store Let's begin with some of fundamental concepts. SQL Databases SQL databases are a category of databases which are inherently consistency. This implies that data integrity is always upheld. For instance, in a banking database, the cumulative balance across all accounts must remain unchanged at any time regardless of the number of transfer transactions. To ensure this data consistency (the C in ACID), SQL databases necessita...

Paper Insights #13 - Delta Lake: High Performance ACID Table Storage over Cloud Object Stores

At the 2020 VLDB conference, a notable paper was presented by  Michael Armbrust  (Databricks), with co-authors including CEO  Ali Ghodsi  and  Matei Zaharia . Paper Link Before we delve into the paper's details, I would like to introduce some topics to readers. Cloud Data Store The paper effectively describes the design of a cloud data store. Due to its key-value nature and simple API, it has seen wider adoption than a fully-fledged distributed file system. Popular examples of cloud data stores include  Google Cloud Storage ,  Amazon S3 , and  Azure Blob Storage . Design Points Key-Value Store with Eventual Consistency : Functions as a key-value store with eventual consistency. Keys resemble file paths (strings) while values can be byte arrays ranging from a few kilobytes to terabytes. Data Immutability : In most cloud stores, data is immutable. Appends are possible but generally not optimal. Unlike a file system where appends result in addin...

Paper Insights #19 - Kafka: A Distributed Messaging System for Log Processing

This paper was authored by Jay Kreps, Neha Narkhede , and Jun Rao. This seminal paper, presented at the NetDB '11 workshop, laid the foundation for Apache Kafka , a highly influential open-source project in the realm of distributed systems. Paper Link While the paper initially focused on a specific use case – log processing – Kafka has since evolved into a versatile and robust platform for general message delivery. Both Jay Kreps and Neha Narkhede went on to co-found Confluent Inc. , a company commercializing Kafka. Although workshop papers typically carry less weight than conference papers, this particular work garnered significant attention and has had a profound impact on the field. The paper's relatively weak evaluation section may have contributed to its non-selection for the main conference track. However, this in no way diminishes its significance and the lasting influence of Apache Kafka. Messaging Systems Messaging systems facilitate the exchange of messages between di...

Paper Insights #5 - The Design and Implementation of a Log-Structured File System

This paper, authored by M. Rosenblum (co-founder of VMware) and J. Ousterhout, explores Log-Structured File Systems (LFS). While LFS was previously considered obsolete, the rise of Solid State Drives (SSDs) has rekindled interest in its core principles, particularly the concept of immutability. Paper Link Modern file systems, such as RAID 5, incorporate principles from log-structured file systems. HP's commercial AutoRAID product, for example, is based on RAID 5. Let's begin with some basic concepts. File A file is an ordered collection of bytes. Files can reside in various locations, such as on disk, in memory, or across a network. This article focuses on disk-based files. While Von Neumann architecture efficiently utilizes processors and memory, the need for files arose from the desire for persistence. Files provide a mechanism to save the results of a program so they can be retrieved and used later, essentially preserving data across sessions. Essentially File is also ...

Paper Insights #15 - Dynamo: Amazon's Highly Available Key-value Store

This groundbreaking paper, presented at SOSP 2007, has become a cornerstone in the field of computer systems, profoundly influencing subsequent research and development. It served as a blueprint for numerous NoSQL databases, including prominent examples like MongoDB ,  Cassandra , and Azure Cosmos DB . Paper Link A deep dive into this work is essential for anyone interested in distributed systems. It explores several innovative concepts that will captivate and enlighten readers. Let's visit some fundamental ideas (with a caution that there are several of them!). Distributed Hash Tables (DHTs) A DHT is a decentralized system that provides a lookup service akin to a traditional hash table. Key characteristics of DHTs include: Autonomy and Decentralization: Nodes operate independently, forming the system without centralized control. Fault Tolerance: The system remains reliable even when nodes join, leave, or fail. Scalability: It efficiently handles systems with thousands or mil...

Paper Insights #22 - A New Presumed Commit Optimization for Two Phase Commit

Lampson and Lomet 's 1993 paper, from the now-defunct DEC Cambridge Research Lab, remains a classic. Paper Link The paper's concept are hard to grasp. My notes below are elaborated, yet, it may require multiple readings to fully comprehend the reasonings. Let's begin by reviewing fundamental concepts of SQL databases. Serializability Transaction serializability guarantees that, while transactions may execute concurrently for performance reasons, the final outcome is effectively equivalent to some sequential execution of those same transactions. The "effectively" part means that the system ensures a consistent, serializable result even if the underlying execution is parallelized. Strict serializability builds upon serializability by adding a temporal dimension. It mandates that once a transaction commits, its effects are immediately visible to all clients (a.k.a. external consistency ). This differs from linearizability, which focuses on single-object operati...

Paper Insights #24 - Spanner: Google's Globally-Distributed Database

This landmark paper, presented at ODSI '12, has become one of Google's most significant contributions to distributed computing. It didn't solve the long-standing core problem of scalability of 2PC in distributed systems, rather, it introduced  TrueTime  that revolutionized system assumptions. Authored by J.C. Corbett , with contributions from pioneers like Jeff Dean and Sanjay Ghemawat , this paper effectively ended my exploration of distributed SQL databases. It represents the leading edge of the field. Paper Link I would highly recommend reading the following before jumping into this article: 1.  Practical Uses of Synchronized Clocks in Distributed Systems where I introduced why clock synchronization is necessary but not sufficient for external consistency. 2.  A New Presumed Commit Optimization for Two Phase Commit where I introduced two-phase commits (2PC) and how it is solved in a distributed system. 3.  Amazon Aurora: Design Considerations for High Th...