
Paper Insights #1 - Moving Beyond End-to-End Path Information to Optimize CDN Performance

This highly influential paper on Content Delivery Networks (CDNs) was authored by Rupa Krishnan et al., including Sushant Jain, who was listed fourth among the authors. Sushant was a valued colleague of mine at Google Ads Infrastructure, where he served as Senior Engineering Director for many years.

Paper Link

Before delving into the paper's concepts, which are generally straightforward to grasp, let's explore some relevant background information.

OASIS (2006)

OASIS, developed by M. Freedman, K. Lakshminarayanan, and my former Distributed Systems (CS244b) professor at Stanford, D. Mazieres, elegantly addresses a critical challenge for the Internet: locating the service replica with the lowest latency for a given client.

Prior to OASIS

Clients naively pinged every service replica to determine the fastest one based on round-trip time (RTT). While highly accurate, this approach suffered from excessive probing and computationally expensive comparisons.

OASIS Architecture

OASIS introduced a two-tier architecture: OASIS itself runs a set of replicas that redirect a client to the nearest replica of any registered service.

The OASIS replicas:

  1. Possess global membership information about all service replicas.
  2. Employ epidemic gossiping to detect and handle failures.
  3. Utilize consistent hashing for efficient replica placement.

The 2000s witnessed a surge in decentralized protocols, such as those employing consistent hashing. OASIS is an example of such an approach.

OASIS Algorithm

Step 1

The client's IP address is converted to an IP prefix. An IP prefix, or network prefix, is a group of IP addresses that identifies a network. IP prefixes help organize IP addresses and the devices connected to the Internet.

18.26.4.9 -> 18.0.0.0/8

18.0.0.0/8 is a CIDR notation representing a network address in routing. The /n suffix in CIDR indicates the number of bits in the subnet mask that are set to 1. In this specific case, /8 signifies a subnet mask of 255.0.0.0. To derive the network prefix from an IP address, we need to perform a bitwise AND operation between the IP address and the subnet mask.
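The prefix derivation described above can be sketched with Python's standard ipaddress module; the bitwise AND and the one-step library call produce the same result:

```python
import ipaddress

# Derive the network prefix by AND-ing the address with the subnet mask.
# A /8 prefix corresponds to the mask 255.0.0.0.
addr = ipaddress.IPv4Address("18.26.4.9")
mask = ipaddress.IPv4Address("255.0.0.0")
network = ipaddress.IPv4Address(int(addr) & int(mask))
print(f"{network}/8")  # -> 18.0.0.0/8

# The standard library can do the same in one step:
prefix = ipaddress.ip_network("18.26.4.9/8", strict=False)
print(prefix)  # -> 18.0.0.0/8
```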

Step 2

This IP prefix is mapped to a geographic location. Since geographic proximity correlates strongly with RTT, the OASIS replica redirects the client to the geographically closest service replica.

If multiple replicas of a service reside in the same geographic region, the client probes only those replicas to select the one with the lowest RTT, significantly reducing the probe space.
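The two-step selection can be sketched as follows; the replica names, regions, and measured RTTs are made up for illustration (a real client would ping each in-region candidate):

```python
# Minimal sketch of OASIS's selection step: restrict candidates to the
# client's geographic region, then probe only those for the lowest RTT.
replicas = [
    {"host": "replica-us-1", "region": "us-east"},
    {"host": "replica-us-2", "region": "us-east"},
    {"host": "replica-eu-1", "region": "eu-west"},
]

# Hypothetical probe results in milliseconds.
measured_rtt = {"replica-us-1": 12.0, "replica-us-2": 7.5, "replica-eu-1": 90.0}

def closest_replica(client_region: str) -> str:
    # Only in-region replicas are probed, shrinking the probe space.
    candidates = [r for r in replicas if r["region"] == client_region]
    return min(candidates, key=lambda r: measured_rtt[r["host"]])["host"]

print(closest_replica("us-east"))  # -> replica-us-2
```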

Integration and Impact

OASIS was seamlessly integrated into existing protocols like DNS and HTTP, demonstrating its practicality and potential for widespread adoption.

Google CDN Architecture

Google's CDN architecture shares similarities with OASIS, but with a key distinction: it utilizes the IP prefix of the DNS server initiating the request for redirection, rather than the client's IP address.

Goal

The authors contend that relying solely on RTT (i.e., end-to-end path information) for redirection may not consistently deliver the best possible Quality of Service (QoS) to the client.

They delve into various factors that can contribute to suboptimal QoS, beyond simple RTT. The paper then explores methods for identifying and mitigating these contributing factors.

Before delving deeper into their proposed techniques, it's crucial to establish a common understanding of the key network terminologies employed throughout the paper.

Networking Architecture

Routing

Computers within a Local Area Network (LAN), such as your school network or a cluster of machines in a data center, communicate directly. These networks typically employ network switches to efficiently forward packets to their intended destinations based on their unique Media Access Control (MAC) addresses, which operate at the Ethernet layer.

However, when a packet needs to traverse beyond the confines of a local network, routing becomes essential. By examining the destination IP address of each packet, routers consult their routing tables to determine the optimal next hop in the network, sending the packet closer to its ultimate destination. These routing tables are dynamically calculated and maintained through sophisticated distributed algorithms.
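The routing-table lookup described above can be sketched with a toy table; real routers use trie structures for longest-prefix match, but a linear scan is enough to show the idea (prefixes and next-hop names are made up):

```python
import ipaddress

# A toy routing table mapping prefixes to next hops.
routing_table = {
    ipaddress.ip_network("0.0.0.0/0"): "default-gw",
    ipaddress.ip_network("18.0.0.0/8"): "router-a",
    ipaddress.ip_network("18.26.0.0/16"): "router-b",
}

def next_hop(dst: str) -> str:
    addr = ipaddress.ip_address(dst)
    matches = [net for net in routing_table if addr in net]
    # The longest (most specific) matching prefix wins.
    best = max(matches, key=lambda net: net.prefixlen)
    return routing_table[best]

print(next_hop("18.26.4.9"))  # -> router-b
print(next_hop("8.8.8.8"))    # -> default-gw
```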

Point of Presence (PoP)

A PoP is a designated entry point where end users can connect to an Internet Service Provider (ISP). Each ISP maintains a network of PoPs.

Within a single PoP, multiple routers may be present, allowing various paths for data packets to travel from end users into the PoP.

Autonomous System (AS)

An AS is a collection of interconnected networks under the control of a single administrative entity. Each AS operates with its own distinct routing policies, determining how traffic flows within its boundaries. Common examples of entities operating ASes include:

  • ISPs
  • Large corporations (like Google)
  • Universities
  • Government agencies

Google's internal network, connecting its vast infrastructure (like the Borg cluster), functions as an AS with its own internal routing policies.

Border gateway routers (BGP routers) are specialized routers that reside at the edges of an AS and are responsible for exchanging routing information with other ASes.

Traceroute

Traceroute is a widely used network diagnostic tool that reveals the sequence of routers traversed by packets traveling from a source to a destination. The router hops displayed by traceroute signify the number of routing tables consulted along the path, providing insight into network topology.

It's important to note that the performance of Internet traffic can be impacted by the stability and condition of routers along the path, some of which may be older or less reliable.

Traceroute operates by sending a series of probe packets (UDP or ICMP, depending on the implementation) with increasing Time-to-Live (TTL) values. The router at which a probe's TTL expires responds with an ICMP Time Exceeded message, revealing itself as one hop on the path.


A fundamental characteristic of the Internet is that the path taken by packets from source A to destination B may differ significantly from the return path from B to A. For example:
  • A -> X -> Y -> B
  • B -> Y -> Z -> X -> A

Traceroute can only measure the RTT for each hop, not the one-way travel time. Thus, while traceroute effectively reveals the forward path, determining the reverse path requires more complex analysis.
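The TTL trick can be simulated over a modeled path without sending real packets. The hop names and one-way delays below are made up, and the simulation assumes replies return over the same path, which, as noted above, is not guaranteed on the real Internet:

```python
# Simulate traceroute over a modeled path. Each probe carries a TTL; the
# router at which the TTL expires answers with ICMP Time Exceeded, so each
# probe reveals one hop and the RTT up to it.
path = [("R1", 1.0), ("R2", 1.0), ("R3", 48.0), ("B", 2.0)]  # (hop, one-way ms)

def trace(path):
    hops = []
    for ttl in range(1, len(path) + 1):
        # The probe expires at hop `ttl`; the RTT is roughly twice the
        # cumulative one-way delay up to that router (symmetric-path assumption).
        name = path[ttl - 1][0]
        rtt = 2 * sum(delay for _, delay in path[:ttl])
        hops.append((name, rtt))
    return hops

for name, rtt in trace(path):
    print(f"{name}: {rtt:.1f} ms")
```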

iPlane

iPlane is a system designed to predict path properties between any two points on the Internet. iPlane uses traceroutes to all Internet prefixes from numerous vantage points (e.g., PlanetLab servers).

By analyzing the collected traceroute information, iPlane clusters routers into PoPs. Routers are grouped together if they exhibit the following characteristics:

  • Respond to traceroutes with the same source IP address.
  • Display similar RTT values across different vantage points.

This clustering technique enables iPlane to effectively model and predict Internet path characteristics, providing valuable insights into network topology and performance.
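The RTT-similarity half of that clustering can be sketched as follows. Each router gets a vector of RTTs, one entry per vantage point; routers whose vectors stay within a tolerance of each other are grouped into the same PoP. The data and the tolerance are made up:

```python
# Sketch of iPlane-style clustering by RTT similarity across vantage points.
rtt_vectors = {
    "r1": [10.0, 52.0, 95.0],
    "r2": [11.0, 50.0, 96.0],  # close to r1 at every vantage point -> same PoP
    "r3": [80.0, 12.0, 40.0],  # far from both -> its own PoP
}

def similar(a, b, tol=5.0):
    return all(abs(x - y) <= tol for x, y in zip(a, b))

def cluster(vectors):
    pops = []  # each PoP is a list of router names
    for name, vec in vectors.items():
        for pop in pops:
            if similar(vectors[pop[0]], vec):
                pop.append(name)
                break
        else:
            pops.append([name])
    return pops

print(cluster(rtt_vectors))  # -> [['r1', 'r2'], ['r3']]
```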

Latency Cause Detection Techniques

This paper explores several techniques for detecting latency issues in Google's CDN:

1. Redirection

Issue: Clients may experience increased latency when they are not redirected to the geographically closest CDN node. This can occur, for example, when the closest node is overloaded.

Detection: CDN nodes themselves can detect mismatched redirection by flagging requests arriving from IP prefixes that they are not responsible for serving.
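This check reduces to a membership test against the node's assigned prefixes; the assignments below are hypothetical:

```python
import ipaddress

# Prefixes this CDN node is responsible for serving (made up for illustration).
assigned_prefixes = {
    ipaddress.ip_network("18.0.0.0/8"),
    ipaddress.ip_network("52.0.0.0/8"),
}

def is_mismatched(client_ip: str) -> bool:
    # A request is flagged if its source falls in no assigned prefix.
    addr = ipaddress.ip_address(client_ip)
    return not any(addr in p for p in assigned_prefixes)

print(is_mismatched("18.26.4.9"))    # -> False (served by this node)
print(is_mismatched("203.0.113.7"))  # -> True  (should have gone elsewhere)
```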

2. Prefix Latency Inflation

Issue: Even when multiple IP prefixes are served by the same CDN node, some prefixes may exhibit significantly higher latency than others.

Detection: This is identified by performing traceroute analysis. For instance, if the route path is:

CDN -> R1 (1ms) -> R2 (2ms) -> R3 (100ms) -> Client

A significant jump in RTT between R2 and R3 (from 2ms to 100ms) is considered anomalous, especially since the delay between the CDN and R2 is only 2ms and the expected inter-hop delay is typically less than 40ms anywhere on Earth. This suggests a circuitous reverse path from R3.
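This inflation check amounts to flagging any hop-to-hop RTT jump above a threshold; the 40ms cutoff follows the observation above, while the hop data is made up:

```python
# Flag hops where the cumulative RTT jumps by more than a threshold
# relative to the previous hop, suggesting a circuitous reverse path.
hops = [("R1", 1.0), ("R2", 2.0), ("R3", 100.0)]  # (router, cumulative RTT ms)
THRESHOLD_MS = 40.0

def anomalous_hops(hops):
    flagged = []
    prev_rtt = 0.0
    for name, rtt in hops:
        if rtt - prev_rtt > THRESHOLD_MS:
            flagged.append(name)
        prev_rtt = rtt
    return flagged

print(anomalous_hops(hops))  # -> ['R3']
```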

3. Queueing Delays

Issue: Even within a single prefix, a significant difference between the median and minimum RTT observed by clients can indicate either different paths or queueing delays.

Observation: RTT inflation within a prefix is unaffected by:
  • Changes in routes throughout the day.
  • Changes in the PoP paths (determined using iPlane data).
  • Changes in the AS path.

The authors conclude that persistent RTT inflation is likely caused by queueing delays within the network.
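The median-versus-minimum signal can be computed per prefix as sketched below; the minimum RTT approximates the uncongested path latency, so a median far above it points at queueing. The samples are made up:

```python
import statistics

# Per-prefix RTT samples in milliseconds (hypothetical data).
rtt_samples_ms = {
    "18.0.0.0/8": [20, 21, 22, 20, 23],     # median near minimum -> healthy
    "203.0.0.0/8": [30, 95, 110, 100, 90],  # inflated median -> likely queueing
}

def inflation_ms(samples):
    # Gap between the typical RTT and the best-case (uncongested) RTT.
    return statistics.median(samples) - min(samples)

for prefix, samples in rtt_samples_ms.items():
    print(prefix, inflation_ms(samples))
```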

Only (1) and (2) are addressed by the authors. In summary, (1) identifies anomalous prefixes facing redirection, and (2) identifies prefixes within a CDN node facing anomalously high RTT.

Mitigation

The paper explores several approaches to address the observed inflated RTT issue:

  • Direct Peering: Establishing direct peering connections between Google and the relevant ASes for the affected prefixes. This solution is exemplified in Case 1, where a new peering link was added between Google and PhilISP1.
  • Increased Peering Capacity: Enhancing the capacity of existing peering links between Google and ASes. This is demonstrated in Case 3, where a peering connection between Google and PhilISP2 existed but suffered from insufficient capacity.
  • Routing Configuration Optimization: Correcting routing configurations on border routers within Google's or the AS's network. Case 4 illustrates this approach, where Google's network administrators adjusted routing configurations to enable shorter reverse paths between Google and a JapanISP.
  • Traffic Engineering Techniques: Implementing various traffic engineering approaches to optimize network traffic flow.

Coral CDN 

Among the related works discussed in the paper, Coral CDN stands out. While I recall using Coral CDN in the 2000s, I never delved into its internal workings. Developed by the authors of OASIS, Coral CDN was a free content distribution network designed to mirror web content.

Coral CDN's primary use case was to mitigate "slashdotting", a phenomenon where a website, particularly smaller ones, experiences a sudden and overwhelming surge in traffic after being linked by a popular website.

The most distinctive feature of Coral CDN is its utilization of a Distributed Sloppy Hash Table (DSHT). Unlike traditional Distributed Hash Tables (DHTs) that rely on a single hash ring, DSHT employs multiple concentric Chord hash rings arranged hierarchically. DSHT can be adapted to work with other hashing protocols like Kademlia.

Coral CDN Interface

Coral provides a simple interface for higher-level applications:

  • put(key, val, ttl): Inserts a key-value mapping with a specified TTL for the entry.
  • get(key): Retrieves a subset of the values associated with the given key.

Hierarchical Implementation

The Coral CDN implementation utilizes a three-tiered hierarchical structure:

  • Regional Layer: Connects nodes within the same geographical region.
  • Continental Layer: Connects nodes within a continent.
  • Global Layer: Connects nodes globally.

Nodes keep the same ID across layers, but each layer forms its own ring over a different set of nodes, so a node's assigned key range differs per layer. During the put operation, the value is replicated across all three layers, ensuring redundancy. Conversely, during the get operation, the system prioritizes retrieving the value from the local regional layer.
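The layered put/get behavior can be modeled with a toy in-memory store. This is a deliberate simplification: layer names follow the hierarchy above, but the real DSHT also handles TTLs and "sloppy" insertion that stops short at heavily loaded nodes:

```python
# Toy model of Coral's layered lookup: put replicates into every layer;
# get prefers the closest layer and falls back outward.
layers = {"regional": {}, "continental": {}, "global": {}}
ORDER = ["regional", "continental", "global"]

def put(key, val):
    for name in ORDER:
        layers[name].setdefault(key, []).append(val)

def get(key):
    for name in ORDER:  # try the regional layer first
        if key in layers[name]:
            return name, layers[name][key]
    return None, []

put("example.com/page", "proxy-tokyo")
print(get("example.com/page"))  # -> ('regional', ['proxy-tokyo'])
```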

Paper Review

The paper does introduce readers to design concepts related to CDNs. Additionally, fixing the issues discovered by these methods could have a very significant impact, potentially saving Google millions of dollars in network expenditure. However, the paper does not introduce any truly novel concepts; rather, it provides a strong approach to Internet analysis. I especially loved the part where they determined the queueing delays and circuitous reverse paths.
