
Paper Insights #1 - Moving Beyond End-to-End Path Information to Optimize CDN Performance

This highly influential paper on Content Delivery Networks (CDNs) was authored by Rupa Krishnan et al., including Sushant Jain, who was listed fourth among the authors. Sushant was a valued colleague of mine at Google Ads Infrastructure, where he served as Senior Engineering Director for many years.

Paper Link

Before delving into the paper's concepts, which are generally straightforward to grasp, let's explore some relevant background information.

OASIS (2006)

OASIS, developed by M. Freedman, K. Lakshminarayanan, and my former Distributed Systems (CS244b) professor at Stanford, D. Mazieres, elegantly addresses a critical challenge for the Internet: locating the service replica with the lowest latency for a given client.

Prior to OASIS

Clients naively pinged every service replica to determine the fastest one based on round-trip time (RTT). While highly accurate, this approach suffered from excessive probing and computationally expensive comparisons.

OASIS Architecture

OASIS introduced a two-tier architecture: OASIS itself runs a set of replicas that redirect a client to the nearest replica of any registered service.

The OASIS replicas:

  1. Possess global membership information about all service replicas.
  2. Employ epidemic gossiping to detect and handle failures.
  3. Utilize consistent hashing for efficient replica placement.

The 2000s witnessed a surge in decentralized protocols, such as those employing consistent hashing. OASIS is an example of such an approach.

OASIS Algorithm

Step 1

The client's IP address is converted to an IP prefix. An IP prefix, or network prefix, is a group of IP addresses that identifies a network. IP prefixes help organize IP addresses and the devices connected to the Internet.

18.26.4.9 -> 18.0.0.0/8

18.0.0.0/8 is a CIDR notation representing a network address in routing. The /n suffix in CIDR indicates the number of bits in the subnet mask that are set to 1. In this specific case, /8 signifies a subnet mask of 255.0.0.0. To derive the network prefix from an IP address, we need to perform a bitwise AND operation between the IP address and the subnet mask.
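The prefix derivation described above can be sketched with Python's standard ipaddress module; the bitwise AND and the one-step library call produce the same result:

```python
import ipaddress

# Derive the network prefix by AND-ing the address with the subnet mask.
# A /8 prefix corresponds to the mask 255.0.0.0.
addr = ipaddress.IPv4Address("18.26.4.9")
mask = ipaddress.IPv4Address("255.0.0.0")
network = ipaddress.IPv4Address(int(addr) & int(mask))
print(f"{network}/8")  # -> 18.0.0.0/8

# The standard library can do the same in one step:
prefix = ipaddress.ip_network("18.26.4.9/8", strict=False)
print(prefix)  # -> 18.0.0.0/8
```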

Step 2

This IP prefix is mapped to a geographic location. Since geographic proximity correlates strongly with RTT, the OASIS replica redirects the client to the geographically closest service replica.

If multiple replicas of a service reside in the same geographic region, the client probes only those replicas to select the one with the lowest RTT, significantly reducing the probe space.
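The two-step selection can be sketched as follows; the replica names, regions, and measured RTTs are made up for illustration (a real client would ping each in-region candidate):

```python
# Minimal sketch of OASIS's selection step: restrict candidates to the
# client's geographic region, then probe only those for the lowest RTT.
replicas = [
    {"host": "replica-us-1", "region": "us-east"},
    {"host": "replica-us-2", "region": "us-east"},
    {"host": "replica-eu-1", "region": "eu-west"},
]

# Hypothetical probe results in milliseconds.
measured_rtt = {"replica-us-1": 12.0, "replica-us-2": 7.5, "replica-eu-1": 90.0}

def closest_replica(client_region: str) -> str:
    # Only in-region replicas are probed, shrinking the probe space.
    candidates = [r for r in replicas if r["region"] == client_region]
    return min(candidates, key=lambda r: measured_rtt[r["host"]])["host"]

print(closest_replica("us-east"))  # -> replica-us-2
```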

Integration and Impact

OASIS was seamlessly integrated into existing protocols like DNS and HTTP, demonstrating its practicality and potential for widespread adoption.

Google CDN Architecture

Google's CDN architecture shares similarities with OASIS, but with a key distinction: it utilizes the IP prefix of the DNS server initiating the request for redirection, rather than the client's IP address.

Goal

The authors contend that relying solely on RTT (i.e., end-to-end path information) for redirection may not consistently deliver the best possible Quality of Service (QoS) to the client.

They delve into various factors that can contribute to suboptimal QoS, beyond simple RTT. The paper then explores methods for identifying and mitigating these contributing factors.

Before delving deeper into their proposed techniques, it's crucial to establish a common understanding of the key network terminologies employed throughout the paper.

Networking Architecture

Routing

Computers within a Local Area Network (LAN), such as your school network or a cluster of machines in a data center, communicate directly. These networks typically employ network switches to efficiently forward packets to their intended destinations based on their unique Media Access Control (MAC) addresses, which operate at the Ethernet layer.

However, when a packet needs to traverse beyond the confines of a local network, routing becomes essential. By examining the destination IP address of each packet, routers consult their routing tables to determine the optimal next hop in the network, sending the packet closer to its ultimate destination. These routing tables are dynamically calculated and maintained through sophisticated distributed algorithms.
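The routing-table lookup described above can be sketched with a toy table; real routers use trie structures for longest-prefix match, but a linear scan is enough to show the idea (prefixes and next-hop names are made up):

```python
import ipaddress

# A toy routing table mapping prefixes to next hops.
routing_table = {
    ipaddress.ip_network("0.0.0.0/0"): "default-gw",
    ipaddress.ip_network("18.0.0.0/8"): "router-a",
    ipaddress.ip_network("18.26.0.0/16"): "router-b",
}

def next_hop(dst: str) -> str:
    addr = ipaddress.ip_address(dst)
    matches = [net for net in routing_table if addr in net]
    # The longest (most specific) matching prefix wins.
    best = max(matches, key=lambda net: net.prefixlen)
    return routing_table[best]

print(next_hop("18.26.4.9"))  # -> router-b
print(next_hop("8.8.8.8"))    # -> default-gw
```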

Point of Presence (PoP)

A PoP is a designated entry point where end users can connect to an Internet Service Provider (ISP). Each ISP maintains a network of PoPs.

Within a single PoP, multiple routers may be present, allowing various paths for data packets to travel from end users into the PoP.

Autonomous System (AS)

An AS is a collection of interconnected networks under the control of a single administrative entity. Each AS operates with its own distinct routing policies, determining how traffic flows within its boundaries. Common examples of entities operating ASes include:

  • ISPs
  • Large corporations (like Google)
  • Universities
  • Government agencies

Google's internal network, connecting its vast infrastructure (like the Borg cluster), functions as an AS with its own internal routing policies.

Border gateway routers (BGP routers) are specialized routers that reside at the edges of an AS and are responsible for exchanging routing information with other ASes.

Traceroute

Traceroute is a widely used network diagnostic tool that reveals the sequence of routers traversed by packets traveling from a source to a destination. The router hops displayed by traceroute signify the number of routing tables consulted along the path, providing insight into network topology.

It's important to note that the performance of Internet traffic can be impacted by the stability and condition of routers along the path, some of which may be older or less reliable.

Traceroute operates by sending a series of probe packets (UDP or ICMP, depending on the implementation) with increasing Time-to-Live (TTL) values. The router at which a probe's TTL expires responds with an ICMP Time Exceeded message, revealing itself as one hop on the path.


A fundamental characteristic of the Internet is that the path taken by packets from source A to destination B may differ significantly from the return path from B to A. For example:
  • A -> X -> Y -> B
  • B -> Y -> Z -> X -> A

Traceroute can only measure the RTT for each hop, not the one-way travel time. Thus, while traceroute effectively reveals the forward path, determining the reverse path requires more complex analysis.
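The TTL trick can be simulated over a modeled path without sending real packets. The hop names and one-way delays below are made up, and the simulation assumes replies return over the same path, which, as noted above, is not guaranteed on the real Internet:

```python
# Simulate traceroute over a modeled path. Each probe carries a TTL; the
# router at which the TTL expires answers with ICMP Time Exceeded, so each
# probe reveals one hop and the RTT up to it.
path = [("R1", 1.0), ("R2", 1.0), ("R3", 48.0), ("B", 2.0)]  # (hop, one-way ms)

def trace(path):
    hops = []
    for ttl in range(1, len(path) + 1):
        # The probe expires at hop `ttl`; the RTT is roughly twice the
        # cumulative one-way delay up to that router (symmetric-path assumption).
        name = path[ttl - 1][0]
        rtt = 2 * sum(delay for _, delay in path[:ttl])
        hops.append((name, rtt))
    return hops

for name, rtt in trace(path):
    print(f"{name}: {rtt:.1f} ms")
```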

iPlane

iPlane is a system designed to predict path properties between any two points on the Internet. iPlane uses traceroutes to all Internet prefixes from numerous vantage points (e.g., PlanetLab servers).

By analyzing the collected traceroute information, iPlane clusters routers into PoPs. Routers are grouped together if they exhibit the following characteristics:

  • Respond to traceroutes with the same source IP address.
  • Display similar RTT values across different vantage points.

This clustering technique enables iPlane to effectively model and predict Internet path characteristics, providing valuable insights into network topology and performance.
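The RTT-similarity half of that clustering can be sketched as follows. Each router gets a vector of RTTs, one entry per vantage point; routers whose vectors stay within a tolerance of each other are grouped into the same PoP. The data and the tolerance are made up:

```python
# Sketch of iPlane-style clustering by RTT similarity across vantage points.
rtt_vectors = {
    "r1": [10.0, 52.0, 95.0],
    "r2": [11.0, 50.0, 96.0],  # close to r1 at every vantage point -> same PoP
    "r3": [80.0, 12.0, 40.0],  # far from both -> its own PoP
}

def similar(a, b, tol=5.0):
    return all(abs(x - y) <= tol for x, y in zip(a, b))

def cluster(vectors):
    pops = []  # each PoP is a list of router names
    for name, vec in vectors.items():
        for pop in pops:
            if similar(vectors[pop[0]], vec):
                pop.append(name)
                break
        else:
            pops.append([name])
    return pops

print(cluster(rtt_vectors))  # -> [['r1', 'r2'], ['r3']]
```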

Latency Cause Detection Techniques

This paper explores several techniques for detecting latency issues in Google's CDN:

1. Redirection

Issue: Clients may experience increased latency when they are not redirected to the geographically closest CDN node. This can occur, for example, when the closest node is overloaded.

Detection: CDN nodes themselves can detect mismatched redirection by flagging requests arriving from IP prefixes that they are not responsible for serving.
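This check reduces to a membership test against the node's assigned prefixes; the assignments below are hypothetical:

```python
import ipaddress

# Prefixes this CDN node is responsible for serving (made up for illustration).
assigned_prefixes = {
    ipaddress.ip_network("18.0.0.0/8"),
    ipaddress.ip_network("52.0.0.0/8"),
}

def is_mismatched(client_ip: str) -> bool:
    # A request is flagged if its source falls in no assigned prefix.
    addr = ipaddress.ip_address(client_ip)
    return not any(addr in p for p in assigned_prefixes)

print(is_mismatched("18.26.4.9"))    # -> False (served by this node)
print(is_mismatched("203.0.113.7"))  # -> True  (should have gone elsewhere)
```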

2. Prefix Latency Inflation

Issue: Even when multiple IP prefixes are served by the same CDN node, some prefixes may exhibit significantly higher latency than others.

Detection: This is identified by performing traceroute analysis. For instance, if the route path is:

CDN -> R1 (1ms) -> R2 (2ms) -> R3 (100ms) -> Client

A significant jump in RTT between R2 and R3 (from 2ms to 100ms) is considered anomalous, especially since the delay between the CDN and R2 is only 2ms and the expected inter-hop delay is typically less than 40ms anywhere on Earth. This suggests a circuitous reverse path from R3.
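This inflation check amounts to flagging any hop-to-hop RTT jump above a threshold; the 40ms cutoff follows the observation above, while the hop data is made up:

```python
# Flag hops where the cumulative RTT jumps by more than a threshold
# relative to the previous hop, suggesting a circuitous reverse path.
hops = [("R1", 1.0), ("R2", 2.0), ("R3", 100.0)]  # (router, cumulative RTT ms)
THRESHOLD_MS = 40.0

def anomalous_hops(hops):
    flagged = []
    prev_rtt = 0.0
    for name, rtt in hops:
        if rtt - prev_rtt > THRESHOLD_MS:
            flagged.append(name)
        prev_rtt = rtt
    return flagged

print(anomalous_hops(hops))  # -> ['R3']
```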

3. Queueing Delays

Issue: Even within a single prefix, a significant difference between the median and minimum RTT observed by clients can indicate either different paths or queueing delays.

Observation: RTT inflation within a prefix is unaffected by:
  • Changes in routes throughout the day.
  • Changes in the PoP paths (determined using iPlane data).
  • Changes in the AS path.

The authors conclude that persistent RTT inflation is likely caused by queueing delays within the network.
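The median-versus-minimum signal can be computed per prefix as sketched below; the minimum RTT approximates the uncongested path latency, so a median far above it points at queueing. The samples are made up:

```python
import statistics

# Per-prefix RTT samples in milliseconds (hypothetical data).
rtt_samples_ms = {
    "18.0.0.0/8": [20, 21, 22, 20, 23],     # median near minimum -> healthy
    "203.0.0.0/8": [30, 95, 110, 100, 90],  # inflated median -> likely queueing
}

def inflation_ms(samples):
    # Gap between the typical RTT and the best-case (uncongested) RTT.
    return statistics.median(samples) - min(samples)

for prefix, samples in rtt_samples_ms.items():
    print(prefix, inflation_ms(samples))
```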

Only (1) and (2) are addressed by the authors. In summary, (1) identifies anomalous prefixes facing redirection, and (2) identifies prefixes within a CDN node facing anomalously high RTT.

Mitigation

The paper explores several approaches to address the observed inflated RTT issue:

  • Direct Peering: Establishing direct peering connections between Google and the relevant ASes for the affected prefixes. This solution is exemplified in Case 1, where a new peering link was added between Google and PhilISP1.
  • Increased Peering Capacity: Enhancing the capacity of existing peering links between Google and ASes. This is demonstrated in Case 3, where a peering connection between Google and PhilISP2 existed but suffered from insufficient capacity.
  • Routing Configuration Optimization: Correcting routing configurations on border routers within Google's or the AS's network. Case 4 illustrates this approach, where Google's network administrators adjusted routing configurations to enable shorter reverse paths between Google and a JapanISP.
  • Traffic Engineering Techniques: Implementing various traffic engineering approaches to optimize network traffic flow.

Coral CDN 

Among the related works discussed in the paper, Coral CDN stands out. While I recall using Coral CDN in the 2000s, I never delved into its internal workings. Developed by the authors of OASIS, Coral CDN was a free content distribution network designed to mirror web content.

Coral CDN's primary use case was to mitigate "slashdotting", a phenomenon where a website, particularly smaller ones, experiences a sudden and overwhelming surge in traffic after being linked by a popular website.

The most distinctive feature of Coral CDN is its utilization of a Distributed Sloppy Hash Table (DSHT). Unlike traditional Distributed Hash Tables (DHTs) that rely on a single hash ring, DSHT employs multiple concentric Chord hash rings arranged hierarchically. DSHT can be adapted to work with other hashing protocols like Kademlia.

Coral CDN Interface

Coral provides a simple interface for higher-level applications:

  • put(key, val, ttl): Inserts a key-value mapping with a specified TTL for the entry.
  • get(key): Retrieves a subset of the values associated with the given key.

Hierarchical Implementation

The Coral CDN implementation utilizes a three-tiered hierarchical structure:

  • Regional Layer: Connects nodes within the same geographical region.
  • Continental Layer: Connects nodes within a continent.
  • Global Layer: Connects nodes globally.

Nodes keep the same ID across layers, but each layer forms its own ring over a different set of nodes, so a node's assigned key range differs per layer. During the put operation, the value is replicated across all three layers, ensuring redundancy. Conversely, during the get operation, the system prioritizes retrieving the value from the local regional layer.
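The layered put/get behavior can be modeled with a toy in-memory store. This is a deliberate simplification: layer names follow the hierarchy above, but the real DSHT also handles TTLs and "sloppy" insertion that stops short at heavily loaded nodes:

```python
# Toy model of Coral's layered lookup: put replicates into every layer;
# get prefers the closest layer and falls back outward.
layers = {"regional": {}, "continental": {}, "global": {}}
ORDER = ["regional", "continental", "global"]

def put(key, val):
    for name in ORDER:
        layers[name].setdefault(key, []).append(val)

def get(key):
    for name in ORDER:  # try the regional layer first
        if key in layers[name]:
            return name, layers[name][key]
    return None, []

put("example.com/page", "proxy-tokyo")
print(get("example.com/page"))  # -> ('regional', ['proxy-tokyo'])
```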

Paper Review

The paper does introduce readers to design concepts related to CDNs. Additionally, fixing the issues discovered by these methods could have a very significant impact, potentially saving Google millions of dollars in network expenditure. However, the paper does not introduce any truly novel concepts; rather, it provides a strong approach to Internet analysis. I especially loved the part where they determined the queueing delays and circuitous reverse paths.
