
Paper Insights #25 - Eliminating Receive Livelock in an Interrupt-driven Kernel

Jeff Mogul, a pioneering figure in computer systems, co-authored this influential paper (with K. K. Ramakrishnan), presented at USENIX ATC 1996. Since then, it has become a seminal work, sparking extensive discussion in academic circles. I also found it particularly engaging during Stanford's CS240, partly due to its clarity compared to other readings, but primarily because of my deep interest in networking.

Paper Link

Let's begin with some basic concepts.

CPU, Memory, & I/O Communication

In a von Neumann architecture, the CPU and memory are the core components that enable general-purpose computation. Their communication relies on the system bus (e.g., PCIe in modern systems), a set of electrical pathways connecting the CPU, memory, and I/O devices. The system bus comprises three primary logical components:
  • Data Bus: Bidirectional, carrying the actual data being transferred.
  • Address Bus: Unidirectional, carrying memory addresses for data access.
  • Control Bus: Carrying control signals for synchronization and coordination.


Modern systems employ hierarchical bus architectures to enhance communication speed, moving beyond the simple single-bus model.

Interrupt

An interrupt (or trap) is a signal that prompts the CPU to suspend its current instruction execution and instead execute a specific Interrupt Service Routine (ISR).

Types of Interrupts

Interrupts can be broadly categorized as follows:
  • Hardware Interrupts: These are generated by hardware devices, compelling the microprocessor to jump to an ISR. A naive hardware implementation might directly modify the program counter to fetch the ISR's next instruction. However, this approach presents significant security vulnerabilities. A more common and secure method involves raising an interrupt flag on the control bus. The CPU's control unit detects this flag and, at the end of the current instruction cycle, handles the interrupt by branching to the ISR.


  • Software Interrupts: These are signals that instruct the processor to jump to an ISR when specific conditions (evaluated by the processor) are met. For instance, a memory access resulting in a page fault triggers a software interrupt, forcing the CPU to transition from user mode (Ring 3) to kernel mode (Ring 0) to manage the fault. Similarly, a program causing a segmentation fault generates a software interrupt to the kernel.

Interrupt Vector Table (IVT)

The IVT is a lookup table containing the memory addresses of the ISRs corresponding to hardware-generated interrupts. (The kernel already knows the ISR addresses for software interrupts.) The microprocessor uses the IVT to find the appropriate routine for a given hardware interrupt.
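To make the lookup concrete, here is a minimal sketch in C of an interrupt vector table modeled as an array of function pointers. Real tables are architecture-defined structures (e.g., the x86 IDT), so the names `ivt`, `register_isr`, and `dispatch_interrupt` are purely illustrative.

```c
#include <stddef.h>

#define NUM_VECTORS 256

/* An ISR takes no arguments and returns nothing in this simplified model. */
typedef void (*isr_t)(void);

/* The table itself: one entry per interrupt vector. */
static isr_t ivt[NUM_VECTORS];

/* Hypothetical helper: the kernel installs a handler for a vector. */
void register_isr(unsigned vector, isr_t handler) {
    if (vector < NUM_VECTORS)
        ivt[vector] = handler;
}

/* On an interrupt, the hardware (or a common stub) indexes the table
 * and branches to the registered routine. */
void dispatch_interrupt(unsigned vector) {
    if (vector < NUM_VECTORS && ivt[vector] != NULL)
        ivt[vector]();
}
```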

Example: Keyboard Input

When the keyboard receives an input, it generates an interrupt. The ISR reads the input and stores it in a keyboard buffer. The terminal program then reads from this buffer.
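As a rough illustration (not any particular OS's driver), the ISR might copy the scancode into a small ring buffer that the terminal program later drains. `read_scancode_from_device` stands in for the actual port or MMIO read and is hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

#define KBD_BUF_SIZE 128

static uint8_t kbd_buf[KBD_BUF_SIZE];
static volatile unsigned kbd_head, kbd_tail;   /* producer / consumer indices */

/* Hypothetical hardware access; a real driver would read an I/O port or MMIO register. */
extern uint8_t read_scancode_from_device(void);

/* ISR: runs when the keyboard raises an interrupt. Keep it short; just buffer the byte. */
void keyboard_isr(void) {
    uint8_t code = read_scancode_from_device();
    unsigned next = (kbd_head + 1) % KBD_BUF_SIZE;
    if (next != kbd_tail) {          /* drop the byte if the buffer is full */
        kbd_buf[kbd_head] = code;
        kbd_head = next;
    }
}

/* Called later, outside interrupt context, by whoever feeds the terminal. */
bool keyboard_read(uint8_t *out) {
    if (kbd_tail == kbd_head)
        return false;                /* nothing buffered */
    *out = kbd_buf[kbd_tail];
    kbd_tail = (kbd_tail + 1) % KBD_BUF_SIZE;
    return true;
}
```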


Interrupt-Driven Scheduling

A scheduling system manages what executes on the microprocessor. Kernel CPU schedulers typically handle the scheduling of kernel and user threads. However, as discussed, interrupts also dictate when an ISR is executed. Consequently, interrupt-driven scheduling is another form of CPU scheduling.

Therefore, we have different scheduling mechanisms: kernel CPU scheduling and interrupt-driven scheduling. Interrupt-driven scheduling, particularly for hardware interrupts, generally takes precedence: handlers execute in privileged protection rings (level 0 or 1), and their relative precedence among interrupt sources is governed by the interrupt priority level (IPL).

Network Interface Card (NIC)

The NIC is the I/O device that facilitates data reception and transmission over the network.

NIC Interactions

The NIC, like other I/O devices, interfaces directly with the system bus. Additionally, it incorporates an internal buffer memory.


Reception (RX)

  • The NIC buffers incoming data into its internal memory.
  • Once packets have accumulated (or the buffer nears capacity), the NIC generates a maskable interrupt to the CPU.
  • The kernel, in response to the interrupt, copies the data from the NIC's buffer to RAM.
  • Higher-layer protocols then process the data (e.g., TCP/IP copying data to socket buffers), and the application is notified.

Transmission (TX)

  • The application writes data to a memory buffer.
  • The kernel copies the data to the NIC's buffer.
  • The NIC transmits the data over the network.

Direct Memory Access (DMA)

Not to be confused with RDMA.

To improve efficiency, modern NICs, and other hardware like disk controllers and graphics cards, utilize Direct Memory Access (DMA). 

This allows the NIC to directly read and write to system memory without CPU intervention. The NIC has an internal processor (DMA controller) that can manage these memory transfers. While DMA is in progress, the system bus is in use by the DMA controller, briefly making it inaccessible to the CPU. This offloads data transfer tasks from the CPU, allowing it to perform other operations concurrently. 
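The usual arrangement is a descriptor ring shared between the driver and the NIC: the driver posts buffers, the NIC DMAs packets into them and marks descriptors as done. The sketch below is a simplified, hypothetical layout; real NICs define their own descriptor formats and doorbell registers, and `deliver_packet` / `refill_buffer` stand in for kernel logic.

```c
#include <stdint.h>

#define RX_RING_SIZE 256
#define DESC_DONE    0x1

/* One descriptor: where the NIC may DMA a packet, and its status afterwards. */
struct rx_desc {
    uint64_t buf_addr;      /* physical address of a pre-allocated buffer */
    uint16_t length;        /* filled in by the NIC after DMA */
    uint16_t status;        /* e.g., DESC_DONE once the packet has landed */
};

static struct rx_desc rx_ring[RX_RING_SIZE];
static unsigned rx_next;    /* next descriptor the driver expects the NIC to complete */

/* Hypothetical kernel helpers. */
extern void deliver_packet(uint64_t buf_addr, uint16_t length);
extern uint64_t refill_buffer(void);

/* Driver-side reap loop: harvest completed descriptors and recycle them. */
void rx_ring_poll(void) {
    while (rx_ring[rx_next].status & DESC_DONE) {
        struct rx_desc *d = &rx_ring[rx_next];
        deliver_packet(d->buf_addr, d->length);
        d->buf_addr = refill_buffer();   /* hand the slot back to the NIC */
        d->status = 0;
        rx_next = (rx_next + 1) % RX_RING_SIZE;
    }
}
```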

Linux Network Stack

Networking stands out as one of the most intricate I/O operations within a computer system. This complexity stems from the layered nature of the network stack: the MAC layer where the NIC operates, the IP layer governing internet communication, the Transport layer where TCP/UDP functions, and the application-specific protocols used by various applications. Furthermore, unlike storage I/O, the RX/TX rates in networking are inherently unpredictable. Storage device controllers can regulate RX/TX speeds based on buffer availability. However, network streams involve interactions with numerous external machines beyond the local system's control.

Consider a distributed denial-of-service (DDoS) attack scenario where a machine is flooded with requests. The NIC lacks the ability to control the incoming packet stream. Its influence over network capacity usage is also limited; the Ethernet connection might become saturated with incoming traffic. Consequently, the NIC's primary recourse in such situations is to begin dropping packets. Higher-level protocols then assume the responsibility for retransmissions and ensuring reliable delivery. This kind of uncontrolled, potentially overwhelming input simply does not arise in I/O operations involving storage devices.

Life of a RX Packet

Let's dive deeper into the journey of a network packet as it arrives in a Linux system.

Device Driver Layer

  • The NIC temporarily holds the packet in a buffer memory. Alternatively, for efficiency, the NIC can employ DMA to directly deposit the packet into a pre-arranged circular buffer in system memory, known as a RX ring buffer.
  • Once the packet is in its initial storage, the NIC signals the kernel by generating a maskable interrupt. Interrupts are typically batched (coalesced) to amortize their cost.

Kernel Layer

  • The ISR processes all packets in the NIC's memory or the RX ring buffer and directs them towards their intended destinations within the kernel, as sketched below. For instance, data destined for a user application typically gets placed in the TCP queue, while packets intended only for the system's firewall are routed to the IP queue.
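A minimal sketch of this demultiplexing step, assuming hypothetical helpers (`nic_next_rx_packet`, `ip_queue_put`, `tcp_queue_put`): the handler inspects each packet's protocol field and enqueues it accordingly.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical packet representation and queue helpers; names are illustrative only. */
struct packet { uint8_t ip_proto; /* ... headers and payload ... */ };

extern bool nic_next_rx_packet(struct packet **pkt);  /* from NIC memory / RX ring */
extern void ip_queue_put(struct packet *pkt);         /* IP layer only (firewall, routing) */
extern void tcp_queue_put(struct packet *pkt);        /* data destined for local sockets */

#define IPPROTO_TCP_NUM 6

/* Simplified ISR body: drain the NIC and steer each packet to a kernel queue. */
void rx_demux_isr(void) {
    struct packet *pkt;
    while (nic_next_rx_packet(&pkt)) {
        if (pkt->ip_proto == IPPROTO_TCP_NUM)
            tcp_queue_put(pkt);   /* later copied to a user socket buffer */
        else
            ip_queue_put(pkt);    /* stays at the IP layer */
    }
}
```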

User Layer

  • The user application thread reads the packets from the respective queue to the socket buffer.

Scheduling Overhead

The reception process involves the following scheduling events:
  • Interrupt Scheduling: The initial ISR is scheduled in response to a hardware interrupt; it reads the packets from the NIC's memory or the RX ring buffer into the kernel queues.
  • User Thread Scheduling: Finally, the user application thread is scheduled to read the packets from the kernel queues into the socket buffer.

Data Copying

During reception, the packet's data is typically copied several times:
  • NIC to IP Queue: The raw packet data is copied from the NIC's memory or the RX ring buffer into the IP queue within the kernel.
  • IP to TCP Queue: If the packet is destined for a TCP socket, its data is copied from the IP queue to the corresponding TCP queue.
  • TCP Queue to User Space Socket Buffer: Finally, when the user application reads the data, it's copied from the TCP queue in the kernel's memory space to the application's socket buffer in user space.
Modern zero-copy techniques enhance network performance by bypassing the kernel for data transfer, allowing user-space threads direct access to the RX and TX ring buffers, thus eliminating kernel-to-user space data copies.

Receive Livelock

As previously mentioned:
  • A NIC can uncontrollably receive a large volume of network packets, potentially generating numerous interrupts in quick succession. 
  • And interrupt-driven scheduling takes precedence over kernel scheduling.
Consequently, when network packets flood in, the processor is forced to spend its time servicing interrupts: reading packets and placing them into the appropriate kernel queues. The kernel scheduler never gets a chance to run the corresponding kernel or user threads, so the application experiences starvation.

This scenario can lead to a receive livelock, a state where the system cannot make meaningful progress because its entire processing capacity is consumed by handling receive interrupts.

Ideally, with unlimited CPU resources, an increase in input load would result in a linear increase in throughput. However, in the real world where CPU power is finite, the throughput will initially rise but then plateau and remain at a stable level. This sustained throughput is known as the Maximum Loss-Free Receive Rate (MLFRR). At this rate, packets are not dropped, and the CPU's full potential is utilized.




Beyond the MLFRR, packet loss will occur. While some packet loss is acceptable as long as the system maintains MLFRR throughput, a livelock causes system thrashing, preventing the sustained processing of data at the MLFRR.

Beyond the loss of throughput, a livelock manifests other problems, including:
  • Increased end-to-end latency. While the delay in accepting a packet from the NIC might be reduced, the overall end-to-end latency, which includes packet processing, will increase.
  • Ultimately, transmission throughput is also impacted. This occurs because even sending data requires the scheduling of kernel threads to move packets from kernel queues to the NIC.

Eliminating Livelock

The core system-design and rate-limiting principle we derive from this paper is to discard new work at the earliest possible stage if the system lacks the capacity to complete it.

In the context of receiving network packets, this translates to discarding packets at the NIC level itself when the system's processing capacity is saturated. This proactive dropping prevents unnecessary processing of packets that would ultimately fail to be handled completely, helping us achieve and sustain the MLFRR.

The authors implement this concept through two major techniques, as detailed below:

1. Rate Limiting Interrupts

Interrupts are disabled under the following conditions:
  • When the kernel queue is full. Packets are then dropped directly at the NIC, avoiding any processing overhead.
  • When a certain amount of CPU time has already been spent processing packets. This allows kernel and user threads to make progress.
For the latter, the authors utilize a CPU timer to track the time spent processing packets within the ISR. If this duration becomes excessive, device interrupts are temporarily disabled and re-enabled only after the timer expires. Importantly, the timer interrupt has a higher priority than device interrupts, enabling it to interrupt ongoing interrupt handling. This mechanism ensures a more equitable distribution of CPU time.
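A rough sketch of the two conditions, assuming hypothetical primitives (`disable_rx_interrupts`, `arm_reenable_timer`, `cycles_now`, and so on): interrupts are masked when the kernel queue fills or when the handler has held the CPU past its budget, and a higher-priority timer re-enables them later.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical primitives; real kernels expose equivalents with different names. */
extern void disable_rx_interrupts(void);
extern void enable_rx_interrupts(void);
extern void arm_reenable_timer(uint64_t delay_cycles);   /* higher priority than the NIC IRQ */
extern uint64_t cycles_now(void);
extern bool kernel_queue_full(void);
extern bool nic_next_packet(void **pkt);
extern void enqueue_packet(void *pkt);

#define MAX_RX_CYCLES  200000    /* illustrative budget for one handler invocation */
#define REENABLE_DELAY 500000    /* illustrative back-off before interrupts return */

void rx_isr(void) {
    uint64_t start = cycles_now();
    void *pkt;

    while (nic_next_packet(&pkt)) {
        if (kernel_queue_full()) {
            /* Condition 1: no room downstream; mask interrupts so further
             * packets are dropped cheaply at the NIC. */
            disable_rx_interrupts();
            arm_reenable_timer(REENABLE_DELAY);
            return;
        }
        enqueue_packet(pkt);

        if (cycles_now() - start > MAX_RX_CYCLES) {
            /* Condition 2: time budget exhausted; step aside so kernel
             * and user threads can run. */
            disable_rx_interrupts();
            arm_reenable_timer(REENABLE_DELAY);
            return;
        }
    }
}

/* Fired by the (higher-priority) timer once the back-off expires. */
void reenable_timer_isr(void) {
    enable_rx_interrupts();
}
```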

2. Polling Thread

The authors introduce a spin polling thread. This thread is initiated upon receiving an interrupt, and once active, device interrupts are disabled. The polling is "spinning", meaning it involves continuous checks for new packets. This approach aims for the best of both worlds: low latency during periods of low RX rate (using interrupts) and sustained high throughput during high RX rates (using polling). The polling thread ceases operation and re-enables interrupts once no more packets are pending.

Crucially, the spin polling thread is subject to the regular scheduling decisions of the kernel scheduler, including round-robin scheduling, ensuring that other kernel and user threads continue to make progress.

Alternatively, interrupts could always be disabled and a continuously running polling thread could periodically process packets. However, this can introduce latency, potentially causing delays in packet delivery to applications.
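The interrupt-then-poll pattern can be sketched as follows. This is a simplified illustration rather than the authors' actual code; `wake_polling_thread`, `packets_pending`, and the other helpers are hypothetical. The interrupt merely disables further interrupts and wakes the thread, and the thread polls under the ordinary scheduler until the backlog is gone.

```c
#include <stdbool.h>

/* Hypothetical kernel primitives; the real BSD implementation differs in detail. */
extern void disable_rx_interrupts(void);
extern void enable_rx_interrupts(void);
extern void wake_polling_thread(void);
extern void sleep_until_woken(void);
extern bool packets_pending(void);
extern void process_one_packet(void);

/* Interrupt handler: do almost nothing, then hand off to the polling thread. */
void rx_isr(void) {
    disable_rx_interrupts();     /* no further interrupts while we poll */
    wake_polling_thread();
}

/* The polling thread. It runs at normal thread priority and is scheduled
 * (and preempted) by the kernel scheduler like any other thread, so kernel
 * and user threads keep making progress. */
void polling_thread(void) {
    for (;;) {
        while (packets_pending())
            process_one_packet();
        /* Backlog cleared: return to low-latency interrupt mode and sleep. */
        enable_rx_interrupts();
        sleep_until_woken();
    }
}
```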

Case of BSD Routers

The authors implemented these ideas within the BSD operating system for routers. Routers offer a simpler system to analyze compared to workstations, and they handle a substantial volume of packets for routing and forwarding. A router can be conceptualized as having multiple NICs, leading to multiple I/O queues at various levels.

Unlike general-purpose systems, routers primarily deal with the IP layer. Consequently, there are IP queues at the kernel level for IP packets. 
  • The receive interrupt handler copies the packet from the input NIC's memory to the input IP queue.
  • The IP forwarder reads packets from the input IP queue and places them into the corresponding output IP queue.
  • A transmit interrupt handler then reads from the output IP queue and forwards the packet to the designated output NIC.



As anticipated, the output packet rate decreases when the input packet rate exceeds a certain threshold. This problem is exacerbated by the use of screend, a firewall program designed to block unwanted packets. While the IP forwarder operates in kernel space, screend runs in user space, resulting in a higher rate of packet drops.


Eliminating Livelock

The solution implemented is consistent with the principle discussed earlier: invest processing effort in a packet only if it will be processed to completion.
  • The authors entirely removed the input IP queue.
  • The kernel employs a single polling thread, triggered by interrupts. This polling thread invokes handler procedures:
    • Receive-packet callback procedure: This procedure reads packets directly from the input NIC and places them into the appropriate output IP queues. The IP forwarder logic is integrated into this procedure.
    • Transmit-packet callback procedure: This procedure reads packets from the output IP queue and sends them out through the output NIC.
  • Each of these callback procedures is assigned a fixed quota, defining the maximum number of packets it can process in one invocation.
  • Each queue provides feedback on its occupancy level. When a queue reaches 75% capacity, the receive-packet callback is no longer executed. Instead, either the transmit-packet callback is executed or the system yields CPU time to user-space applications like screend to allow them to process the already queued packets.
  • Finally, the polling thread as a whole periodically yields CPU time to allow other user-space processes to run. A simplified sketch of the resulting loop follows.
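Putting the pieces together, here is a hedged sketch of the modified forwarding loop, assuming hypothetical helpers (`rx_forward_one`, `tx_send_one`, `output_queue_fill_fraction`, etc.): each callback gets a fixed quota, the receive side is skipped once the output (or screend) queue is about 75% full, and the thread periodically yields.

```c
#include <stdbool.h>

#define RX_QUOTA 10              /* packets per invocation; the paper finds 10-20 works well */
#define TX_QUOTA 10
#define QUEUE_HIGH_WATERMARK 0.75

/* Hypothetical helpers standing in for driver and IP-forwarding code. */
extern bool rx_packet_available(void);
extern void rx_forward_one(void);          /* input NIC -> IP forwarder -> output queue */
extern bool tx_packet_available(void);
extern void tx_send_one(void);             /* output queue -> output NIC */
extern double output_queue_fill_fraction(void);   /* 0.0 .. 1.0 */
extern bool work_pending(void);
extern void enable_rx_interrupts(void);
extern void sleep_until_interrupt(void);
extern void yield_to_user_processes(void); /* lets screend etc. drain their queues */

void router_polling_thread(void) {
    for (;;) {
        /* Receive side: only if the output queue has room (feedback), and
         * only up to a fixed quota per round. */
        if (output_queue_fill_fraction() < QUEUE_HIGH_WATERMARK) {
            for (int i = 0; i < RX_QUOTA && rx_packet_available(); i++)
                rx_forward_one();
        }

        /* Transmit side, also bounded by a quota. */
        for (int i = 0; i < TX_QUOTA && tx_packet_available(); i++)
            tx_send_one();

        /* Periodically give user space (e.g., screend) a chance to run. */
        yield_to_user_processes();

        if (!work_pending()) {
            enable_rx_interrupts();
            sleep_until_interrupt();
        }
    }
}
```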

Evaluation

The paper provides a relatively thorough evaluation of BSD routers with the fixes applied. 

Figure 6-3: Polling without a limited quota (infinite quota) performs worse than the unmodified kernel. This is attributed to the receive-packet callback potentially processing a large number of input packets before encountering a full output IP queue and dropping them. The unmodified kernel performs better by dropping packets earlier at the input IP queue. However, polling with a defined quota successfully achieves the goal of sustaining the MLFRR.

Figure 6-4: Without feedback on output queue fullness, the performance is the same as the unmodified kernel's. This is because packets are still likely to be dropped at the input queue of screend. With feedback, however, the polling thread proactively stops processing new input and yields to screend to clear its queue, allowing subsequent input processing.

Figure 6-5: This figure analyzes the impact of different polling quota values on performance (sensitivity analysis). Larger quotas are shown to be less effective and increase the perceived per-packet latency. A packet quota in the range of 10-20 is found to be optimal.

Figure 6-6: This figure presents similar analysis with feedback enabled. The results indicate that with effective feedback mechanisms, the specific polling quota becomes less critical, as the feedback itself limits the amount of work done in each procedure.

Figure 7-1: This figure examines the effect of yielding CPU time to user-space processes after a certain time interval. While it aims to allow user-space processes to make progress, minor discrepancies are observed. These are potentially due to overhead from context switches, the lack of a quota on the transmit-packet procedure's processing (potentially monopolizing the CPU), and possible measurement errors.

Downside of Dropping Packets at NIC

Thus far, we've established that dropping packets at the NIC is generally optimal because it avoids investing resources in packets unlikely to be fully processed. However, this approach has a notable drawback: the drops are largely indiscriminate, and applications lack the ability to specify packet priority.

To address this, IP packets are often tagged with a traffic class. Applications can also provide this traffic class information. The NIC can then read this traffic class and make intelligent dropping decisions based on the priority associated with each class.
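To illustrate the idea, here is a small sketch assuming a hypothetical NIC-firmware hook `should_drop` that sees the packet's IP traffic-class bits: when the RX ring is under pressure, lower-priority classes are dropped first. The thresholds are invented for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical firmware-side state: how full the RX ring currently is (0.0 .. 1.0). */
extern double rx_ring_fill_fraction(void);

/* Drop thresholds per class: higher-priority classes tolerate a fuller ring. */
static const double drop_threshold[4] = {
    0.50,   /* class 0: best effort, dropped first */
    0.70,   /* class 1 */
    0.85,   /* class 2 */
    0.95,   /* class 3: highest priority, dropped last */
};

/* Decide whether to drop an arriving packet based on its traffic class. */
bool should_drop(uint8_t traffic_class) {
    if (traffic_class > 3)
        traffic_class = 3;
    return rx_ring_fill_fraction() > drop_threshold[traffic_class];
}
```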

Paper Review

This paper is exceptionally well-written and accessible, making it easy for even beginners to grasp the problem statement and follow the elegant solution. It offers a clear and insightful walkthrough of the challenges and provides a compelling example of effective problem-solving. Investing time in understanding its intricacies is highly worthwhile.
