
Paper Insights #4 - Firecracker: Lightweight Virtualization for Serverless Applications

This paper, co-authored by Alexandru Agache and distinguished AWS scientist Marc Brooker, along with other researchers, was presented at the esteemed NSDI '20 conference.

Paper Link

Recommended Read: Paper Insights - A Comparison of Software and Hardware Techniques for x86 Virtualization where I introduced several virtualization concepts.

Before we explore the paper's ideas in depth, let's establish some foundational context.

Serverless Computing

Serverless computing is a cloud computing paradigm where the cloud provider dynamically allocates machine resources, especially compute, as needed. The cloud provider manages the underlying servers.

AWS offers a comprehensive serverless stack with a diverse range of products.

This paper primarily focuses on Lambda, a serverless compute platform. However, a foundational understanding of cloud object stores like S3 is also crucial. For those seeking a deeper dive into cloud object stores, I recommend exploring the Delta Lake paper by Databricks.

Before delving deeper into serverless computing, let's briefly review some relevant technologies.

Virtualization

Virtualization is a technology that enables the creation of virtual resources from underlying machine resources. Broadly, there are two types of virtualization:

Full Virtualization

Recommended Read: Paper Insights - A Comparison of Software and Hardware Techniques for x86 Virtualization, where I discuss full virtualization in detail.

Full virtualization is the complete virtualization of the underlying hardware, allowing a software environment, including a guest OS and its applications, to run unmodified. The guest OS, running inside a virtual machine (VM), remains unaware that it is executing within a virtual environment. VMs operate on top of hypervisors, which can be categorized into two types:
  • Type 1 hypervisors: Run directly on bare-metal hardware, such as ESX Server.
  • Type 2 hypervisors: Run on top of another OS, such as VirtualBox.

Note: A hypervisor is also known as a Virtual Machine Monitor (VMM).


Two common techniques for full virtualization include:
  • Binary translation (as in classical virtualization): Translates privileged guest instructions into safer alternatives on-the-fly. Some early x86 virtualization solutions relied heavily on binary translation. The hypervisor intercepted privileged instructions (like those accessing hardware directly) and dynamically translated them into safe instructions that could be executed within the virtual machine's context.
  • Hardware-assisted virtualization: The CPU (e.g. Intel VT-x and AMD-V) itself provides virtualization support. It utilizes specialized CPU instructions to accelerate virtualization processes.

Both Type 1 and Type 2 hypervisors can support both techniques of full virtualization. Specifically, in binary translation, the core functionalities provided by the hypervisor are:

  • Instruction translation: Guest OS instructions are executed in non-privileged mode on the CPU. However, when a privileged instruction (e.g., accessing hardware directly) is encountered, it triggers a trap, transferring control to the hypervisor for translation.
  • Page table translation: Guest OS's page table entries are translated into the corresponding entries within the host OS's page table.

These translations are not required in hardware-assisted virtualization.
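The trap path in binary translation can be illustrated with a toy model in Python (purely illustrative; the opcode names and the `run_guest` helper are made up, and real hypervisors operate on machine instructions, not Python objects):

```python
# Toy model of trap-and-emulate: unprivileged instructions run directly,
# while privileged ones trap into a "hypervisor" handler that emulates them.

PRIVILEGED = {"out", "hlt", "mov_cr3"}  # hypothetical privileged opcodes

def hypervisor_emulate(op, state):
    """Emulate a privileged instruction safely on behalf of the guest."""
    state["traps"] += 1
    if op == "mov_cr3":
        state["cr3"] = "shadow_page_table"  # substitute a safe shadow structure
    # other privileged ops would be emulated here

def run_guest(instructions):
    state = {"traps": 0, "cr3": None, "executed": 0}
    for op in instructions:
        if op in PRIVILEGED:
            hypervisor_emulate(op, state)   # trap into the hypervisor
        else:
            state["executed"] += 1          # direct-execution fast path
    return state

state = run_guest(["add", "load", "mov_cr3", "store", "out"])
print(state["traps"], state["executed"])  # 2 privileged traps, 3 direct
```

The key property the model captures is that the common case (unprivileged instructions) never pays the trap cost; only privileged operations divert into the hypervisor.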

Note: These core functionalities have numerous variations and optimizations across different hypervisor implementations.

OS-level Virtualization

OS-level virtualization isolates a pool of resources within the operating system. This encompasses technologies such as containers (like Docker), jails (like FreeBSD jail), chroot, and virtual environments. The most popular among them today are containers.

In a typical OS, the following are accessible to running applications:
  • Hardware capabilities: CPU and network interface card (NIC) characteristics.
  • Data: Access to files and folders.
  • Peripherals: Connected devices like webcams, printers, and scanners.

Containers, however, restrict these system capabilities.

Linux Containers

Linux containers utilize OS-level virtualization techniques to provide application isolation. Key mechanisms enabling this isolation include:

  • cgroups: Isolate and limit resource consumption (CPU, memory, I/O) for individual containers.
  • namespaces: Provide isolated views of system resources such as process IDs, the network stack, and user IDs.
  • seccomp-bpf: Restrict the system calls available to a container, enhancing security.
  • chroot: Isolate the container's file system view, limiting its access to specific parts of the host system's file system.

Do not confuse "Containers" with "Linux Containers." "Containers" is a broader term encompassing the concept of isolating applications within lightweight, portable units. "Linux Containers" specifically refers to the mechanisms used to create and run isolated applications within the Linux OS. In essence, containers like Docker are built upon the foundation of Linux containers.

Serverless Computing vs. Containers

Serverless computing focuses on event-driven functions, where code executes in response to specific triggers (e.g., API calls, data changes).

Containers, on the other hand, package applications and their dependencies into self-contained units, providing consistent execution environments across different systems.

KVM and QEMU

KVM (Kernel-based Virtual Machine) is a Type 1 hypervisor that operates directly on the host hardware. It leverages hardware-assisted virtualization extensions like Intel VT-x and AMD-V. KVM is tightly integrated within the Linux kernel.

QEMU (Quick Emulator) is a versatile open-source software that emulates various hardware components, including CPUs, disks, and network devices.

When combined, KVM and QEMU create a powerful virtualization solution. QEMU provides the necessary I/O emulation layer, while KVM harnesses the host hardware's virtualization capabilities, resulting in high-performance virtual machines.

With that background in mind, let's jump into details of the paper.

Introduction to Firecracker

Firecracker is a lightweight VMM that runs on top of KVM. It is designed for serverless computing.

Why doesn't Firecracker use Linux containers (or other OS-level virtualization)?

Firecracker prioritizes strong security isolation over the flexibility offered by Linux containers. Virtualization, with its inherent hardware-level isolation, provides robust defense against a wider range of attacks, including sophisticated threats like side-channel attacks. Containers offer some isolation, but can still be vulnerable to these types of attacks.

Why doesn't Firecracker use QEMU?

QEMU is a large, general-purpose emulator, and its broad feature set brings a correspondingly large codebase and attack surface. Firecracker was designed to be lightweight and efficient, enabling higher server density and minimizing overhead for serverless functions.

Firecracker VMM

Essentially, Firecracker is a Type 1 hypervisor built upon KVM. 

The KVM Layer

Firecracker leverages KVM for hardware-assisted virtualization, enhancing performance and security:

  • Guest CPU instructions execute directly on the hardware, with KVM and the virtualization extensions intervening only for privileged operations.
  • Each guest vCPU runs as an ordinary thread on the host, scheduled by the host Linux kernel; the guest kernel schedules its own processes internally.
  • KVM provides robust CPU isolation at the hardware level, preventing unauthorized access to host resources.

The VMM Layer

On top of KVM, instead of using the full-fledged QEMU, Firecracker employs a lightweight VMM implementation (re-using crosvm) that supports a limited set of essential devices:

  • Serial ports for basic input/output.
  • Network interfaces for communication.
  • Block devices for storage.

Serial ports have a lightweight implementation, requiring less than 250 lines of code. All network and block I/O operations within Firecracker are trapped into virtio, a paravirtualized device interface that exposes network and block devices to the guest through a simple queue-based API; Firecracker's virtio implementation is concise, at under 1,400 lines of code.

Rate Limiters

The number of vCPUs and the amount of memory are fixed when a microVM is created, and Firecracker masks the CPUID information exposed to the guest, which helps present a homogeneous compute fleet. This approach is significantly less sophisticated than the cgroups mechanism used by Linux containers, which offers advanced features such as CPU credit scheduling, core affinity, scheduler control, traffic prioritization, performance events, and accounting.

Virtual network and block devices also incorporate in-memory token bucket-based rate limiters. These limiters control the receive/transmit rates for network devices and I/O operations per second (IOPS) for block devices.
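A token bucket of this kind can be sketched in a few lines (an in-memory illustration, not Firecracker's actual Rust implementation; the class and parameter names are made up):

```python
import time

class TokenBucket:
    """In-memory token bucket: refills `rate` tokens/sec, holds at most `burst`."""
    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, cost=1):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # the operation is throttled

bucket = TokenBucket(rate=100, burst=10)        # e.g., 100 IOPS with bursts of 10
passed = sum(bucket.allow() for _ in range(20)) # roughly the burst size passes at once
```

Each network packet or block operation would call `allow()` with an appropriate cost; operations beyond the bucket's budget are delayed or dropped until tokens refill.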

Understanding How Firecracker Works (Figure 3)

Firecracker Architecture

  • MicroManager: The central component of Firecracker, responsible for spawning and managing VMs (or, as the paper calls them, MicroVMs).
  • Each VM is spawned as a process running on the KVM layer.
  • This process itself has subcomponents:
    • Slot: Contains the guest OS kernel. Within the guest kernel, the lambda function runs as a user-space process. The lambda binary includes a λ-shim that facilitates communication with the MicroManager over TCP/IP.
    • Firecracker VMM: Handles all virtual I/O operations.

Lambda Function

  • Slot Affinity: Each lambda function is tightly bound to a specific slot (VM), ensuring consistent execution within the same environment.
  • Event-Driven: Lambdas are designed to be event-driven, activating only when events (such as incoming requests or messages) arrive. This minimizes resource consumption during idle periods.
  • Slot States:
    • Idle: The VM is descheduled, consuming minimal CPU resources but retaining its memory state.
    • Busy: The VM is actively processing events, utilizing both CPU and memory.
    • Dead: The VM has been terminated and removed from the system.
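The slot lifecycle above can be modeled as a small state machine (an illustrative sketch; the `Slot` class and transition table are mine, not Firecracker's source):

```python
# Allowed slot transitions: Idle <-> Busy, and either may be terminated.
TRANSITIONS = {
    "idle": {"busy", "dead"},  # an event arrives, or the slot is reclaimed
    "busy": {"idle", "dead"},  # the event completes, or the slot is killed
    "dead": set(),             # terminal: the VM has been removed
}

class Slot:
    def __init__(self):
        self.state = "idle"    # slots start descheduled, memory retained

    def transition(self, new_state):
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state

slot = Slot()
slot.transition("busy")   # an incoming request activates the slot
slot.transition("idle")   # the request finishes; memory state is retained
slot.transition("dead")   # the slot is eventually reclaimed
```

The important invariant is that "dead" is terminal: a reclaimed slot is never revived, so a new request must be routed to a fresh or idle slot instead.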

This paper explores specific scheduling constraints for lambda functions within the Firecracker environment to enhance performance and resource utilization. These constraints are detailed in section 4.1.1.

Multi-tenancy

A significant portion of lambda functions reside in an idle state, consuming approximately 40% of the available RAM. This characteristic makes machine oversubscription attractive.

The economic advantage of lambda functions stems from their ability to support multi-tenancy.

For example, consider a machine with 10 GB of RAM. Suppose 10 lambda functions are scheduled on it, each sold a 5 GB allocation with an 80% compliance offering, i.e., each lambda is guaranteed at least 4 GB of RAM. The total guaranteed capacity is then 10 × 4 GB = 40 GB against only 10 GB of physical RAM, so the machine is oversubscribed fourfold.

Assuming all lambdas can acquire their guaranteed resources upon request, the efficiency of the system can be calculated as:

Efficiency = Total Guaranteed Capacity / Machine Capacity = 40 GB / 10 GB = 4x

This high level of efficiency contributes significantly to revenue generation.
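The arithmetic can be written out explicitly (the variable names are mine; the numbers are from the example above):

```python
# Oversubscription example: 10 lambdas, each sold 5 GB with an 80% guarantee,
# packed onto a machine with 10 GB of physical RAM.
num_lambdas = 10
allocated_gb = 5            # nominal allocation sold per lambda
guarantee = 0.8             # compliance offering: 80% of the allocation
machine_gb = 10

guaranteed_total = num_lambdas * allocated_gb * guarantee  # 40 GB
efficiency = guaranteed_total / machine_gb
print(efficiency)  # 4.0 -> the machine is oversubscribed fourfold
```

Raising the guarantee (or the per-lambda allocation) without adding physical RAM increases the risk that concurrently busy lambdas cannot all get their guaranteed memory, which is why idle-heavy workloads are the ones that oversubscribe well.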

Fast Boot time

The boot time for a Firecracker microVM is approximately 125ms.

Little's Law: keep at least (creation rate × creation latency) slots ready.

For example, if the desired creation rate is 100 VMs per second and the boot time (creation latency) is 125ms (0.125s), the minimum number of ready VMs in the pool should be:

100 VMs/second * 0.125s = 12.5 VMs

Therefore, we need to maintain at least 13 ready VMs in the pool to accommodate the expected workload.
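This pool-sizing rule is easy to check numerically (the `ready_slots` helper is a hypothetical name):

```python
import math

def ready_slots(creation_rate_per_s, creation_latency_s):
    """Little's Law: slots in flight = arrival rate x latency, rounded up."""
    return math.ceil(creation_rate_per_s * creation_latency_s)

print(ready_slots(100, 0.125))  # 13 ready VMs for 100 VM/s at 125 ms boot time
```

Note how directly boot time feeds the pool size: halving the 125 ms boot latency would halve the number of pre-warmed VMs the scheduler must keep around.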

Evaluation

Memory Overhead

This analysis focuses solely on the memory consumption of the VMM process within each MicroVM, as this constitutes the primary overhead of the MicroVM. Given that all VMMs are initiated as processes, the code is loaded into shared memory segments and consequently shared across all VMMs. This static overhead is excluded from the evaluation. Therefore, the analysis only considers the sum of non-shared memory segments, as identified by the pmap command.
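The measurement can be sketched as follows, summing the writable (non-shared) segments from `pmap -x`-style output. The sample text below is fabricated, and real pmap column layouts vary across procps versions:

```python
# A minimal sketch of the per-VMM overhead measurement: shared, read-only
# segments (code) are excluded; only private, writable mappings are counted.
SAMPLE = """\
0000560000000000    1024     512     0 r-x-- firecracker
0000560000200000     256     256   256 rw--- firecracker
00007f0000000000    2048    2048  2048 rw---   [ anon ]
00007f8000000000    4096     128     0 r-x-- libc.so.6
"""

def private_kb(pmap_text):
    total = 0
    for line in pmap_text.splitlines():
        addr, kbytes, rss, dirty, mode, *rest = line.split()
        if "w" in mode:            # writable mappings are private to this VMM
            total += int(kbytes)
    return total

print(private_kb(SAMPLE))  # 2304 KB of non-shared segments in this sample
```

In the fabricated sample, the executable code of `firecracker` and `libc` is excluded because it is mapped read-only and shared across all VMM processes, matching the paper's reasoning for ignoring that static overhead.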

Firecracker demonstrates exceptional lightweight characteristics, not only in terms of binary size but also due to its minimal per-VM RAM overhead, which is remarkably low at 3 MB.

I/O Performance

Like QEMU, Firecracker VMM acts as an I/O emulator, handling all I/O operations within the virtual machine. This allows for software-level rate limiting of I/O traffic. 

Block Device I/O Performance

Block devices (such as SSDs and HDDs) operate at the block level. In the Firecracker VMM, block reads and writes are serialized, and this serial access significantly impacts performance compared to QEMU or bare-metal systems (increased latency and reduced throughput).

Large-block writes appear efficient, but only because Firecracker lacks flush-on-write: completed writes may still reside in the host's page cache rather than on stable storage. Applications that require durability must therefore issue explicit flushes, which introduces additional latency.
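The durability trade-off can be seen in a generic POSIX-style sketch (not Firecracker code; `durable_write` is a hypothetical helper): a write is only guaranteed to be on stable storage after an explicit `fsync`, and that flush is where the latency goes.

```python
import os

def durable_write(path, data):
    """Write `data` to `path` and return only once it is on stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, data)  # lands in the page cache; fast, but not yet durable
        os.fsync(fd)        # the explicit flush step that adds the latency
    finally:
        os.close(fd)
```

Skipping the `os.fsync` call makes the write appear much faster, which is exactly the effect behind Firecracker's efficient-looking large-block writes.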

Network Device Performance

Emulated network devices within Firecracker also exhibit lower throughput compared to dedicated hardware.

Security

The paper discusses several types of attacks that can potentially affect Lambda functions:

  • Side-Channel Attacks: These attacks exploit unintended information leakage through channels other than the primary data flow.

    • Cache Side-Channel Attacks: Multiple VMs might share the same physical cache. By analyzing cache access patterns, an attacker could potentially infer sensitive information, such as cryptographic keys.
    • Timing Side-Channel Attacks: These attacks involve measuring the execution time of another process to glean information.
    • Power Consumption Side-Channel Attacks: These attacks attempt to extract information by monitoring the power consumption of a target process.
  • Meltdown Attacks: These attacks exploit a race condition in modern CPUs that allows a process to bypass privilege checks and potentially read the memory of other processes. The race condition happens between memory access and privilege checking during instruction processing.

  • Spectre Attacks: These attacks use timing of speculative execution and branch prediction to infer sensitive data like cryptographic keys and passwords.

  • Zombieload Attacks: By carefully timing the execution of specific instructions, attackers can observe the side effects of "zombie loads" (a type of memory access) and potentially extract sensitive information.

  • Rowhammer Attacks: These attacks exploit a physical limitation of DRAM where repeated access to a specific memory row can cause unintended bit flips in neighboring rows, leading to data corruption.

The paper also provides the following remedies:

  • Disabling Simultaneous Multithreading (SMT, a.k.a. Hyper-Threading): SMT allows multiple instruction streams to execute concurrently on a single physical core. Disabling SMT can help reduce the potential for one process to observe the activity of another, thereby mitigating certain types of side-channel attacks.

  • Kernel Page-Table Isolation (KPTI): KPTI isolates the kernel's page table from user-space page tables. This separation helps to prevent Meltdown attacks, which exploit vulnerabilities in memory access control mechanisms.

  • Indirect Branch Prediction Barriers (IBPB): IBPB prevents previously executed code from influencing the prediction for future indirect branches. This helps to mitigate speculative execution attacks like Spectre.

  • Indirect Branch Restricted Speculation (IBRS): IBRS is a hardware-based mitigation technique that aims to restrict the impact of speculative execution on sensitive data.

  • Cache-Based Attacks: Techniques like Flush + Reload and Prime + Probe exploit cache sharing between processes. Mitigating these attacks often involves avoiding shared files and implementing appropriate memory access controls.

  • Disabling Swap: Disabling swap prevents sensitive memory pages from being written out to persistent storage, where they could linger and be recovered later, thereby reducing the attack surface.

Paper Review

I will admit that this paper presented a significant challenge, requiring multiple days of careful reading to fully grasp its concepts. The material is quite dense and presents a large amount of information in a concise format. To fully comprehend the paper, I had to revisit several key concepts from my operating systems coursework.

While the core idea is not entirely novel within the realm of virtualization technologies, the paper effectively demonstrates its implementation and potential. I highly recommend this paper to anyone interested in the intersection of operating systems and distributed systems.

Now I know why Azure Functions, back in 2016, had such a terrible response time!
