
Paper Insights #2 - A Comparison of Software and Hardware Techniques for x86 Virtualization

This paper, authored by the esteemed Keith Adams and Ole Agesen at VMware, was presented by Professor Dawson Engler at Stanford's CS240, where he offered valuable insights. Professor Engler, who knew Ole Agesen personally, described him as exceptionally detail-oriented. Written in 2006, the paper discusses software virtualization and compares it with hardware virtualization.

Software virtualization was already in widespread use and popular at the time. It was a highly complex undertaking, involving approximately 100,000 lines of code; notably, in that era, only a small team of 5-10 engineers at VMware possessed a comprehensive understanding of the VMware Workstation's binary translator.

Paper Link

This article assumes the reader's basic familiarity with microprocessors and operating systems.

Let's begin with some basic concepts.

Microprocessor

The world of microprocessors is vast and complex. Let's distill the core concepts into a concise overview.

At the heart of every computer lies the microprocessor, the engine responsible for executing instructions and performing computations. In essence, it's the "brain" of the system. In this article, we will use the terms microprocessor and CPU interchangeably.

Each instruction consists of two fundamental parts: the opcode, which specifies the operation to be performed (e.g., addition, subtraction), and one or more operands, which represent the data to be manipulated. While instructions are stored and processed in binary form, assembly language, a low-level symbolic representation, provides a more human-readable format.

Instruction Set Architecture (ISA)

The microprocessor's "manual", known as the Instruction Set Architecture (ISA), defines the complete set of instructions it can execute. Think of it as the control panel of a machine, outlining every possible action. For example, the x86 ISA is a widely adopted standard, supported by a diverse range of processors, including the 8086 (the original x86 microprocessor), Pentium, Celeron, Xeon, Atom, and Intel Core series.

This article focuses on SISD microprocessors, which execute a single instruction on a single data stream. This contrasts with GPUs, which utilize SIMD architecture to apply a single instruction to multiple data streams.

Hardware Architecture

The term microprocessor often refers to a single processing core, the fundamental unit of execution. One or more cores are packaged into a chip that plugs into a socket on the motherboard. While most personal computers use a single socket, server-grade processors like Intel Xeon frequently reside in multi-socket systems.


A microprocessor's internal architecture comprises several key components:
  • Execution Unit: This is where the actual computations occur, primarily through the Arithmetic Logic Unit (ALU), which performs arithmetic and logical operations.
  • Control Unit: The control unit acts as the orchestrator for the flow of instructions and data within the microprocessor. It fetches instructions from memory, decodes them, and generates the necessary control signals to coordinate the other components.
  • Registers: Registers are small, high-speed storage locations within the microprocessor that hold data and instructions temporarily during processing. They are crucial for rapid access and manipulation of frequently used data.

Fetch-Decode-Execute Cycle

The execution of instructions on a single microprocessor follows the fetch-decode-execute cycle:

  1. Fetch: The control unit retrieves an instruction from memory.
  2. Decode: The control unit deciphers the instruction to determine the operation to be performed.
  3. Execute: The execution unit performs the specified operation.

To enhance performance, modern microprocessors employ pipelining, a technique that allows multiple instructions to be processed concurrently in different stages of the cycle. This overlap significantly increases the throughput of the processor.

CPU Cache

Caches are small, high-speed memory units that store frequently accessed data and instructions, enabling faster retrieval. There are typically multiple levels of cache:

  • L1 Cache: The fastest and smallest level of cache, dedicated to each individual core within the microprocessor. It is often split into L1 instruction cache (L1i) and L1 data cache (L1d). L1 cache sizes vary depending upon the CPU, but are often in the tens of KBs.
  • L2 Cache: A larger and slightly slower cache than L1, often shared between cores on the same chip. L2 cache sizes also vary greatly, but are commonly in the hundreds of KBs, or multiple MBs.
  • L3 Cache: The largest and slowest level of cache, typically shared by all cores on the socket. L3 cache sizes can range from several MBs to tens of MBs.

By utilizing cache memory, microprocessors can significantly reduce the time required to access data and instructions, thereby improving overall performance.

To prevent data inconsistencies when a single process runs on multiple CPU cores, each with its own local cache, cache coherence is vital. This complex problem is resolved by hardware-level mechanisms.

CPU Registers

CPU registers are essential components that hold the data and instructions the CPU is actively working on. There is a limited set of registers available. Let's go through some of the registers that x86 supports.

General-Purpose Registers

These are versatile registers used for various operations, including arithmetic, logical, and data movement. They are often labeled as R0, R1, R2, etc., or have specific names like %a, %b, %c, %d in the x86 architecture. 

The naming conventions of x86 registers are deeply rooted in the architecture's evolution. Register %a, for instance, originated as the accumulator register, a central component optimized for arithmetic operations. The specific roles of these registers can differ considerably based on the calling convention and compiler. Nowadays, %a is commonly designated for storing function return values, though this isn't a strict, universally applied rule.

In addition to the more historically named registers, the x86 architecture also incorporates general-purpose registers numbered from %r8 to %r15, which offer increased flexibility for modern programming tasks.

In x86 architecture, registers are often identified by prefixes indicating their size:
  • The prefix r (with suffix x for the named ones) denotes a 64-bit register (e.g., %rax, %rbx, %rcx, %rdx).
  • The prefix e (with suffix x) for named ones, and the suffix d for numbered ones, signify the 32-bit version (e.g., %eax, %ebx, %ecx, %edx, %r8d, %r9d, %r10d, %r11d, %r12d, %r13d, %r14d, %r15d). It is important to understand that the 32-bit registers are simply the lower 32 bits of their corresponding 64-bit registers.
  • The suffix x alone for named ones, and the suffix w for numbered ones, denote the 16-bit version (e.g., %ax, %bx, %cx, %dx, %r8w, %r9w, %r10w, %r11w, %r12w, %r13w, %r14w, %r15w).
  • Finally, the suffix l for named ones and the suffix b for numbered ones denote the 8-bit version (e.g., %al, %bl, %cl, %dl, %r8b, %r9b, %r10b, %r11b, %r12b, %r13b, %r14b, %r15b).
Note: There is also a suffix h that corresponds to the upper 8 bits of the lower 16 bits.

64-bit | Lower 32 bits | Lower 16 bits | Lower 8 bits
======================================================
%rax   | %eax          | %ax           | %al
%rbx   | %ebx          | %bx           | %bl
%rcx   | %ecx          | %cx           | %cl
%rdx   | %edx          | %dx           | %dl
%r8    | %r8d          | %r8w          | %r8b
%r9    | %r9d          | %r9w          | %r9b
%r10   | %r10d         | %r10w         | %r10b
%r11   | %r11d         | %r11w         | %r11b
%r12   | %r12d         | %r12w         | %r12b
%r13   | %r13d         | %r13w         | %r13b
%r14   | %r14d         | %r14w         | %r14b
%r15   | %r15d         | %r15w         | %r15b

Index Registers

Index registers are frequently used in conjunction with loop counters to iterate through data structures. Common examples include the %si (Source Index) and %di (Destination Index) registers, often used together with the %bp (Base Pointer) register. In the 32-bit x86 architecture, their extended counterparts are %esi, %edi, and %ebp; in the 64-bit architecture, they are %rsi, %rdi, and %rbp.

Special-Purpose Registers

  • Program Counter: Holds the address (an offset within the code segment) of the next instruction to be executed. Instructions are fetched from this address. In the x86 architecture, the PC is known as the Instruction Pointer (%ip, %eip, %rip).
  • Stack Pointer (%sp, %esp, %rsp): Points to the top of the stack in memory, used for function calls and temporary data storage.
  • Status Register (or Flags Register): Contains flags that indicate the status of the CPU and the results of operations. Common flags include:
    • Carry Flag (C): Indicates a carry or borrow in an arithmetic operation.
    • Zero Flag (Z): Indicates that the result of an operation is zero.
    • Overflow Flag (O): Indicates an overflow in an arithmetic operation.
    • Sign Flag (S): Indicates the sign of the result.

Segment Registers

A program's memory space is organized into distinct segments, each serving a specific purpose. One crucial segment is the code segment, which holds the program's executable instructions.

Beyond the code segment, other essential memory segments include:

  • Data segment: This segment stores initialized global and static variables. These variables have a fixed memory location throughout the program's execution.
  • BSS segment (Block Started by Symbol): This segment holds uninitialized global and static variables. These variables are typically initialized to zero or null before the program begins execution.
  • Stack segment: This segment manages the dynamic allocation of memory for local variables and function call information. It operates on a Last-In, First-Out (LIFO) principle.
  • Heap segment: This segment is used for dynamic memory allocation during program execution, allowing programs to request and release memory as needed. This is where dynamically allocated memory from functions like malloc() or new are stored.
Corresponding to each of these segments, we have registers storing their addresses:
  • %cs: Points to the code segment. It is used in conjunction with the IP to determine the memory address of the next instruction to be executed. The CS register also encodes the current privilege level (described later) of the CPU.
  • %ds: Points to the data segment.
  • %es: Points to extra data segment. Many compilers use extra data segment for strings.
  • %fs and %gs: General-purpose segment registers that can point to any data segment. Their most common use is to point to thread-local storage.
  • %ss: Points to the stack segment.
There are no dedicated registers for BSS and heap.

Descriptor Table Registers

Segment information is stored in descriptor tables:

  • Global Descriptor Table (GDT): This table, managed by the operating system, contains memory segment information for the entire system.
  • Local Descriptor Table (LDT): Each process maintains its own LDT, which stores memory segment information specific to that process.

The base addresses of these tables are stored in the following registers:

  • The GDTR (Global Descriptor Table Register) holds the base address of the GDT.
  • The LDTR (Local Descriptor Table Register) holds a selector that points to an entry in the GDT, which in turn points to the base address of the current process's LDT.

Additionally, the Interrupt Descriptor Table (IDT) stores information about interrupt handlers. The base address of the IDT is held in the IDTR (Interrupt Descriptor Table Register).

Control Registers

Control registers are used to control the behavior of the microprocessor. With different settings, the microprocessor may behave differently. There are control registers %cr0 through %cr15, although not all of them are implemented. Some of the commonly used control registers are:

  • %cr0: Controls various processor modes and flags, including protected mode, paging, and enabling/disabling features.
  • %cr2: Stores the linear address that caused a page fault.
  • %cr3: Holds the address of the page table (a.k.a. Page Table Register), essential for memory paging.
  • %cr4: Enables various processor extensions and features, such as PAE and virtualization extensions.
  • %cr8: Task Priority Register, used for prioritizing external interrupts.

Addressing Modes

Addressing modes are crucial aspects of CPU architecture that determine how the operands of an instruction are accessed. They define how the effective address of an operand is calculated. There are several different types of addressing modes.

Immediate Addressing

The operand itself is directly embedded within the instruction. For example, the following instruction loads the immediate value 10 into %rax.

mov %rax, $10

Register Addressing

The operand is located in a CPU register. For example, the following instruction copies the value of %rbx into %rax.

mov %rax, %rbx

Direct Addressing (Absolute Addressing)

The instruction contains the actual memory address of the operand. For example, the following instruction moves the value at memory address 1000 into %rax.

mov %rax, [$1000]

Indirect Addressing

The instruction specifies a register that holds the address of the operand. For example, the following instruction moves the value from the memory address held in %rbx into %rax.

mov %rax, [%rbx]

Indexed Addressing

The effective address is calculated by adding an index register's value to a base address (which can be a constant or another register). For example, the following moves the value from the memory address calculated by adding %rbx (base) and %rsi (index), into %rax.

mov %rax, [%rbx + %rsi]

Base-Indexed with Displacement Addressing

The effective address is calculated by adding the values of a base register, an index register, and a displacement (constant offset). For example, the following moves the value from the memory address calculated by adding %rbx, %rsi, and the displacement 10, into %rax.

mov %rax, [%rbx + %rsi + $10]

PC-Relative Addressing

The effective address is calculated relative to the current PC. For example, the following instruction makes the program jump forward by an offset of 10 from the current instruction address.

jmp +$10

Example Program: isPrime

Let's go through an example program to understand these concepts. I am reusing the isPrime program that appears in the paper.

1. int isPrime(int a) {
2.   for (int i = 2; i < a; i++) {
3.     if (a % i == 0) return 0;
4.   }
5.   return 1;
6. }

===================================

1.  isPrime:   mov %ecx, %edi
2.             mov %esi, $2
3.             cmp %esi, %ecx
4.             jge prime

5.  nexti:     mov %eax, %ecx
6.             cdq
7.             idiv %esi 
8.             test %edx, %edx
9.             jz notPrime
10.            inc %esi
11.            cmp %esi, %ecx
12.            jl nexti

13.  prime:    mov %eax, $1
14.            ret 

15.  notPrime: xor %eax, %eax
16.            ret


The isPrime function determines if a given number is prime by iterating from 2 up to one less than the number, checking for divisibility at each step. If any number divides the input, it's not prime; otherwise, it's prime.

For this discussion, I am assuming that the reader understands the fundamental mechanics of function calls. Before a function is called, the caller's register values are saved to the stack. Function arguments are passed either via the stack or, for a limited number of arguments, in registers. The function's return value is placed in a specific register, and after the function finishes, the stack is popped to restore the caller's register state. The compiler is responsible for generating all of this linkage between functions.

Coming back to the example, the function's logic is divided into four labeled blocks:
  • isPrime: This initial block handles the special case where the input number is 2 (or smaller).
    • %ecx is set to %edi, where %edi holds the function's argument "a".
    • %esi (the loop counter) is initialized to 2.
    • %esi is compared to %ecx.
    • If %esi is greater than or equal to %ecx (i.e., the input is 2 or smaller), the program jumps to the prime label; otherwise, it proceeds.
  • nexti:
    • %eax is set to %ecx (the input number).
    • cdq sign-extends %eax into %edx, preparing for signed division.
    • idiv performs the division of %edx:%eax by %esi (the loop counter), with the remainder in %edx. The remainder will be negative for negative input.
    • The remainder is checked for zero.
    • If the remainder is zero (meaning %ecx is divisible by %esi), the program jumps to the notPrime label.
    • %esi is incremented (loop counter).
    • %esi is compared to %ecx.
    • If %esi is less than %ecx, the program jumps back to the nexti label (loop continuation).
  • prime:
    • %eax is set to 1 (indicating prime).
    • The stack is popped (return).
  • notPrime:
    • %eax is set to 0 (indicating not prime).
    • The stack is popped (return).
Note that on line 15, xor %eax, %eax is used instead of mov %eax, $0. The xor form has a shorter encoding and is recognized by the CPU as a zeroing idiom, so historically (and sometimes still) it can be faster than the equivalent mov.

Operating Systems

An operating system (OS) can be viewed as a layer of software that possesses privileged access to the CPU's hardware. The OS acts as an intermediary, orchestrating the execution of user programs. When a user program requires system resources or performs privileged operations, it makes a request to the OS. The OS, in turn, grants or denies these requests, ensuring system stability and security. Once the OS completes its task, it yields control back to the program.

Protection Ring

To enforce security and manage access to hardware resources, modern CPUs implement a system of privilege levels, commonly known as protection rings. These rings define the level of access that software has to the CPU and its peripherals. The most common arrangement is:

  • Ring 0 (Kernel Mode): This is the highest privilege level, reserved for the OS kernel. The kernel, being the core of the OS, has unrestricted access to all CPU instructions and hardware resources.
  • Ring 1 & 2 (Device Driver Mode): These rings are typically used for device drivers, which are software components that enable the OS to communicate with hardware devices. Using different rings for drivers allows for further fine grained control of driver access. Although these rings exist, Ring 2 is rarely used in modern operating systems.
  • Ring 3 (User Mode): This is the lowest privilege level, where user applications run. Applications in Ring 3 have limited access to hardware resources and are restricted from executing privileged instructions. Any attempt to execute a privileged instruction from Ring 3 results in a trap, which transfers control to the OS kernel.

The OS kernel, residing in Ring 0, holds the ultimate authority over the system. It can execute any CPU instruction, whereas applications in Ring 3 are restricted from accessing certain privileged instructions.

The protection ring level in which the CPU currently operates is also called Current Privilege Level (CPL) and is encoded in the %cs register as described before.

Trap

When an application in CPL 3 attempts to execute a privileged instruction, a trap occurs. This trap is a hardware-generated interrupt that forces the CPU to switch to kernel mode (CPL 0) and execute a kernel routine, allowing the OS to handle the event and terminate the program if required.

System Calls

A system call, like a trap, also causes the CPU to switch to kernel mode. Internally, it is implemented as a trap to the kernel. The kernel then executes the requested operation on behalf of the user application.

User applications are often built using libraries, which provide pre-written code for common tasks. These libraries can be linked to the application in two ways:

  • Static Linking: The library code is incorporated directly into the application's executable during compilation. This results in a larger executable but eliminates the need for external library files at runtime.
  • Dynamic Linking: The library code is stored in separate files (e.g., .dll files in Windows, .so files in Linux) and loaded into memory at runtime. This allows for smaller executable files and enables multiple applications to share the same library code.

Most core OS functionalities, such as input/output and network access, are exposed through dynamically linked system libraries (for example, libc on Linux). These libraries encapsulate the underlying system calls.
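As a concrete illustration, consider how a Linux program can reach the same kernel service either through the dynamically linked libc wrapper or by issuing the trap directly. This is a minimal sketch assuming glibc, which provides both the write() wrapper and the generic syscall() interface:

// Two routes to the same kernel service: the libc wrapper write(),
// and a direct trap via syscall() with the syscall number SYS_write.
#include <unistd.h>
#include <sys/syscall.h>
#include <string.h>

int main(void) {
    const char *msg = "hello via libc\n";
    write(STDOUT_FILENO, msg, strlen(msg));                // dynamically linked wrapper

    const char *raw = "hello via raw syscall\n";
    syscall(SYS_write, STDOUT_FILENO, raw, strlen(raw));   // explicit trap into the kernel
    return 0;
}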

Virtual Memory

Virtual memory provides programs with an abstraction of physical memory, creating the illusion of a vast, contiguous address space.

Operating systems manage this virtual address space, translating virtual addresses to physical addresses. A virtual address can map to any location: physical memory, secondary storage (like a hard drive), or even to a non-existent location, as the virtual address space may exceed the available physical resources.

Paging

Direct byte-level mapping of virtual to physical addresses would require an impractically large translation table. To address this, memory is divided into fixed-size units called pages, typically 4 KB.

A page table is a data structure used for translating virtual page numbers (VPNs) to physical page numbers (PPNs) in main memory:

Page Table: VPN -> PPN

Historically, operating systems used only fixed-size pages, which still resulted in large page tables. To reduce the overhead, the concept of superpages (or huge pages) was introduced. Superpages group multiple contiguous pages into a single, larger page, reducing the number of entries in the page table.

Apart from the physical page number, each entry in the page table also contains several control bits (a minimal sketch in C follows the list):

  • Present Bit: Indicates whether the corresponding page is currently present in physical memory. If this bit is not set, a page fault occurs, and the operating system must load the page from secondary storage.
  • Protection Bit: Controls whether the page can be read from or written to. This allows the operating system to enforce memory protection.
  • Mode Bit: Determines whether the page can be accessed by user-level programs or only by the operating system kernel.
  • Dirty Bit: Indicates whether the page has been modified since it was loaded into memory. This is used by the operating system to determine whether a page needs to be written back to disk.
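To make the translation concrete, here is a minimal sketch of a single-level page table with 4 KB pages. The entry layout and table shape are purely illustrative and do not match real x86 paging structures, which are multi-level:

#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT 12                    // 4 KB pages
#define PAGE_SIZE  (1u << PAGE_SHIFT)

// Hypothetical page table entry; real x86 PTEs have a different layout.
typedef struct {
    uint64_t ppn      : 40;  // physical page number
    uint64_t present  : 1;   // present bit: page is in physical memory
    uint64_t writable : 1;   // protection bit
    uint64_t user     : 1;   // mode bit: accessible from user level
    uint64_t dirty    : 1;   // dirty bit: modified since it was loaded
} pte_t;

// Translate a virtual address using a flat (single-level) table.
// Returning false models a page fault (entry not present).
bool translate(const pte_t *table, uint64_t vaddr, uint64_t *paddr) {
    uint64_t vpn    = vaddr >> PAGE_SHIFT;        // virtual page number
    uint64_t offset = vaddr & (PAGE_SIZE - 1);    // offset within the page
    pte_t e = table[vpn];
    if (!e.present)
        return false;                             // would trap to the OS
    *paddr = ((uint64_t)e.ppn << PAGE_SHIFT) | offset;
    return true;
}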

Storing Page Table

The page table itself is stored by the OS in memory. The physical address of the current page table is held in the Page Table Register (%cr3).

Swapping

Accessing a page within main memory is relatively fast. The CPU has a direct memory bus that facilitates rapid data transfer. When data is read from main memory, it's fetched and often cached in CPU cache lines.

Conversely, accessing a page located in secondary storage (like a hard drive or SSD) is significantly slower. The CPU does not have a direct connection to secondary storage. Therefore, a page must first be transferred from secondary storage to main memory before it can be accessed by the CPU.

When main memory becomes full (i.e., there are no free pages), the operating system employs swapping: inactive or less frequently used pages are moved from main memory to secondary storage, freeing up space for active pages. When a swapped-out page is later accessed, a page fault occurs and the page is brought back into memory.

Opposite to swapping is the concept of caching, where disk pages that normally reside in secondary storage are brought into main memory to speed up reads and writes to files. Note that caching is different from demand paging, which brings back pages that are supposed to reside in main memory but were previously swapped out.

It's crucial to understand that the page table only contains entries for pages that are currently mapped to physical memory. Pages that exist only in secondary storage or have not yet been allocated will not have corresponding entries in the page table.

Memory Management Unit (MMU)

The Memory Management Unit (MMU) is a dedicated hardware component integrated into the CPU, playing a pivotal role in virtual memory management.

Translation Lookaside Buffer (TLB)

It might seem that every virtual address translation would require the OS to intervene, as it alone manages page table entries. This would entail a context switch on every memory access, causing significant performance overhead due to the expensive process of saving and restoring CPU state, and flushing CPU caches. To mitigate this, CPUs implement a hardware cache called the Translation Lookaside Buffer (TLB).

The TLB acts as a high-speed translation cache, storing recently used virtual-to-physical page mappings. When the CPU accesses a virtual address, the MMU first checks the TLB. If a matching translation is found (a TLB hit), the physical address is retrieved directly, bypassing the need for a page table lookup.

However, the TLB has a limited capacity, typically a few thousand entries at most. If the required translation is not found (a TLB miss) and the hardware cannot walk the page table itself, a fault is generated. This fault triggers a trap, transferring control to the OS. The OS then performs a page table walk to locate the physical address, updates the TLB with the new mapping, and resumes the process.
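To picture the mechanism, here is a toy software model of a direct-mapped TLB. It is purely illustrative: real TLBs are hardware structures with set associativity, ASIDs, and separate instruction/data arrays:

#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 64                 // hypothetical, small direct-mapped TLB

typedef struct {
    uint64_t vpn;
    uint64_t ppn;
    bool     valid;
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

// On a hit the PPN comes straight from the TLB; on a miss the (much
// slower) page table walk runs and the entry is refilled.
bool tlb_lookup(uint64_t vpn, uint64_t *ppn) {
    tlb_entry_t *e = &tlb[vpn % TLB_ENTRIES];   // direct-mapped indexing
    if (e->valid && e->vpn == vpn) {
        *ppn = e->ppn;                          // TLB hit
        return true;
    }
    return false;                               // TLB miss: walk the page table, then refill
}

void tlb_refill(uint64_t vpn, uint64_t ppn) {
    tlb_entry_t *e = &tlb[vpn % TLB_ENTRIES];
    e->vpn = vpn;
    e->ppn = ppn;
    e->valid = true;
}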

Side Note: TLB, Caches & Context Switches

Upon a process's departure from the CPU, all TLB entries associated with that process are invalidated (flushed). This is crucial to prevent subsequent processes from accessing the previous process's memory pages through stale TLB entries.

Individual stale entries can be invalidated with the invlpg instruction; a full flush typically happens when %cr3 is reloaded during the context switch.

Because TLB entries are cleared, they must be repopulated as the new process executes. This TLB flushing and subsequent repopulation contribute significantly to the overhead of context switches between processes. On top of that, CPU caches also become ineffective after a context switch, further impacting performance.

Consequently, many operating systems employ CPU affinity, assigning processes to dedicated CPU cores, to minimize context switching and improve efficiency.
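As a small, Linux-specific illustration (not something the paper discusses), a process can pin itself to one core with sched_setaffinity:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                          // restrict this process to CPU core 0

    if (sched_setaffinity(getpid(), sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to core 0\n");              // the scheduler will now keep us on core 0
    return 0;
}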

Hardware Walking Page Tables

In addition to OS-driven page table walks during page faults, modern MMUs often feature hardware walking capabilities. This means the MMU itself can automatically traverse the page table structure to locate missing translations.

The physical address of the current process's page table is stored in a special CPU register, the Page Table Register (PTR). On x86, this is register %cr3. This register allows the MMU to locate the active page table during address translation.

When a TLB miss occurs, the MMU's hardware logic takes over, fetching the necessary page table entries from memory. This process eliminates the overhead of a context switch to the kernel, significantly speeding up address translation. The MMU is designed to understand the page table structure, allowing it to efficiently navigate through multi-level page tables.

Virtualization

Virtualization essentially involves creating an illusion for a guest operating system, making it believe it's running directly on physical hardware. The goal is complete transparency, where the guest OS cannot distinguish its virtualized environment from a native one. This is achieved by emulating hardware behavior and maintaining consistency through shadow data structures.

Strawman

A basic virtualization approach utilizes a trap-and-emulate model. In this setup, the guest OS itself runs as a standard application within the host OS, executing in CPL 3. Consequently, any privileged instruction executed by the guest OS, such as I/O operations, triggers a trap, transferring control to the host OS. The host OS then emulates the requested operation on behalf of the guest.

However, I/O is just one aspect of virtualization. Memory management presents another challenge. The guest OS is allocated a limited memory space, which, being a process within the host, also has its own virtual address space. This necessitates a two-tiered address translation: 

  • Applications within the guest have their own page tables, maintained by the guest OS, which translate their virtual addresses into the guest's notion of physical addresses.
  • The guest's "physical" addresses must then be translated again into the actual physical addresses of the host.
This double translation introduces significant performance overhead.

While this strawman model provides a functional virtualization framework, it's inherently inefficient.

Full vs. OS Virtualization

Virtualization can itself be categorized into the following types:

Full virtualization

Full virtualization is the complete virtualization of the underlying hardware, allowing software environments, including a guest OS and its applications, to run unmodified. The guest OS, running inside a virtual machine (VM), remains unaware of its execution within a virtual environment. VMs operate on top of hypervisors, which can be categorized into two types:

  • Type 1 hypervisors: Run directly on bare-metal hardware, such as ESX Server.
  • Type 2 hypervisors: Run on top of another OS, such as VirtualBox.

Note: A hypervisor is also known as a Virtual Machine Monitor (VMM).

OS-level Virtualization

OS-level virtualization isolates a pool of resources within the operating system. This encompasses technologies such as containers (like Docker), jails (like FreeBSD jail), chroot, and virtual environments. The most popular among them today are containers.

For the purposes of this article, the term "virtualization" is used to denote full virtualization only.

Recommended Read: Paper Insights - Firecracker: Lightweight Virtualization for Serverless Applications for more on OS-level virtualization technique.

Classical Virtualization

To ensure clarity in our discussion of virtualization techniques, let's establish key definitions: Host will consistently refer to the VMM, which can operate either as an application on a host OS (Type 2) or directly on bare metal (Type 1). Guest OS denotes the guest OS running on top of the VMM, while guest process refers to any process executing within the guest OS environment.

VMM Characteristics

Popek and Goldberg defined three essential characteristics of a VMM for classical virtualization:

  • Fidelity: The guest OS software must execute identically to its execution on the native hardware, requiring no modifications.
  • Performance: A significant majority of guest OS instructions should execute directly on the CPU, minimizing traps.
  • Safety: All hardware accesses must be mediated by the VMM.

Refining Strawman

Let's refine our basic virtualization model, still using the trap-and-emulate approach. Guest instructions execute in a deprivileged mode, and privileged instructions trigger traps, transferring control to the VMM for emulation.

Shadow Page Table

One key optimization involves the page table translation process. Instead of double translation, we introduce a shadow page table that directly maps the virtual address space of a guest process to machine (host) physical page numbers.

  • Guest Page Table: VPN -> PPN
  • Shadow Page Table: VPN -> MPN
Here, PPN refers to the physical page number as the guest knows it, and MPN refers to the machine physical page number, i.e., the physical page number known to the host.

The shadow page table is used for actual address translation during guest process execution, eliminating the overhead of double translations. The guest OS remains aware only of its own guest page table.

Hidden Fault Handling

Whenever the guest OS modifies its page table entries (adding or removing mappings), the VMM may not immediately reflect these changes in the shadow page table. Accessing a page not present in the shadow page table results in a hidden fault. Upon a hidden fault, the VMM updates the shadow page table.
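A minimal sketch of this refill step, with hypothetical flat arrays standing in for the guest page table, the VMM's guest-physical-to-machine map, and the shadow page table (a real VMM's structures are far more elaborate):

#include <stdint.h>
#include <stdbool.h>

#define NPAGES  1024
#define INVALID UINT64_MAX

// Hypothetical flat tables, assumed initialized to INVALID at startup.
static uint64_t guest_pt[NPAGES];    // guest page table:  VPN -> PPN (guest "physical")
static uint64_t pmap[NPAGES];        // VMM's mapping:     PPN -> MPN (machine physical)
static uint64_t shadow_pt[NPAGES];   // shadow page table: VPN -> MPN (used by the MMU)

// Called on a hidden fault: the guest mapping exists, but the shadow
// page table has no entry yet, so the VMM composes the two translations.
bool handle_hidden_fault(uint64_t vpn) {
    uint64_t ppn = guest_pt[vpn];
    if (ppn == INVALID)
        return false;                // true guest fault: reflect it back to the guest OS
    uint64_t mpn = pmap[ppn];
    if (mpn == INVALID)
        return false;                // VMM must first back this guest page with a machine page
    shadow_pt[vpn] = mpn;            // VPN -> MPN, used directly during guest execution
    return true;
}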

Tracing

Hidden faults effectively populate the shadow page table with missing entries. However, we must also address how to remove entries from the shadow page table when they are removed from the guest page table. Without this, a guest process might access invalid pages that have been reallocated, leading to potential security breaches.

The solution is to write-protect the guest page table itself. The guest page table is stored on physical pages known to the host; those pages are write-protected, so any update to them traps to the host. This is how the VMM detects modifications to the guest page table and can then update the shadow page table accordingly. This page-protection method is known as tracing.

Performance

Despite its functionality, classical virtualization exhibits performance bottlenecks primarily arising from the trap-and-emulate mechanism and the overhead associated with tracing memory access operations.

Trap-and-Emulate is Non-Classical

The trap-and-emulate virtualization approach deviates from classical virtualization, as defined by Popek and Goldberg, in two key ways:
  • The guest OS can detect that it is not running in Ring 0 by reading the code segment register, as the sketch below illustrates.
  • Instructions like popf do not trap when executed in a deprivileged ring; they silently drop updates to system flags (such as the interrupt flag), so the VMM never gets a chance to intervene.
These deviations violate the principles of fidelity and safety, undermining the transparency and controlled hardware access that classical virtualization aims to provide.
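The first leak is easy to demonstrate. On x86, the low two bits of the %cs selector encode the current privilege level, so user-mode code (or a deprivileged guest kernel) can simply read them. A minimal sketch using GCC inline assembly on x86-64:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint16_t cs;
    // Read the code segment selector; its low two bits hold the CPL.
    __asm__ volatile ("mov %%cs, %0" : "=r"(cs));
    printf("CPL = %d\n", cs & 3);   // prints 3 in user mode; a deprivileged guest kernel would not see 0
    return 0;
}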

Software Virtualization

Software virtualization fundamentally relies on binary translation. Instead of direct CPU execution, guest instructions are intercepted and translated by the host into safe equivalents. This allows for features like virtualized flags (e.g., from popf) and spoofed CS registers.

Binary translation is not a novel concept; tools like Valgrind utilize it for dynamic bug detection.

However, this translation overhead can significantly impact performance if each guest instruction is translated individually (like an interpreter). To mitigate this, just-in-time (JIT) compilation is employed, similar to the JVM's JIT. This dynamic translation occurs at runtime, on demand, enabling optimizations.

Side Note: JIT Benefits

JIT compilation is a common compiler technique and offers performance enhancements. Notably, it can rearrange code blocks to improve instruction cache locality, reducing cache misses caused by frequent jumps in the original code.

Binary Translation

The software virtualization approach executes guest code in a less privileged level (e.g., CPL 1). This allows most guest instructions to execute natively without full translation, minimizing overhead.

The translation process operates on translation units (TUs), which are basic blocks of guest code. Each TU is translated into a compiled code fragment (CCF). TUs are typically limited in size (e.g., 12 instructions) and terminate with a jump or return instruction.
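The overall flow (translate a TU on first use, cache the resulting CCF, then execute it) can be pictured with a toy model. Everything below is illustrative: the opcodes and "guest code" are invented, and a real translator emits executable x86 code rather than copying bytes:

#include <stdint.h>
#include <stdio.h>

// Toy model of the translator's dispatch loop. "Guest code" is an array
// of fake opcodes, and "translation" just copies a basic block into a
// compiled code fragment (CCF).
enum { OP_NOP, OP_JMP, OP_HALT };

typedef struct {
    uint64_t guest_pc;    // start of the translation unit (TU)
    int      len;         // number of instructions in the fragment
    uint8_t  code[12];    // a TU is capped at ~12 instructions
    int      valid;
} ccf_t;

static const uint8_t guest_code[] = { OP_NOP, OP_NOP, OP_JMP, OP_NOP, OP_HALT };
static ccf_t tcache[64];              // translation cache, keyed by guest PC

// Translate one TU: copy instructions until control flow ends the block.
static ccf_t *translate_tu(uint64_t pc) {
    ccf_t *c = &tcache[pc % 64];
    c->guest_pc = pc;
    c->len = 0;
    while (c->len < 12) {
        uint8_t op = guest_code[pc + c->len];
        c->code[c->len++] = op;                // IDENT instructions copied verbatim
        if (op == OP_JMP || op == OP_HALT)
            break;                             // NON-IDENT control flow terminates the TU
    }
    c->valid = 1;
    printf("translated TU at pc=%llu (%d insns)\n", (unsigned long long)pc, c->len);
    return c;
}

int main(void) {
    uint64_t pc = 0;
    for (;;) {
        ccf_t *c = &tcache[pc % 64];
        if (!c->valid || c->guest_pc != pc)    // translation-cache miss
            c = translate_tu(pc);              // lazy, on-demand (JIT-style) translation
        if (c->code[c->len - 1] == OP_HALT)    // fragment ends the program
            break;
        pc += c->len;                          // fall through to the next TU
    }
    return 0;
}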

The authors demonstrate that translated instructions offer a performance advantage over the conventional trap-and-emulate approach. In a specific instance, a read timestamp counter (rdtsc) operation required 2030 cycles using trap-and-emulate, while the same operation completed in only 216 cycles with TC emulation.

IDENT vs. NON-IDENT Instructions

  • IDENT instructions: These instructions require no translation and can be executed directly. The majority of guest instructions fall into this category.
  • NON-IDENT instructions: These instructions necessitate translation. Examples include:
    • PC-relative addressing - This is because the translated code may not be in the same place as the original code.
    • Control flow - Jump (e.g., jge, jmp) and return instructions. 
    • Privileged instructions - Such as popf.
    • Non-privileged instructions - Such as load, store, etc (adaptive).

Lazy Translation

As a consequence of just-in-time compilation, only executed code paths are translated. Unexecuted code remains untranslated.

Chaining Optimization

The translation process may alter instruction addresses, requiring adjustments to jump targets. If a jump's target turns out to be the very next translated instruction, the jump is replaced with a no-op (a simple fall-through), a technique known as chaining optimization.

Example

Let's run through the translated isPrime example again on input 49, which is not prime.

1. isPrime’:   mov %ecx, %edi 
2.             mov %esi, $2 
3.             cmp %esi, %ecx 
4.             jge [takenAddr]
5.
6. nexti’:     mov %eax, %ecx
7.             cdq idiv %esi 
8.             test %edx, %edx 
9.             jz notPrime’
10.
11.            inc %esi
12.            cmp %esi, %ecx 
13.            jl nexti’
14.            jmp [fallthrAddr3] 
15. notPrime’: xor %eax, %eax
16.            pop %r11
17.            mov %gs:0xff39eb8(%rip), %rcx 
18.            movzx %ecx, %r11b 
19.            jmp %gs:0xfc7dde0(8*%rcx)

Let's break down each point:

  • Line 4: The jump instruction's target address has been altered. While it was intended to point to the prime block, that block was never executed for the given input (49), so its translation was skipped.
  • Lines 5 & 10: The jump instruction jmp [fallthrAddr] was removed because the "fall-through" address was the very next instruction. This is chaining optimization.
  • Line 14: The jump jmp [fallthrAddr3] was retained. However, since the prime block was never translated, the jump's target remains unresolved at this stage.
  • Line 16 - 19: The ret instruction is translated:
    • Line 16: The pop instruction retrieves the return address from the stack into %r11.
    • Line 17: The %rcx register is saved to a memory location. The address is computed by adding a fixed displacement of 0xff39eb8 to %rip, and the resulting offset is interpreted within the segment pointed to by the %gs register.
    • Line 18: The lower 8 bits of the %r11 register (%r11b) are extracted and placed into the %ecx register, with the remaining bits of %ecx being set to zero.
    • Line 19: The location at offset 0xfc7dde0 within the %gs segment is probably a VMM data structure (such as a table) holding the translated return addresses; the index into it comes from the %rcx register.
Note that this translation is produced by the binary translator, which works much like a compiler. It may seem complex, but this is how compilers really work; for instance, spilling %rcx on line 17 was required because %rcx is repurposed to hold another value.

Running the Translator

The translator runs as part of the VMM. It needs to store the data structures used for translation (such as the return-address table above); these live in a memory segment addressed through the %gs register. This memory segment is mapped into the guest so that its cost is accounted to the guest.

Adaptive Binary Translation

Apart from privileged instructions, there are also non-privileged instructions that can result in a trap. For example, load and store instructions that access memory can potentially cause a trap if there is a page fault.

In such cases, the authors employ adaptive translation: the VMM monitors CCFs that cause frequent traps and replaces them with cheaper alternatives.



Figure 1 illustrates an example of this. On the left side, jumping to CCF1 results in a page fault. The translator monitors the faults and then replaces CCF1 with CCF5. All jumps to CCF1 are now adapted to CCF5 which doesn't cause page faults.

Hardware Virtualization

Hardware-assisted virtualization significantly alters virtualization assumptions. Intel's VT extensions provide hardware support for this technique.

Guest Mode

In hardware-assisted virtualization, the CPU introduces a separate guest mode of execution alongside the normal host mode in which the VMM runs. The VMM configures guest mode using the Virtual Machine Control Block (VMCB).

Guest code executes directly on the CPU at the same CPL as VMM (CPL 0). However, based on the VMCB settings, specific instructions trigger traps, transferring control to the VMM.

The vmrun instruction initiates guest mode, and traps result in exits back to the VMM.

Virtual Machine Control Block (VMCB)

The VMCB is a data structure used by the host and CPU to facilitate virtualization. It contains:

  • Control Bits: Used to intercept reads and writes to control registers (%cr0 - %cr15), generating traps.
  • Debug Bits: Similar 32-bit fields for debug registers (%dr0 - %dr15).
  • Intercept Bits: 8 bits to intercept reads and writes to IDTR, GDTR, and LDTR.
  • Instruction Intercepts: Bits to intercept pushf, popf, vmrun, hlt, invlpg, int, iret, and in/out operations to specified ports.
  • Page Table Intercepts: Because the page table base is stored in a control register (%cr3), the VMM can program the VMCB to generate exits on guest page faults and TLB flushes, enabling shadow page table management.
There are numerous additional bits for various virtualization controls.

When the guest OS relinquishes control to the VMM, the guest's state (register values) is saved.
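Putting guest mode and the VMCB together, the VMM's outer loop can be pictured as below. This is a structural sketch only: the field names, exit codes, and the vmrun() stub are invented for illustration and do not match the real SVM/VT-x interfaces, which are only accessible to privileged code:

#include <stdint.h>

typedef struct {
    uint64_t guest_rip;    // saved guest state (small subset shown)
    uint64_t guest_rsp;
    uint64_t intercepts;   // which instructions/events cause exits
    uint64_t exit_code;    // why the guest last exited to the VMM
} vmcb_t;

enum { EXIT_CR3_WRITE = 1, EXIT_PAGE_FAULT, EXIT_IO, EXIT_HLT };

// Stub standing in for the privileged vmrun instruction: it replays a
// scripted sequence of exits so the loop can be exercised.
static void vmrun(vmcb_t *vmcb) {
    static const uint64_t scripted[] = { EXIT_CR3_WRITE, EXIT_PAGE_FAULT, EXIT_HLT };
    static unsigned i = 0;
    vmcb->exit_code = scripted[i++];
}

static void vmm_loop(vmcb_t *vmcb) {
    for (;;) {
        vmrun(vmcb);                 // direct execution in guest mode until an intercept fires
        switch (vmcb->exit_code) {
        case EXIT_CR3_WRITE:  /* update the shadow page table base */   break;
        case EXIT_PAGE_FAULT: /* hidden fault: refill the shadow PT */  break;
        case EXIT_IO:         /* emulate the device access */           break;
        case EXIT_HLT:        return;                                   // guest halted
        }
    }
}

int main(void) {
    vmcb_t vmcb = {0};
    vmm_loop(&vmcb);
    return 0;
}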

Example: fork(2) with Hardware Virtualization

The paper illustrates the fork(2) system call's behavior in hardware virtualization. Let's examine this process and compare it to software virtualization.

It's crucial to understand that both parent and child processes are in an identical state when fork(2) is called, and their states only begin to diverge afterward.

The fork(2) system call initiates a child process that, as an optimization for speed, initially shares the parent's page table. This approach avoids the time-consuming process of immediately copying the entire parent's memory to a new location before the child can begin execution. Instead, the operating system marks all of the parent's memory pages as read-only, effectively write-protecting them. When either the child or the parent process attempts to modify a write-protected page, the operating system intervenes and creates a separate copy of that page for the modifying process. The write-protected, shared pages represent the portion of their shared state that has not yet diverged between parent and child process.

When a guest process calls fork(2):

  • The system call transitions the CPL from 3 to 0, as the guest OS handles it. Since the guest runs at CPL 0, it can process the call without direct VMM intervention. In software virtualization, this would trigger a VMM trap.
  • The guest OS write-protects the parent's page table. Accessing the page table register (%cr3) triggers a VM exit, transferring control to the VMM. The VMM then write-protects the corresponding shadow page table entry.
  • The guest OS resumes and schedules the child process. Scheduling the child requires updating the page table register. This again causes a VM exit. The VMM updates the shadow page table register.
  • From then, hidden page faults are also vectored to the VMM, allowing it to maintain the shadow page table.

Comparison

Turning to the paper's central objective: a comparative analysis of hardware and software virtualization techniques. At first glance, hardware virtualization might appear superior, particularly when VM exits are infrequent. However, the authors' experiments challenge this assumption, revealing several scenarios where binary translation in software virtualization outperforms hardware virtualization.

It's important to acknowledge a potential bias: as VMware developed binary translation, there's a natural inclination towards showcasing the strengths of software virtualization.

Experiment Results

The experimental design focused on scenarios triggering privileged operations, as compute-intensive benchmarks demonstrated near-native performance (Figure 2). Notably, the "mcf" application exhibited performance exceeding native execution, attributed by the authors to virtualization-induced improvements in TLB and cache access patterns.

Figure 3 illustrates that hardware VMMs occasionally outperform software VMMs. However, given the complexity of operating system interactions, attributing definitive conclusions is challenging. The authors' explanations for these variations appeared somewhat inconclusive.

In the fork/wait experiment, software virtualization completed 40,000 fork/wait operations in 36.9 seconds, while hardware virtualization required 106.4 seconds, nearly three times the latency of the software VMM.
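For reference, a fork/wait microbenchmark of this shape is only a few lines of C; this is a sketch of such a loop, not the authors' exact harness:

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

// Each iteration exercises process creation, page table setup/teardown,
// and the associated traps (software VMM) or exits (hardware VMM).
int main(void) {
    const int iterations = 40000;
    for (int i = 0; i < iterations; i++) {
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            exit(1);
        }
        if (pid == 0)
            _exit(0);              // child exits immediately
        waitpid(pid, NULL, 0);     // parent waits before the next iteration
    }
    return 0;
}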

Microbenchmark results further highlighted performance differences:

  • syscall: Software virtualization incurred a significant performance penalty, adding approximately 2,000 cycles compared to hardware or native execution.
  • in: Software virtualization excelled by translating the I/O port read instruction, resulting in rapid execution.
  • cr8wr: Software virtualization again demonstrated superior performance by translating the privileged register write into a more efficient instruction sequence.
  • call/ret: Binary translation introduced overhead, slowing down control flow due to the translation process.
  • pgfault: Software virtualization outperformed hardware, as software-handled page faults proved less costly than VM run/exit cycles.
  • divzero: Hardware virtualization exhibited superior performance, as the CPU-generated trap was handled more efficiently.
  • ptrmod: Software virtualization excelled in manipulating page table entries, as it avoided traps and translated the instruction into a more efficient variant.

Paper Review

To be entirely candid, my academic background in compilers and operating systems has not been my strongest area of study, and I found this particular paper to be exceptionally dense and difficult to navigate. In addition, my lack of practical experience with operating system kernels made it challenging to discern the precise mechanisms by which the various elements of an operating system interface with the VMM. It required a substantial amount of dedicated time to engage in thorough research and to fully comprehend the underlying concepts.

I believe that while this paper demands a significant time commitment, it presents a valuable opportunity to augment one's understanding of key concepts in compilers and operating systems.
