
Paper Insights #2 - A Comparison of Software and Hardware Techniques for x86 Virtualization

This paper, authored by the esteemed Keith Adams and Ole Agesen at VMware, was presented by Professor Dawson Engler at Stanford's CS240, where he offered valuable insights. Professor Engler, who knew Ole Agesen personally, described him as exceptionally detail-oriented. Written in 2006, the paper discusses software virtualization and compares it with hardware virtualization.

Software virtualization was already in widespread use and popular at the time. It was a highly complex undertaking, involving approximately 100,000 lines of code; notably, in that era, only a small team of 5-10 engineers at VMware possessed a comprehensive understanding of the VMware Workstation's binary translator.

Paper Link

This article assumes the reader's basic familiarity with microprocessors and operating systems.

Let's begin with some basic concepts.

Microprocessor

The world of microprocessors is vast and complex. Let's distill the core concepts into a concise overview.

At the heart of every computer lies the microprocessor, the engine responsible for executing instructions and performing computations. In essence, it's the "brain" of the system. In this article, we will use the terms microprocessor and CPU interchangeably.

Each instruction consists of two fundamental parts: the opcode, which specifies the operation to be performed (e.g., addition, subtraction), and one or more operands, which represent the data to be manipulated. While instructions are stored and processed in binary form, assembly language, a low-level symbolic representation, provides a more human-readable format.

Instruction Set Architecture (ISA)

The microprocessor's "manual", known as the Instruction Set Architecture (ISA), defines the complete set of instructions it can execute. Think of it as the control panel of a machine, outlining every possible action. For example, the x86 ISA is a widely adopted standard, supported by a diverse range of processors, including the 8086 (the original x86 microprocessor), Pentium, Celeron, Xeon, Atom, and Intel Core series.

This article focuses on SISD microprocessors, which execute a single instruction on a single data stream. This contrasts with GPUs, which utilize SIMD architecture to apply a single instruction to multiple data streams.

Hardware Architecture

The term microprocessor often refers to a single processing core, the fundamental unit of execution. One or more cores are packaged into a chip that plugs into a socket on the motherboard. While most personal computers use a single socket, server-grade processors like Intel Xeon frequently reside in multi-socket systems.


A microprocessor's internal architecture comprises several key components:
  • Execution Unit: This is where the actual computations occur, primarily through the Arithmetic Logic Unit (ALU), which performs arithmetic and logical operations.
  • Control Unit: The control unit acts as the orchestrator for the flow of instructions and data within the microprocessor. It fetches instructions from memory, decodes them, and generates the necessary control signals to coordinate the other components.
  • Registers: Registers are small, high-speed storage locations within the microprocessor that hold data and instructions temporarily during processing. They are crucial for rapid access and manipulation of frequently used data.

Fetch-Decode-Execute Cycle

The execution of instructions on a single microprocessor follows the fetch-decode-execute cycle:

  1. Fetch: The control unit retrieves an instruction from memory.
  2. Decode: The control unit deciphers the instruction to determine the operation to be performed.
  3. Execute: The execution unit performs the specified operation.

To enhance performance, modern microprocessors employ pipelining, a technique that allows multiple instructions to be processed concurrently in different stages of the cycle. This overlap significantly increases the throughput of the processor.

CPU Cache

Caches are small, high-speed memory units that store frequently accessed data and instructions, enabling faster retrieval. There are typically multiple levels of cache:

  • L1 Cache: The fastest and smallest level of cache, dedicated to each individual core within the microprocessor. It is often split into L1 instruction cache (L1i) and L1 data cache (L1d). L1 cache sizes vary depending upon the CPU, but are often in the tens of KBs.
  • L2 Cache: A larger and slightly slower cache than L1, often shared between cores on the same chip. L2 cache sizes also vary greatly, but are commonly in the hundreds of KBs, or multiple MBs.
  • L3 Cache: The largest and slowest level of cache, typically shared by all cores on the socket. L3 cache sizes can range from several MBs to tens of MBs.

By utilizing cache memory, microprocessors can significantly reduce the time required to access data and instructions, thereby improving overall performance.

To prevent data inconsistencies when a single process runs on multiple CPU cores, each with its own local cache, cache coherence is vital. This complex problem is resolved by hardware-level mechanisms.

CPU Registers

CPU registers are essential components that hold the data and instructions the CPU is actively working on. There is a limited set of registers available. Let's go through some of the registers that x86 supports.

General-Purpose Registers

These are versatile registers used for various operations, including arithmetic, logical, and data movement. They are often labeled as R0, R1, R2, etc., or have specific names like %a, %b, %c, %d in the x86 architecture. 

The naming conventions of x86 registers are deeply rooted in the architecture's evolution. Register %a, for instance, originated as the accumulator register, a central component optimized for arithmetic operations. The specific roles of these registers can differ considerably based on the calling convention and compiler. Nowadays, %a is commonly designated for storing function return values, though this isn't a strict, universally applied rule.

In addition to the more historically named registers, the x86 architecture also incorporates general-purpose registers numbered from %r8 to %r15, which offer increased flexibility for modern programming tasks.

In x86 architecture, registers are often identified by prefixes indicating their size:
  • The prefix r (with suffix x for the named ones) denotes a 64-bit register (e.g., %rax, %rbx, %rcx, %rdx).
  • The prefix e (with suffix x) for named ones, and the suffix d for numbered ones, signify the 32-bit version (e.g., %eax, %ebx, %ecx, %edx, %r8d, %r9d, %r10d, %r11d, %r12d, %r13d, %r14d, %r15d). It is important to understand that the 32-bit registers are simply the lower 32 bits of their corresponding 64-bit registers.
  • The suffix x alone for named ones, and the suffix w for numbered ones, denote the 16-bit version (e.g., %ax, %bx, %cx, %dx, %r8w, %r9w, %r10w, %r11w, %r12w, %r13w, %r14w, %r15w).
  • Finally, the suffix l for named ones and the suffix b for numbered ones denote the 8-bit version (e.g., %al, %bl, %cl, %dl, %r8b, %r9b, %r10b, %r11b, %r12b, %r13b, %r14b, %r15b).
Note: There is also a suffix h that corresponds to the upper 8 bits of the lower 16 bits.

64-bit | Lower 32 bits | Lower 16 bits | Lower 8 bits
======================================================
%rax   | %eax          | %ax           | %al
%rbx   | %ebx          | %bx           | %bl
%rcx   | %ecx          | %cx           | %cl
%rdx   | %edx          | %dx           | %dl
%r8    | %r8d          | %r8w          | %r8b
%r9    | %r9d          | %r9w          | %r9b
%r10   | %r10d         | %r10w         | %r10b
%r11   | %r11d         | %r11w         | %r11b
%r12   | %r12d         | %r12w         | %r12b
%r13   | %r13d         | %r13w         | %r13b
%r14   | %r14d         | %r14w         | %r14b
%r15   | %r15d         | %r15w         | %r15b

Index Registers

Index registers are frequently used in conjunction with loop counters to iterate through data structures. Common examples include the %si (Source Index) and %di (Destination Index) registers, often used together with the %bp (Base Pointer) register. In the 32-bit x86 architecture, their extended counterparts are %esi, %edi, and %ebp; in the 64-bit architecture, they are %rsi, %rdi, and %rbp.

Special-Purpose Registers

  • Program Counter: Holds the address (an offset within the code segment) of the next instruction to be executed. Instructions are fetched from this address. In the x86 architecture, the PC is known as the Instruction Pointer (%ip, %eip, %rip).
  • Stack Pointer (%sp, %esp, %rsp): Points to the top of the stack in memory, used for function calls and temporary data storage.
  • Status Register (or Flags Register): Contains flags that indicate the status of the CPU and the results of operations. Common flags include:
    • Carry Flag (C): Indicates a carry or borrow in an arithmetic operation.
    • Zero Flag (Z): Indicates that the result of an operation is zero.
    • Overflow Flag (O): Indicates an overflow in an arithmetic operation.
    • Sign Flag (S): Indicates the sign of the result.

Segment Registers

A program's memory space is organized into distinct segments, each serving a specific purpose. One crucial segment is the code segment, which holds the program's executable instructions.

Beyond the code segment, other essential memory segments include:

  • Data segment: This segment stores initialized global and static variables. These variables have a fixed memory location throughout the program's execution.
  • BSS segment (Block Started by Symbol): This segment holds uninitialized global and static variables. These variables are typically initialized to zero or null before the program begins execution.
  • Stack segment: This segment manages the dynamic allocation of memory for local variables and function call information. It operates on a Last-In, First-Out (LIFO) principle.
  • Heap segment: This segment is used for dynamic memory allocation during program execution, allowing programs to request and release memory as needed. This is where dynamically allocated memory from functions like malloc() or new are stored.
Corresponding to each of these segments, we have registers storing their addresses:
  • %cs: Points to the code segment. It is used in conjunction with the IP to determine the memory address of the next instruction to be executed. The CS register also encodes the current privilege level (described later) of the CPU.
  • %ds: Points to the data segment.
  • %es: Points to extra data segment. Many compilers use extra data segment for strings.
  • %fs and %gs: General-purpose segment registers that can point to any data segment. Their most common use is to point to thread-local storage.
  • %ss: Points to the stack segment.
There are no dedicated registers for BSS and heap.

Descriptor Table Registers

Segment information is stored in descriptor tables:

  • Global Descriptor Table (GDT): This table, managed by the operating system, contains memory segment information for the entire system.
  • Local Descriptor Table (LDT): Each process maintains its own LDT, which stores memory segment information specific to that process.

The base addresses of these tables are stored in the following registers:

  • The GDTR (Global Descriptor Table Register) holds the base address of the GDT.
  • The LDTR (Local Descriptor Table Register) holds a selector that points to an entry in the GDT, which in turn points to the base address of the current process's LDT.

Additionally, the Interrupt Descriptor Table (IDT) stores information about interrupt handlers. The base address of the IDT is held in the IDTR (Interrupt Descriptor Table Register).

Control Registers

Control registers are used to control the behavior of the microprocessor. With different settings, the microprocessor may behave differently. There are control registers %cr0 through %cr15, although not all of them are implemented. Some of the commonly used control registers are:

  • %cr0: Controls various processor modes and flags, including protected mode, paging, and enabling/disabling features.
  • %cr2: Stores the linear address that caused a page fault.
  • %cr3: Holds the address of the page table (a.k.a. Page Table Register), essential for memory paging.
  • %cr4: Enables various processor extensions and features, such as PAE and virtualization extensions.
  • %cr8: Task Priority Register, used for prioritizing external interrupts.

Addressing Modes

Addressing modes are crucial aspects of CPU architecture that determine how the operands of an instruction are accessed. They define how the effective address of an operand is calculated. There are several different types of addressing modes.

Immediate Addressing

The operand itself is directly embedded within the instruction. For example, the following instruction loads the immediate value 10 into %rax.

mov %rax, $10

Register Addressing

The operand is located in a CPU register. For example, the following instruction copies the value of %rbx into %rax.

mov %rax, %rbx

Direct Addressing (Absolute Addressing)

The instruction contains the actual memory address of the operand. For example, the following instruction moves the value at memory address 1000 into %rax.

mov %rax, [$1000]

Indirect Addressing

The instruction specifies a register that holds the address of the operand. For example, the following instruction moves the value from the memory address held in %rbx into %rax.

mov %rax, [%rbx]

Indexed Addressing

The effective address is calculated by adding an index register's value to a base address (which can be a constant or another register). For example, the following moves the value from the memory address calculated by adding %rbx (base) and %rsi (index), into %rax.

mov %rax, [%rbx + %rsi]

Base-Indexed with Displacement Addressing

The effective address is calculated by adding the values of a base register, an index register, and a displacement (constant offset). For example, the following moves the value from the memory address calculated by adding %rbx, %rsi, and the displacement 10, into %rax.

mov %rax, [%rbx + %rsi + $10]

PC-Relative Addressing

The effective address is calculated relative to the current PC. For example, the following instruction makes the program jump forward by an offset of 10 from the current instruction address.

jmp +$10

Example Program: isPrime

Let's go through an example program to understand these concepts. I am reusing the isPrime program that appears in the paper.

1. int isPrime(int a) {
2.   for (int i = 2; i < a; i++) {
3.     if (a % i == 0) return 0;
4.   }
5.   return 1;
6. }

===================================

1.  isPrime:   mov %ecx, %edi
2.             mov %esi, $2
3.             cmp %esi, %ecx
4.             jge prime

5.  nexti:     mov %eax, %ecx
6.             cdq
7.             idiv %esi 
8.             test %edx, %edx
9.             jz notPrime
10.            inc %esi
11.            cmp %esi, %ecx
12.            jl nexti

13.  prime:    mov %eax, $1
14.            ret 

15.  notPrime: xor %eax, %eax
16.            ret


The isPrime function determines if a given number is prime by iterating from 2 up to one less than the number, checking for divisibility at each step. If any number divides the input, it's not prime; otherwise, it's prime.

For this discussion, I am assuming that the reader understands the fundamental mechanics of function calls. Before a function is called, the caller's register values are saved to the stack. Function arguments are passed either via the stack or, for a limited number of arguments, in registers. The function's return value is placed in a specific register, and after the function finishes, the stack is popped to restore the caller's register state. The compiler is responsible for generating all of this linkage between functions.

Coming back to the example, the function's logic is divided into four labeled blocks:
  • isPrime: This initial block handles the special case where the input number is 2 (or smaller).
    • %ecx is set to %edi, where %edi holds the function's argument "a".
    • %esi (the loop counter) is initialized to 2.
    • %esi is compared to %ecx.
    • If %esi is greater than or equal to %ecx (i.e., the input is 2 or smaller), the program jumps to the prime label; otherwise, it proceeds.
  • nexti:
    • %eax is set to %ecx (the input number).
    • cdq sign-extends %eax into %edx, preparing for signed division.
    • idiv performs the division of %edx:%eax by %esi (the loop counter), with the remainder in %edx. The remainder will be negative for negative input.
    • The remainder is checked for zero.
    • If the remainder is zero (meaning %ecx is divisible by %esi), the program jumps to the notPrime label.
    • %esi is incremented (loop counter).
    • %esi is compared to %ecx.
    • If %esi is less than %ecx, the program jumps back to the nexti label (loop continuation).
  • prime:
    • %eax is set to 1 (indicating prime).
    • The stack is popped (return).
  • notPrime:
    • %eax is set to 0 (indicating not prime).
    • The stack is popped (return).
Note that on line 15, xor %eax, %eax is used instead of mov %eax, $0. The xor form has a shorter encoding and is recognized by the CPU as a zeroing idiom, so historically (and sometimes still) it can be faster than the equivalent mov.

Operating Systems

An operating system (OS) can be viewed as a layer of software that possesses privileged access to the CPU's hardware. The OS acts as an intermediary, orchestrating the execution of user programs. When a user program requires system resources or performs privileged operations, it makes a request to the OS. The OS, in turn, grants or denies these requests, ensuring system stability and security. Once the OS completes its task, it yields control back to the program.

Protection Ring

To enforce security and manage access to hardware resources, modern CPUs implement a system of privilege levels, commonly known as protection rings. These rings define the level of access that software has to the CPU and its peripherals. The most common arrangement is:

  • Ring 0 (Kernel Mode): This is the highest privilege level, reserved for the OS kernel. The kernel, being the core of the OS, has unrestricted access to all CPU instructions and hardware resources.
  • Ring 1 & 2 (Device Driver Mode): These rings are typically used for device drivers, which are software components that enable the OS to communicate with hardware devices. Using different rings for drivers allows for further fine grained control of driver access. Although these rings exist, Ring 2 is rarely used in modern operating systems.
  • Ring 3 (User Mode): This is the lowest privilege level, where user applications run. Applications in Ring 3 have limited access to hardware resources and are restricted from executing privileged instructions. Any attempt to execute a privileged instruction from Ring 3 results in a trap, which transfers control to the OS kernel.

The OS kernel, residing in Ring 0, holds the ultimate authority over the system. It can execute any CPU instruction, whereas applications in Ring 3 are restricted from accessing certain privileged instructions.

The protection ring level in which the CPU currently operates is also called Current Privilege Level (CPL) and is encoded in the %cs register as described before.

Trap

When an application in CPL 3 attempts to execute a privileged instruction, a trap occurs. This trap is a hardware-generated interrupt that forces the CPU to switch to kernel mode (CPL 0) and execute a kernel routine, allowing the OS to handle the event and terminate the program if required.

System Calls

A system call, like a trap, also causes the CPU to switch to kernel mode. Internally, it is implemented as a trap to the kernel. The kernel then executes the requested operation on behalf of the user application.

User applications are often built using libraries, which provide pre-written code for common tasks. These libraries can be linked to the application in two ways:

  • Static Linking: The library code is incorporated directly into the application's executable during compilation. This results in a larger executable but eliminates the need for external library files at runtime.
  • Dynamic Linking: The library code is stored in separate files (e.g., .dll files in Windows, .so files in Linux) and loaded into memory at runtime. This allows for smaller executable files and enables multiple applications to share the same library code.

Most core OS functionalities, such as input/output and network access, are exposed through dynamically linked system libraries (for example, libc on Linux). These libraries encapsulate the underlying system calls.
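As a concrete illustration, consider how a Linux program can reach the same kernel service either through the dynamically linked libc wrapper or by issuing the trap directly. This is a minimal sketch assuming glibc, which provides both the write() wrapper and the generic syscall() interface:

// Two routes to the same kernel service: the libc wrapper write(),
// and a direct trap via syscall() with the syscall number SYS_write.
#include <unistd.h>
#include <sys/syscall.h>
#include <string.h>

int main(void) {
    const char *msg = "hello via libc\n";
    write(STDOUT_FILENO, msg, strlen(msg));                // dynamically linked wrapper

    const char *raw = "hello via raw syscall\n";
    syscall(SYS_write, STDOUT_FILENO, raw, strlen(raw));   // explicit trap into the kernel
    return 0;
}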

Virtual Memory

Virtual memory provides programs with an abstraction of physical memory, creating the illusion of a vast, contiguous address space.

Operating systems manage this virtual address space, translating virtual addresses to physical addresses. A virtual address can map to any location: physical memory, secondary storage (like a hard drive), or even to a non-existent location, as the virtual address space may exceed the available physical resources.

Paging

Direct byte-level mapping of virtual to physical addresses would require an impractically large translation table. To address this, memory is divided into fixed-size units called pages, typically 4 KB.

A page table is a data structure used for translating virtual page numbers (VPNs) to physical page numbers (PPNs) in main memory:

Page Table: VPN -> PPN

Historically, operating systems used only fixed-size pages, which still resulted in large page tables. To reduce the overhead, the concept of superpages (or huge pages) was introduced. Superpages group multiple contiguous pages into a single, larger page, reducing the number of entries in the page table.

Apart from the physical page number, each entry in the page table also contains several control bits (a minimal sketch in C follows the list):

  • Present Bit: Indicates whether the corresponding page is currently present in physical memory. If this bit is not set, a page fault occurs, and the operating system must load the page from secondary storage.
  • Protection Bit: Controls whether the page can be read from or written to. This allows the operating system to enforce memory protection.
  • Mode Bit: Determines whether the page can be accessed by user-level programs or only by the operating system kernel.
  • Dirty Bit: Indicates whether the page has been modified since it was loaded into memory. This is used by the operating system to determine whether a page needs to be written back to disk.
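To make the translation concrete, here is a minimal sketch of a single-level page table with 4 KB pages. The entry layout and table shape are purely illustrative and do not match real x86 paging structures, which are multi-level:

#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT 12                    // 4 KB pages
#define PAGE_SIZE  (1u << PAGE_SHIFT)

// Hypothetical page table entry; real x86 PTEs have a different layout.
typedef struct {
    uint64_t ppn      : 40;  // physical page number
    uint64_t present  : 1;   // present bit: page is in physical memory
    uint64_t writable : 1;   // protection bit
    uint64_t user     : 1;   // mode bit: accessible from user level
    uint64_t dirty    : 1;   // dirty bit: modified since it was loaded
} pte_t;

// Translate a virtual address using a flat (single-level) table.
// Returning false models a page fault (entry not present).
bool translate(const pte_t *table, uint64_t vaddr, uint64_t *paddr) {
    uint64_t vpn    = vaddr >> PAGE_SHIFT;        // virtual page number
    uint64_t offset = vaddr & (PAGE_SIZE - 1);    // offset within the page
    pte_t e = table[vpn];
    if (!e.present)
        return false;                             // would trap to the OS
    *paddr = ((uint64_t)e.ppn << PAGE_SHIFT) | offset;
    return true;
}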

Storing Page Table

The page table itself is stored by the OS in memory. The physical address of the current page table is held in the Page Table Register (%cr3).

Swapping

Accessing a page within main memory is relatively fast. The CPU has a direct memory bus that facilitates rapid data transfer. When data is read from main memory, it's fetched and often cached in CPU cache lines.

Conversely, accessing a page located in secondary storage (like a hard drive or SSD) is significantly slower. The CPU does not have a direct connection to secondary storage. Therefore, a page must first be transferred from secondary storage to main memory before it can be accessed by the CPU.

When main memory becomes full (i.e., there are no free pages), the operating system employs swapping: inactive or less frequently used pages are moved from main memory to secondary storage, freeing up space for active pages. When a swapped-out page is later accessed, a page fault occurs and the page is brought back into memory.

Opposite to swapping is the concept of caching, where disk pages that normally reside in secondary storage are brought into main memory to speed up reads and writes to files. Note that caching is different from demand paging, which brings back pages that are supposed to reside in main memory but were previously swapped out.

It's crucial to understand that the page table only contains entries for pages that are currently mapped to physical memory. Pages that exist only in secondary storage or have not yet been allocated will not have corresponding entries in the page table.

Memory Management Unit (MMU)

The Memory Management Unit (MMU) is a dedicated hardware component integrated into the CPU, playing a pivotal role in virtual memory management.

Translation Lookaside Buffer (TLB)

It might seem that every virtual address translation would require the OS to intervene, as it alone manages page table entries. This would entail a context switch on every memory access, causing significant performance overhead due to the expensive process of saving and restoring CPU state, and flushing CPU caches. To mitigate this, CPUs implement a hardware cache called the Translation Lookaside Buffer (TLB).

The TLB acts as a high-speed translation cache, storing recently used virtual-to-physical page mappings. When the CPU accesses a virtual address, the MMU first checks the TLB. If a matching translation is found (a TLB hit), the physical address is retrieved directly, bypassing the need for a page table lookup.

However, the TLB has a limited capacity, typically a few thousand entries at most. If the required translation is not found (a TLB miss) and the hardware cannot walk the page table itself, a fault is generated. This fault triggers a trap, transferring control to the OS. The OS then performs a page table walk to locate the physical address, updates the TLB with the new mapping, and resumes the process.
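To picture the mechanism, here is a toy software model of a direct-mapped TLB. It is purely illustrative: real TLBs are hardware structures with set associativity, ASIDs, and separate instruction/data arrays:

#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 64                 // hypothetical, small direct-mapped TLB

typedef struct {
    uint64_t vpn;
    uint64_t ppn;
    bool     valid;
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

// On a hit the PPN comes straight from the TLB; on a miss the (much
// slower) page table walk runs and the entry is refilled.
bool tlb_lookup(uint64_t vpn, uint64_t *ppn) {
    tlb_entry_t *e = &tlb[vpn % TLB_ENTRIES];   // direct-mapped indexing
    if (e->valid && e->vpn == vpn) {
        *ppn = e->ppn;                          // TLB hit
        return true;
    }
    return false;                               // TLB miss: walk the page table, then refill
}

void tlb_refill(uint64_t vpn, uint64_t ppn) {
    tlb_entry_t *e = &tlb[vpn % TLB_ENTRIES];
    e->vpn = vpn;
    e->ppn = ppn;
    e->valid = true;
}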

Side Note: TLB, Caches & Context Switches

Upon a process's departure from the CPU, all TLB entries associated with that process are invalidated (flushed). This is crucial to prevent subsequent processes from accessing the previous process's memory pages through stale TLB entries.

Individual stale entries can be invalidated with the invlpg instruction; a full flush typically happens when %cr3 is reloaded during the context switch.

Because TLB entries are cleared, they must be repopulated as the new process executes. This TLB flushing and subsequent repopulation contribute significantly to the overhead of context switches between processes. On top of that, CPU caches also become ineffective after a context switch, further impacting performance.

Consequently, many operating systems employ CPU affinity, assigning processes to dedicated CPU cores, to minimize context switching and improve efficiency.
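As a small, Linux-specific illustration (not something the paper discusses), a process can pin itself to one core with sched_setaffinity:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                          // restrict this process to CPU core 0

    if (sched_setaffinity(getpid(), sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to core 0\n");              // the scheduler will now keep us on core 0
    return 0;
}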

Hardware Walking Page Tables

In addition to OS-driven page table walks during page faults, modern MMUs often feature hardware walking capabilities. This means the MMU itself can automatically traverse the page table structure to locate missing translations.

The physical address of the current process's page table is stored in a special CPU register, the Page Table Register (PTR). On x86, this is register %cr3. This register allows the MMU to locate the active page table during address translation.

When a TLB miss occurs, the MMU's hardware logic takes over, fetching the necessary page table entries from memory. This process eliminates the overhead of a context switch to the kernel, significantly speeding up address translation. The MMU is designed to understand the page table structure, allowing it to efficiently navigate through multi-level page tables.

Virtualization

Virtualization essentially involves creating an illusion for a guest operating system, making it believe it's running directly on physical hardware. The goal is complete transparency, where the guest OS cannot distinguish its virtualized environment from a native one. This is achieved by emulating hardware behavior and maintaining consistency through shadow data structures.

Strawman

A basic virtualization approach utilizes a trap-and-emulate model. In this setup, the guest OS itself runs as a standard application within the host OS, executing in CPL 3. Consequently, any privileged instruction executed by the guest OS, such as I/O operations, triggers a trap, transferring control to the host OS. The host OS then emulates the requested operation on behalf of the guest.

However, I/O is just one aspect of virtualization. Memory management presents another challenge. The guest OS is allocated a limited memory space, which, being a process within the host, also has its own virtual address space. This necessitates a two-tiered address translation: 

  • Applications within the guest have their own page tables, maintained by the guest OS, which translate their virtual addresses into the guest's notion of physical addresses.
  • The guest's "physical" addresses must then be translated again into the actual physical addresses of the host.
This double translation introduces significant performance overhead.

While this strawman model provides a functional virtualization framework, it's inherently inefficient.

Full vs. OS Virtualization

Virtualization can itself be categorized into the following types:

Full virtualization

Full virtualization is the complete virtualization of the underlying hardware, allowing software environments, including a guest OS and its applications, to run unmodified. The guest OS, running inside a virtual machine (VM), remains unaware of its execution within a virtual environment. VMs operate on top of hypervisors, which can be categorized into two types:

  • Type 1 hypervisors: Run directly on bare-metal hardware, such as ESX Server.
  • Type 2 hypervisors: Run on top of another OS, such as VirtualBox.

Note: A hypervisor is also known as a Virtual Machine Monitor (VMM).

OS-level Virtualization

OS-level virtualization isolates a pool of resources within the operating system. This encompasses technologies such as containers (like Docker), jails (like FreeBSD jail), chroot, and virtual environments. The most popular among them today are containers.

For the purposes of this article, the term "virtualization" is used to denote full virtualization only.

Recommended Read: Paper Insights - Firecracker: Lightweight Virtualization for Serverless Applications for more on OS-level virtualization technique.

Classical Virtualization

To ensure clarity in our discussion of virtualization techniques, let's establish key definitions: Host will consistently refer to the VMM, which can operate either as an application on a host OS (Type 2) or directly on bare metal (Type 1). Guest OS denotes the guest OS running on top of the VMM, while guest process refers to any process executing within the guest OS environment.

VMM Characteristics

Popek and Goldberg defined three essential characteristics of a VMM for classical virtualization:

  • Fidelity: The guest OS software must execute identically to its execution on the native hardware, requiring no modifications.
  • Performance: A significant majority of guest OS instructions should execute directly on the CPU, minimizing traps.
  • Safety: All hardware accesses must be mediated by the VMM.

Refining Strawman

Let's refine our basic virtualization model, still using the trap-and-emulate approach. Guest instructions execute in a deprivileged mode, and privileged instructions trigger traps, transferring control to the VMM for emulation.

Shadow Page Table

One key optimization involves the page table translation process. Instead of double translation, we introduce a shadow page table that directly maps the virtual address space of a guest process to machine (host) physical page numbers.

  • Guest Page Table: VPN -> PPN
  • Shadow Page Table: VPN -> MPN
Here, PPN refers to the physical page number as the guest knows it, and MPN refers to the machine physical page number, i.e., the physical page number known to the host.

The shadow page table is used for actual address translation during guest process execution, eliminating the overhead of double translations. The guest OS remains aware only of its own guest page table.

Hidden Fault Handling

Whenever the guest OS modifies its page table entries (adding or removing mappings), the VMM may not immediately reflect these changes in the shadow page table. Accessing a page not present in the shadow page table results in a hidden fault. Upon a hidden fault, the VMM updates the shadow page table.
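A minimal sketch of this refill step, with hypothetical flat arrays standing in for the guest page table, the VMM's guest-physical-to-machine map, and the shadow page table (a real VMM's structures are far more elaborate):

#include <stdint.h>
#include <stdbool.h>

#define NPAGES  1024
#define INVALID UINT64_MAX

// Hypothetical flat tables, assumed initialized to INVALID at startup.
static uint64_t guest_pt[NPAGES];    // guest page table:  VPN -> PPN (guest "physical")
static uint64_t pmap[NPAGES];        // VMM's mapping:     PPN -> MPN (machine physical)
static uint64_t shadow_pt[NPAGES];   // shadow page table: VPN -> MPN (used by the MMU)

// Called on a hidden fault: the guest mapping exists, but the shadow
// page table has no entry yet, so the VMM composes the two translations.
bool handle_hidden_fault(uint64_t vpn) {
    uint64_t ppn = guest_pt[vpn];
    if (ppn == INVALID)
        return false;                // true guest fault: reflect it back to the guest OS
    uint64_t mpn = pmap[ppn];
    if (mpn == INVALID)
        return false;                // VMM must first back this guest page with a machine page
    shadow_pt[vpn] = mpn;            // VPN -> MPN, used directly during guest execution
    return true;
}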

Tracing

Hidden faults effectively populate the shadow page table with missing entries. However, we must also address how to remove entries from the shadow page table when they are removed from the guest page table. Without this, a guest process might access invalid pages that have been reallocated, leading to potential security breaches.

The solution is to write-protect the guest page table itself. The guest page table is stored on physical pages known to the host; those pages are write-protected, so any update to them traps to the host. This is how the VMM detects modifications to the guest page table and can then update the shadow page table accordingly. This page-protection method is known as tracing.

Performance

Despite its functionality, classical virtualization exhibits performance bottlenecks primarily arising from the trap-and-emulate mechanism and the overhead associated with tracing memory access operations.

Trap-and-Emulate is Non-Classical

The trap-and-emulate virtualization approach deviates from classical virtualization, as defined by Popek and Goldberg, in two key ways:
  • The guest OS can detect that it is not running in Ring 0 by reading the code segment register, as the sketch below illustrates.
  • Instructions like popf do not trap when executed in a deprivileged ring; they silently drop updates to system flags (such as the interrupt flag), so the VMM never gets a chance to intervene.
These deviations violate the principles of fidelity and safety, undermining the transparency and controlled hardware access that classical virtualization aims to provide.
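The first leak is easy to demonstrate. On x86, the low two bits of the %cs selector encode the current privilege level, so user-mode code (or a deprivileged guest kernel) can simply read them. A minimal sketch using GCC inline assembly on x86-64:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint16_t cs;
    // Read the code segment selector; its low two bits hold the CPL.
    __asm__ volatile ("mov %%cs, %0" : "=r"(cs));
    printf("CPL = %d\n", cs & 3);   // prints 3 in user mode; a deprivileged guest kernel would not see 0
    return 0;
}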

Software Virtualization

Software virtualization fundamentally relies on binary translation. Instead of direct CPU execution, guest instructions are intercepted and translated by the host into safe equivalents. This allows for features like virtualized flags (e.g., from popf) and spoofed CS registers.

Binary translation is not a novel concept; tools like Valgrind utilize it for dynamic bug detection.

However, this translation overhead can significantly impact performance if each guest instruction is translated individually (like an interpreter). To mitigate this, just-in-time (JIT) compilation is employed, similar to the JVM's JIT. This dynamic translation occurs at runtime, on demand, enabling optimizations.

Side Note: JIT Benefits

JIT compilation is a common compiler technique and offers performance enhancements. Notably, it can rearrange code blocks to improve instruction cache locality, reducing cache misses caused by frequent jumps in the original code.

Binary Translation

The software virtualization approach executes guest code in a less privileged level (e.g., CPL 1). This allows most guest instructions to execute natively without full translation, minimizing overhead.

The translation process operates on translation units (TUs), which are basic blocks of guest code. Each TU is translated into a compiled code fragment (CCF). TUs are typically limited in size (e.g., 12 instructions) and terminate with a jump or return instruction.
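The overall flow (translate a TU on first use, cache the resulting CCF, then execute it) can be pictured with a toy model. Everything below is illustrative: the opcodes and "guest code" are invented, and a real translator emits executable x86 code rather than copying bytes:

#include <stdint.h>
#include <stdio.h>

// Toy model of the translator's dispatch loop. "Guest code" is an array
// of fake opcodes, and "translation" just copies a basic block into a
// compiled code fragment (CCF).
enum { OP_NOP, OP_JMP, OP_HALT };

typedef struct {
    uint64_t guest_pc;    // start of the translation unit (TU)
    int      len;         // number of instructions in the fragment
    uint8_t  code[12];    // a TU is capped at ~12 instructions
    int      valid;
} ccf_t;

static const uint8_t guest_code[] = { OP_NOP, OP_NOP, OP_JMP, OP_NOP, OP_HALT };
static ccf_t tcache[64];              // translation cache, keyed by guest PC

// Translate one TU: copy instructions until control flow ends the block.
static ccf_t *translate_tu(uint64_t pc) {
    ccf_t *c = &tcache[pc % 64];
    c->guest_pc = pc;
    c->len = 0;
    while (c->len < 12) {
        uint8_t op = guest_code[pc + c->len];
        c->code[c->len++] = op;                // IDENT instructions copied verbatim
        if (op == OP_JMP || op == OP_HALT)
            break;                             // NON-IDENT control flow terminates the TU
    }
    c->valid = 1;
    printf("translated TU at pc=%llu (%d insns)\n", (unsigned long long)pc, c->len);
    return c;
}

int main(void) {
    uint64_t pc = 0;
    for (;;) {
        ccf_t *c = &tcache[pc % 64];
        if (!c->valid || c->guest_pc != pc)    // translation-cache miss
            c = translate_tu(pc);              // lazy, on-demand (JIT-style) translation
        if (c->code[c->len - 1] == OP_HALT)    // fragment ends the program
            break;
        pc += c->len;                          // fall through to the next TU
    }
    return 0;
}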

The authors demonstrate that translated instructions offer a performance advantage over the conventional trap-and-emulate approach. In a specific instance, a read timestamp counter (rdtsc) operation required 2030 cycles using trap-and-emulate, while the same operation completed in only 216 cycles with TC emulation.

IDENT vs. NON-IDENT Instructions

  • IDENT instructions: These instructions require no translation and can be executed directly. The majority of guest instructions fall into this category.
  • NON-IDENT instructions: These instructions necessitate translation. Examples include:
    • PC-relative addressing - This is because the translated code may not be in the same place as the original code.
    • Control flow - Jump (e.g., jge, jmp) and return instructions. 
    • Privileged instructions - Such as popf.
    • Non-privileged instructions - Such as load, store, etc (adaptive).

Lazy Translation

As a consequence of just-in-time compilation, only executed code paths are translated. Unexecuted code remains untranslated.

Chaining Optimization

The translation process may alter instruction addresses, requiring adjustments to jump targets. If a jump's target turns out to be the very next translated instruction, the jump is replaced with a no-op (a simple fall-through), a technique known as chaining optimization.

Example

Let's run through the translated isPrime example again on input 49, which is not prime.

1. isPrime’:   mov %ecx, %edi 
2.             mov %esi, $2 
3.             cmp %esi, %ecx 
4.             jge [takenAddr]
5.
6. nexti’:     mov %eax, %ecx
7.             cdq idiv %esi 
8.             test %edx, %edx 
9.             jz notPrime’
10.
11.            inc %esi
12.            cmp %esi, %ecx 
13.            jl nexti’
14.            jmp [fallthrAddr3] 
15. notPrime’: xor %eax, %eax
16.            pop %r11
17.            mov %gs:0xff39eb8(%rip), %rcx 
18.            movzx %ecx, %r11b 
19.            jmp %gs:0xfc7dde0(8*%rcx)

Let's break down each point:

  • Line 4: The jump instruction's target address has been altered. While it was intended to point to the prime block, that block was never executed for the given input (49), so its translation was skipped.
  • Lines 5 & 10: The jump instruction jmp [fallthrAddr] was removed because the "fall-through" address was the very next instruction. This is chaining optimization.
  • Line 14: The jump jmp [fallthrAddr3] was retained. However, since the prime block was never translated, the jump's target remains unresolved at this stage.
  • Line 16 - 19: The ret instruction is translated:
    • Line 16: The pop instruction retrieves the return address from the stack into %r11.
    • Line 17: The %rcx register is saved to a memory location. The address is computed by adding a fixed displacement of 0xff39eb8 to %rip, and the resulting offset is interpreted within the segment pointed to by the %gs register.
    • Line 18: The lower 8 bits of the %r11 register (%r11b) are extracted and placed into the %ecx register, with the remaining bits of %ecx being set to zero.
    • Line 19: The location at offset 0xfc7dde0 within the %gs segment is probably a VMM data structure (such as a table) holding the translated return addresses; the index into it comes from the %rcx register.
Note that this translation is produced by the binary translator, which works much like a compiler. It may seem complex, but this is how compilers really work; for instance, spilling %rcx on line 17 was required because %rcx is repurposed to hold another value.

Running the Translator

The translator runs as part of the VMM. It needs to store the data structures used for translation (such as the return-address table above); these live in a memory segment addressed through the %gs register. This memory segment is mapped into the guest so that its cost is accounted to the guest.

Adaptive Binary Translation

Apart from privileged instructions, there are also non-privileged instructions that can result in a trap. For example, load and store instructions that access memory can potentially cause a trap if there is a page fault.

In such cases, the authors employ adaptive translation: the VMM monitors CCFs that cause frequent traps and replaces them with cheaper alternatives.



Figure 1 illustrates an example of this. On the left side, jumping to CCF1 results in a page fault. The translator monitors the faults and then replaces CCF1 with CCF5. All jumps to CCF1 are now adapted to CCF5 which doesn't cause page faults.

Hardware Virtualization

Hardware-assisted virtualization significantly alters virtualization assumptions. Intel's VT extensions provide hardware support for this technique.

Guest Mode

In hardware-assisted virtualization, the CPU introduces a separate guest mode of execution alongside the normal host mode in which the VMM runs. The VMM configures guest mode using the Virtual Machine Control Block (VMCB).

Guest code executes directly on the CPU at the same CPL as VMM (CPL 0). However, based on the VMCB settings, specific instructions trigger traps, transferring control to the VMM.

The vmrun instruction initiates guest mode, and traps result in exits back to the VMM.

Virtual Machine Control Block (VMCB)

The VMCB is a data structure used by the host and CPU to facilitate virtualization. It contains:

  • Control Bits: Used to intercept reads and writes to control registers (%cr0 - %cr15), generating traps.
  • Debug Bits: Similar 32-bit fields for debug registers (%dr0 - %dr15).
  • Intercept Bits: 8 bits to intercept reads and writes to IDTR, GDTR, and LDTR.
  • Instruction Intercepts: Bits to intercept pushf, popf, vmrun, hlt, invlpg, int, iret, and in/out operations to specified ports.
  • Page Table Intercepts: Because the page table base is stored in a control register (%cr3), the VMM can program the VMCB to generate exits on guest page faults and TLB flushes, enabling shadow page table management.
There are numerous additional bits for various virtualization controls.

When the guest OS relinquishes control to the VMM, the guest's state (register values) is saved.
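Putting guest mode and the VMCB together, the VMM's outer loop can be pictured as below. This is a structural sketch only: the field names, exit codes, and the vmrun() stub are invented for illustration and do not match the real SVM/VT-x interfaces, which are only accessible to privileged code:

#include <stdint.h>

typedef struct {
    uint64_t guest_rip;    // saved guest state (small subset shown)
    uint64_t guest_rsp;
    uint64_t intercepts;   // which instructions/events cause exits
    uint64_t exit_code;    // why the guest last exited to the VMM
} vmcb_t;

enum { EXIT_CR3_WRITE = 1, EXIT_PAGE_FAULT, EXIT_IO, EXIT_HLT };

// Stub standing in for the privileged vmrun instruction: it replays a
// scripted sequence of exits so the loop can be exercised.
static void vmrun(vmcb_t *vmcb) {
    static const uint64_t scripted[] = { EXIT_CR3_WRITE, EXIT_PAGE_FAULT, EXIT_HLT };
    static unsigned i = 0;
    vmcb->exit_code = scripted[i++];
}

static void vmm_loop(vmcb_t *vmcb) {
    for (;;) {
        vmrun(vmcb);                 // direct execution in guest mode until an intercept fires
        switch (vmcb->exit_code) {
        case EXIT_CR3_WRITE:  /* update the shadow page table base */   break;
        case EXIT_PAGE_FAULT: /* hidden fault: refill the shadow PT */  break;
        case EXIT_IO:         /* emulate the device access */           break;
        case EXIT_HLT:        return;                                   // guest halted
        }
    }
}

int main(void) {
    vmcb_t vmcb = {0};
    vmm_loop(&vmcb);
    return 0;
}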

Example: fork(2) with Hardware Virtualization

The paper illustrates the fork(2) system call's behavior in hardware virtualization. Let's examine this process and compare it to software virtualization.

It's crucial to understand that both parent and child processes are in an identical state when fork(2) is called, and their states only begin to diverge afterward.

The fork(2) system call initiates a child process that, as an optimization for speed, initially shares the parent's page table. This approach avoids the time-consuming process of immediately copying the entire parent's memory to a new location before the child can begin execution. Instead, the operating system marks all of the parent's memory pages as read-only, effectively write-protecting them. When either the child or the parent process attempts to modify a write-protected page, the operating system intervenes and creates a separate copy of that page for the modifying process. The write-protected, shared pages represent the portion of their shared state that has not yet diverged between parent and child process.

When a guest process calls fork(2):

  • The system call transitions the CPL from 3 to 0, as the guest OS handles it. Since the guest runs at CPL 0, it can process the call without direct VMM intervention. In software virtualization, this would trigger a VMM trap.
  • The guest OS write-protects the parent's page table. Accessing the page table register (%cr3) triggers a VM exit, transferring control to the VMM. The VMM then write-protects the corresponding shadow page table entry.
  • The guest OS resumes and schedules the child process. Scheduling the child requires updating the page table register. This again causes a VM exit. The VMM updates the shadow page table register.
  • From then, hidden page faults are also vectored to the VMM, allowing it to maintain the shadow page table.

Comparison

Turning to the paper's central objective: a comparative analysis of hardware and software virtualization techniques. At first glance, hardware virtualization might appear superior, particularly when VM exits are infrequent. However, the authors' experiments challenge this assumption, revealing several scenarios where binary translation in software virtualization outperforms hardware virtualization.

It's important to acknowledge a potential bias: as VMware developed binary translation, there's a natural inclination towards showcasing the strengths of software virtualization.

Experiment Results

The experimental design focused on scenarios triggering privileged operations, as compute-intensive benchmarks demonstrated near-native performance (Figure 2). Notably, the "mcf" application exhibited performance exceeding native execution, attributed by the authors to virtualization-induced improvements in TLB and cache access patterns.

Figure 3 illustrates that hardware VMMs occasionally outperform software VMMs. However, given the complexity of operating system interactions, attributing definitive conclusions is challenging. The authors' explanations for these variations appeared somewhat inconclusive.

In the fork/wait experiment, software virtualization completed 40,000 fork/wait operations in 36.9 seconds, while hardware virtualization required 106.4 seconds, nearly three times the latency of the software VMM.
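For reference, a fork/wait microbenchmark of this shape is only a few lines of C; this is a sketch of such a loop, not the authors' exact harness:

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

// Each iteration exercises process creation, page table setup/teardown,
// and the associated traps (software VMM) or exits (hardware VMM).
int main(void) {
    const int iterations = 40000;
    for (int i = 0; i < iterations; i++) {
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            exit(1);
        }
        if (pid == 0)
            _exit(0);              // child exits immediately
        waitpid(pid, NULL, 0);     // parent waits before the next iteration
    }
    return 0;
}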

Microbenchmark results further highlighted performance differences:

  • syscall: Software virtualization incurred a significant performance penalty, adding approximately 2,000 cycles compared to hardware or native execution.
  • in: Software virtualization excelled by translating the I/O port read instruction, resulting in rapid execution.
  • cr8wr: Software virtualization again demonstrated superior performance by translating the privileged register write into a more efficient instruction sequence.
  • call/ret: Binary translation introduced overhead, slowing down control flow due to the translation process.
  • pgfault: Software virtualization outperformed hardware, as software-handled page faults proved less costly than VM run/exit cycles.
  • divzero: Hardware virtualization exhibited superior performance, as the CPU-generated trap was handled more efficiently.
  • ptrmod: Software virtualization excelled in manipulating page table entries, as it avoided traps and translated the instruction into a more efficient variant.

Paper Review

To be entirely candid, my academic background in compilers and operating systems has not been my strongest area of study, and I found this particular paper to be exceptionally dense and difficult to navigate. In addition, my lack of practical experience with operating system kernels made it challenging to discern the precise mechanisms by which the various elements of an operating system interface with the VMM. It required a substantial amount of dedicated time to engage in thorough research and to fully comprehend the underlying concepts.

I believe that while this paper demands a significant time commitment, it presents a valuable opportunity to augment one's understanding of key concepts in compilers and operating systems.
