This paper, authored by Keith Adams and Ole Agesen at VMware, was presented by Professor Dawson Engler at Stanford's CS240, where he offered valuable insights. Professor Engler, who knew Ole Agesen personally, described him as exceptionally detail-oriented. Written in 2006, the paper discusses software virtualization and compares it with hardware virtualization.
Software virtualization was already in use and popular at the time. It was a highly complex undertaking, involving approximately 100,000 lines of code; in that era, only a small team of 5-10 engineers at VMware possessed a comprehensive understanding of the VMware Workstation's binary translator.
This article assumes the reader has basic familiarity with microprocessors and operating systems.
Let's begin with some basic concepts.
Microprocessor
The world of microprocessors is vast and complex. Let's distill the core concepts into a concise overview.
At the heart of every computer lies the microprocessor, the engine responsible for executing instructions and performing computations. In essence, it's the "brain" of the system. In this article, we will use the term microprocessor and CPU interchangeably.
Each instruction consists of two fundamental parts: the opcode, which specifies the operation to be performed (e.g., addition, subtraction), and one or more operands, which represent the data to be manipulated. While instructions are stored and processed in binary form, assembly language, a low-level symbolic representation, provides a more human-readable format.
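As a tiny illustration (destination operand written first, a convention used throughout this article), the instruction below has the opcode add and two register operands:
add %rax, %rbx
It adds the value of %rbx into %rax.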
Instruction Set Architecture (ISA)
The microprocessor's "manual", known as the Instruction Set Architecture (ISA), defines the complete set of instructions it can execute. Think of it as the control panel of a machine, outlining every possible action. For example, the x86 ISA is a widely adopted standard, supported by a diverse range of processors, including the 8086 (the original x86 microprocessor), Pentium, Celeron, Xeon, Atom, and Intel Core series.
This article focuses on SISD microprocessors, which execute a single instruction on a single data stream. This contrasts with GPUs, which utilize SIMD architecture to apply a single instruction to multiple data streams.
Hardware Architecture
- Execution Unit: This is where the actual computations occur, primarily through the Arithmetic Logic Unit (ALU), which performs arithmetic and logical operations.
- Control Unit: The control unit acts as the orchestrator for the flow of instructions and data within the microprocessor. It fetches instructions from memory, decodes them, and generates the necessary control signals to coordinate the other components.
- Registers: Registers are small, high-speed storage locations within the microprocessor that hold data and instructions temporarily during processing. They are crucial for rapid access and manipulation of frequently used data.
Fetch-Decode-Execute Cycle
The execution of instructions on a single microprocessor follows the fetch-decode-execute cycle:
- Fetch: The control unit retrieves an instruction from memory.
- Decode: The control unit deciphers the instruction to determine the operation to be performed.
- Execute: The execution unit performs the specified operation.
To enhance performance, modern microprocessors employ pipelining, a technique that allows multiple instructions to be processed concurrently in different stages of the cycle. This overlap significantly increases the throughput of the processor.
CPU Cache
Caches are small, high-speed memory units that store frequently accessed data and instructions, enabling faster retrieval. There are typically multiple levels of cache:
- L1 Cache: The fastest and smallest level of cache, dedicated to each individual core within the microprocessor. It is often split into L1 instruction cache (L1i) and L1 data cache (L1d). L1 cache sizes vary depending upon the CPU, but are often in the tens of KBs.
- L2 Cache: A larger and slightly slower cache than L1, often shared between cores on the same chip. L2 cache sizes also vary greatly, but are commonly in the hundreds of KBs, or multiple MBs.
- L3 Cache: The largest and slowest level of cache, typically shared by all cores on the socket. L3 cache sizes can range from several MBs to tens of MBs.
By utilizing cache memory, microprocessors can significantly reduce the time required to access data and instructions, thereby improving overall performance.
To prevent data inconsistencies when a single process runs on multiple CPU cores, each with its own local cache, cache coherence is vital. This complex problem is resolved by hardware-level mechanisms.
CPU Registers
CPU registers are essential components that hold data and instructions that the CPU is actively working on. Only a limited set of registers is available. Let's go through some of the registers that x86 supports.
General-Purpose Registers
These are versatile registers used for various operations, including arithmetic, logical, and data-movement operations. They are often labeled R0, R1, R2, etc., or have specific names like a, b, c, and d in the x86 architecture.
- The prefix r and suffix x for named ones denote a 64-bit register (e.g., %rax, %rbx, %rcx, %rdx); the numbered 64-bit registers are simply %r8 through %r15.
- The prefix e and suffix x for named ones and a suffix d for numbered ones signify the 32-bit version (e.g., %eax, %ebx, %ecx, %edx, %r8d, %r9d, %r10d, %r11d, %r12d, %r13d, %r14d, %r15d). It is important to understand that the 32-bit registers are simply the lower 32 bits of their corresponding 64-bit registers.
- The suffix x for named ones and the suffix w for numbered ones denote a 16-bit version (e.g., %ax, %bx, %cx, %dx, %r8w, %r9w, %r10w, %r11w, %r12w, %r13w, %r14w, %r15w).
- Finally, the suffix l for named ones and the suffix b for numbered ones denote the 8-bit version (e.g., %al, %bl, %cl, %dl, %r8b, %r9b, %r10b, %r11b, %r12b, %r13b, %r14b, %r15b). A small example of this aliasing follows the table below.
64-bit | Lower 32 bits | Lower 16 bits | Lower 8 bits
======================================================
%rax | %eax | %ax | %al
%rbx | %ebx | %bx | %bl
%rcx | %ecx | %cx | %cl
%rdx | %edx | %dx | %dl
%r8 | %r8d | %r8w | %r8b
%r9 | %r9d | %r9w | %r9b
%r10 | %r10d | %r10w | %r10b
%r11 | %r11d | %r11w | %r11b
%r12 | %r12d | %r12w | %r12b
%r13 | %r13d | %r13w | %r13b
%r14 | %r14d | %r14w | %r14b
%r15 | %r15d | %r15w | %r15b
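A small illustration of this aliasing (destination operand first; the values are hypothetical):
mov %rax, $0x1122334455667788   ; fill all 64 bits of %rax
mov %ax,  $0xffff               ; writes only the low 16 bits: %rax = 0x112233445566ffff
mov %al,  $0x00                 ; writes only the low 8 bits:  %rax = 0x112233445566ff00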
Index Registers
x86 also provides index registers, most notably %rsi (source index) and %rdi (destination index), which are used implicitly by string instructions and, in common calling conventions, to pass function arguments.
Special-Purpose Registers
- Program Counter: Holds the address of the next instruction to be executed (on x86, an offset within the code segment). Instructions are fetched from this address. In the x86 architecture, the PC is known as the Instruction Pointer (%ip, %eip, %rip).
- Stack Pointer (%sp, %esp, %rsp): Points to the top of the stack in memory, used for function calls and temporary data storage.
- Status Register (or Flags Register): Contains flags that indicate the status of the CPU and the results of operations (a short example follows this list). Common flags include:
- Carry Flag (C): Indicates a carry or borrow in an arithmetic operation.
- Zero Flag (Z): Indicates that the result of an operation is zero.
- Overflow Flag (O): Indicates an overflow in an arithmetic operation.
- Sign Flag (S): Indicates the sign of the result.
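As a small illustration (the label is hypothetical; destination operand first), a comparison sets the flags and a conditional jump reads them:
cmp %rax, %rbx      ; computes %rax - %rbx and updates the zero, sign, carry, and overflow flags
je  equal           ; jumps to the label equal if the zero flag is set (i.e., %rax == %rbx)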
Segment Registers
A program's memory space is organized into distinct segments, each serving a specific purpose. One crucial segment is the code segment, which holds the program's executable instructions.
Beyond the code segment, other essential memory segments include:
- Data segment: This segment stores initialized global and static variables. These variables have a fixed memory location throughout the program's execution.
- BSS segment (Block Started by Symbol): This segment holds uninitialized global and static variables. These variables are typically initialized to zero or null before the program begins execution.
- Stack segment: This segment manages the dynamic allocation of memory for local variables and function call information. It operates on a Last-In, First-Out (LIFO) principle.
- Heap segment: This segment is used for dynamic memory allocation during program execution, allowing programs to request and release memory as needed. This is where dynamically allocated memory from functions like malloc() or new are stored.
The x86 segment registers hold selectors that identify these segments:
- %cs: Points to the code segment. It is used in conjunction with the IP to determine the memory address of the next instruction to be executed. The CS register also encodes the current privilege level (described later) of the CPU.
- %ds: Points to the data segment.
- %es: Points to the extra data segment. Many compilers use the extra data segment for strings.
- %fs and %gs: General-purpose segment registers that can point to any data segment. However, their most common use is to point to thread-local storage.
- %ss: Points to the stack segment.
Descriptor Table Registers
Segment information is stored in descriptor tables:
- Global Descriptor Table (GDT): This table, managed by the operating system, contains memory segment information for the entire system.
- Local Descriptor Table (LDT): Each process maintains its own LDT, which stores memory segment information specific to that process.
The base addresses of these tables are stored in the following registers:
- The GDTR (Global Descriptor Table Register) holds the base address of the GDT.
- The LDTR (Local Descriptor Table Register) holds a selector that points to an entry in the GDT, which in turn points to the base address of the current process's LDT.
Additionally, the Interrupt Descriptor Table (IDT) stores information about interrupt handlers. The base address of the IDT is held in the IDTR (Interrupt Descriptor Table Register).
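As a rough sketch of how a kernel installs these tables (the labels are hypothetical, and all three instructions are privileged):
lgdt [gdt_descriptor]   ; load the GDTR with the base address and limit of the GDT
lidt [idt_descriptor]   ; load the IDTR with the base address and limit of the IDT
lldt %ax                ; load the LDTR with a GDT selector held in %ax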
Control Registers
Control registers configure the behavior of the microprocessor; with different settings, it may behave differently. The architecture defines control registers %cr0 through %cr15, although not all of them are usable. Some of the commonly used control registers are listed below, followed by a short example:
- %cr0: Controls various processor modes and flags, including protected mode, paging, and enabling/disabling features.
- %cr2: Stores the linear address that caused a page fault.
- %cr3: Holds the address of the page table (a.k.a. Page Table Register), essential for memory paging.
- %cr4: Enables various processor extensions and features, such as PAE and virtualization extensions.
- %cr8: Task Priority Register, used for prioritizing external interrupts.
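As a hedged sketch, this is roughly how a kernel turns on paging by setting the PG bit of %cr0 (privileged instructions; destination operand first):
mov %rax, %cr0           ; read the current value of %cr0
or  %rax, $0x80000000    ; set bit 31, the PG (paging) bit
mov %cr0, %rax           ; write it back; this faults if executed outside Ring 0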
Addressing Modes
Addressing modes are crucial aspects of CPU architecture that determine how the operands of an instruction are accessed.
Immediate Addressing
The operand itself is directly embedded within the instruction. For example, the following instruction sets %rax to the value 10.
mov %rax, $10
Register Addressing
The operand is located in a CPU register. For example, the following instruction copies the value of %rbx into %rax.
mov %rax, %rbx
Direct Addressing (Absolute Addressing)
The instruction contains the actual memory address of the operand. For example, the following instruction moves the value at memory address 1000 into %rax.
mov %rax, [$1000]
Indirect Addressing
The instruction specifies a register that holds the address of the operand. For example, the following instruction moves the value at the memory address held in %rbx into %rax.
mov %rax, [%rbx]
Indexed Addressing
The effective address is calculated by adding an index register's value to a base address (which can be a constant or another register). For example, the following instruction moves the value at the memory address formed by adding %rbx (base) and %rsi (index) into %rax.
mov %rax, [%rbx + %rsi]
Base-Indexed with Displacement Addressing
The effective address is calculated by adding the values of a base register, an index register, and a displacement (constant offset). For example, the following instruction moves the value at the memory address formed by adding %rbx (base), %rsi (index), and the displacement 10 into %rax.
mov %rax, [%rbx + %rsi + 10]
PC-Relative Addressing
The effective address is calculated relative to the current PC. For example, the following instruction makes the program jump 10 bytes ahead of the current instruction address.
jmp +10
Example Program: isPrime
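For reference, here is the guest code from the paper, reconstructed roughly (destination operand first, with the argument a arriving in %edi); the walkthrough below steps through it:
isPrime:   mov %ecx, %edi    ; %ecx = %edi (the argument a)
           mov %esi, $2      ; i = 2
           cmp %esi, %ecx    ; is i >= a?
           jge prime         ; if so, a is 2 (or less): report prime
nexti:     mov %eax, %ecx    ; %eax = a
           cdq               ; sign-extend %eax into %edx:%eax
           idiv %esi         ; divide %edx:%eax by i; remainder in %edx
           test %edx, %edx   ; is the remainder zero?
           jz notPrime       ; if so, a is divisible by i
           inc %esi          ; i++
           cmp %esi, %ecx    ; is i still less than a?
           jl nexti          ; if so, try the next divisor
prime:     mov %eax, $1      ; return 1 (prime)
           ret
notPrime:  xor %eax, %eax    ; return 0 (not prime)
           ret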
- isPrime: This initial block effectively handles the special case where the input number is 2.
- %ecx is set to %edi, where %edi holds the function's argument "a".
- %esi (the loop counter) is initialized to 2.
- %esi is compared to %ecx.
- If %esi is greater than or equal to %ecx (meaning the input is 2 or less), the program jumps to the prime label; otherwise, it proceeds.
- nexti:
- %eax is set to %ecx (the input number).
- cdq sign-extends %eax into %edx, preparing for signed division.
- idiv performs the division of %edx:%eax by %esi (the loop counter), with the remainder in %edx. The remainder will be negative for negative input.
- The remainder is checked for zero.
- If the remainder is zero (meaning %ecx is divisible by %esi), the program jumps to the notPrime label.
- %esi is incremented (loop counter).
- %esi is compared to %ecx.
- If %esi is less than %ecx, the program jumps back to the nexti label (loop continuation).
- prime:
- %eax is set to 1 (indicating prime).
- The stack is popped (return).
- notPrime:
- %eax is set to 0 (indicating not prime).
- The stack is popped (return).
Operating Systems
An operating system (OS) can be viewed as a layer of software that possesses privileged access to the CPU's hardware. The OS acts as an intermediary, orchestrating the execution of user programs. When a user program requires system resources or performs privileged operations, it makes a request to the OS. The OS, in turn, grants or denies these requests, ensuring system stability and security. Once the OS completes its task, it yields control back to the program.
Protection Ring
To enforce security and manage access to hardware resources, modern CPUs implement a system of privilege levels, commonly known as protection rings. These rings define the level of access that software has to the CPU and its peripherals. The most common arrangement is:
- Ring 0 (Kernel Mode): This is the highest privilege level, reserved for the OS kernel. The kernel, being the core of the OS, has unrestricted access to all CPU instructions and hardware resources.
- Ring 1 & 2 (Device Driver Mode): These rings are typically used for device drivers, which are software components that enable the OS to communicate with hardware devices. Using different rings for drivers allows for finer-grained control of driver access. Although these rings exist, they are rarely used in modern operating systems.
- Ring 3 (User Mode): This is the lowest privilege level, where user applications run. Applications in Ring 3 have limited access to hardware resources and are restricted from executing privileged instructions. Any attempt to execute a privileged instruction from Ring 3 results in a trap, which transfers control to the OS kernel.
The protection ring level in which the CPU currently operates is also called Current Privilege Level (CPL) and is encoded in the %cs register as described before.
Trap
When an application in CPL 3 attempts to execute a privileged instruction, a trap occurs. This trap is a hardware-generated interrupt that forces the CPU to switch to kernel mode (CPL 0) and execute a kernel routine, allowing the OS to handle the event and terminate the program if required.
System Calls
A system call, like a trap, also causes the CPU to switch to kernel mode. Internally, it is implemented as a trap to the kernel. The kernel then executes the requested operation on behalf of the user application.
User applications are often built using libraries, which provide pre-written code for common tasks. These libraries can be linked to the application in two ways:
- Static Linking: The library code is incorporated directly into the application's executable during compilation. This results in a larger executable but eliminates the need for external library files at runtime.
- Dynamic Linking: The library code is stored in separate files (e.g., .dll files in Windows, .so files in Linux) and loaded into memory at runtime. This allows for smaller executable files and enables multiple applications to share the same library code.
Most core OS functionalities, such as input/output and network access, are provided through dynamically linked kernel libraries. These libraries encapsulate system calls.
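As an illustrative sketch, on Linux x86-64 a library's write() wrapper ultimately boils down to something like the following (destination operand first; buf is a hypothetical label):
mov %rax, $1        ; system call number 1 = write
mov %rdi, $1        ; first argument: file descriptor 1 (stdout)
lea %rsi, [buf]     ; second argument: address of the data to write
mov %rdx, $14       ; third argument: number of bytes
syscall             ; trap from CPL 3 into the kernel (CPL 0)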
Virtual Memory
Virtual memory provides programs with an abstraction of physical memory, creating the illusion of a vast, contiguous address space.
Operating systems manage this virtual address space, translating virtual addresses to physical addresses. A virtual address can map to any location: physical memory, secondary storage (like a hard drive), or even to a non-existent location, as the virtual address space may exceed the available physical resources.
Paging
Direct byte-level mapping of virtual to physical addresses would require an impractically large translation table. To address this, memory is divided into fixed-size units called pages, typically 4 KB.
A page table is a data structure used for translating virtual page numbers (VPNs) to physical page numbers (PPNs) in main memory:
Page Table: VPN -> PPN
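For example, with 4 KB pages the low 12 bits of a virtual address are the offset within the page and the remaining bits form the VPN. A sketch of the split (destination operand first):
mov %rbx, %rax       ; copy the virtual address held in %rax
shr %rbx, $12        ; %rbx = virtual page number (VPN)
and %rax, $0xfff     ; %rax = byte offset within the 4 KB page
; the page table maps VPN -> PPN; the physical address is (PPN << 12) + offset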
Historically, operating systems used only fixed-size pages, which still resulted in large page tables. To reduce the overhead, the concept of superpages (or huge pages) was introduced. Superpages group multiple contiguous pages into a single, larger page, reducing the number of entries in the page table.
Each entry in the page table also contains some bits (apart from the physical address of the page):
- Present Bit: Indicates whether the corresponding page is currently present in physical memory. If this bit is not set, a page fault occurs, and the operating system must load the page from secondary storage.
- Protection Bit: Controls whether the page can be read from or written to. This allows the operating system to enforce memory protection.
- Mode Bit: Determines whether the page can be accessed by user-level programs or only by the operating system kernel.
- Dirty Bit: Indicates whether the page has been modified since it was loaded into memory. This is used by the operating system to determine whether a page needs to be written back to disk.
Storing Page Table
The page table itself is stored by the OS in memory. Its address is held in the Page Table Register (%cr3 on x86).
Swapping
Accessing a page within main memory is relatively fast. The CPU has a direct memory bus that facilitates rapid data transfer. When data is read from main memory, it's fetched and often cached in CPU cache lines.
Conversely, accessing a page located in secondary storage (like a hard drive or SSD) is significantly slower. The CPU does not have a direct connection to secondary storage. Therefore, a page must first be transferred from secondary storage to main memory before it can be accessed by the CPU.
When main memory becomes full (i.e., there are no free pages), the operating system employs swapping. Swapping involves moving inactive or less frequently used pages from main memory to secondary storage, freeing up space for active pages. This process is essentially the transfer of pages that were residing in main memory to available space within secondary storage. When swapped out pages are accessed, it results in a page fault which brings back the page into the memory.
The opposite of swapping is caching, where disk pages residing in secondary storage (where they belong) are brought into main memory. This can speed up reads and writes to files. Note that caching is different from demand paging, which brings back pages that are supposed to reside in main memory but were previously swapped out.
It's crucial to understand that the page table only contains entries for pages that are currently mapped to physical memory. Pages that exist only in secondary storage or have not yet been allocated will not have corresponding entries in the page table.
Memory Management Unit (MMU)
The Memory Management Unit (MMU) is a dedicated hardware component integrated into the CPU, playing a pivotal role in virtual memory management.
Translation Lookaside Buffer (TLB)
It might seem that every virtual address translation would require the OS to intervene, as it alone manages page table entries. This would entail a context switch on every memory access, causing significant performance overhead due to the expensive process of saving and restoring CPU state, and flushing CPU caches. To mitigate this, CPUs implement a hardware cache called the Translation Lookaside Buffer (TLB).
The TLB acts as a high-speed translation cache, storing recently used virtual-to-physical page mappings. When the CPU accesses a virtual address, the MMU first checks the TLB. If a matching translation is found (a TLB hit), the physical address is retrieved directly, bypassing the need for a page table lookup.
However, the TLB has a limited capacity, typically a few thousand entries at most. If the required translation is not found (a TLB miss), the translation must be fetched from the page table. In the simplest model, this raises a fault that traps to the OS; the OS then performs a page table walk to locate the physical address, updates the TLB with the new mapping, and resumes the process. (Modern x86 CPUs avoid this trap by walking the page table in hardware, as described below.)
Side Note: TLB, Caches & Context Switches
Upon a process's departure from the CPU, the TLB entries associated with that process must be invalidated (flushed). This is crucial to prevent subsequent processes from accessing the previous process's memory pages through stale TLB entries. On x86, reloading %cr3 flushes the non-global TLB entries, while the invlpg instruction invalidates the entry for a single page.
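A hedged sketch of both mechanisms (destination operand first):
invlpg [%rax]        ; invalidate the TLB entry for the page containing the address in %rax
mov %cr3, %rbx       ; loading a new page table base into %cr3 flushes the non-global TLB entries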
Because TLB entries are cleared, they must be repopulated as the new process executes. This TLB flushing and subsequent repopulation contribute significantly to the overhead of context switches between processes. On top of that, CPU caches also become ineffective after a context switch, further impacting performance.
Consequently, many operating systems employ CPU affinity, assigning processes to dedicated CPU cores, to minimize context switching and improve efficiency.
Hardware Walking Page Tables
In addition to OS-driven page table walks during page faults, modern MMUs often feature hardware walking capabilities. This means the MMU itself can automatically traverse the page table structure to locate missing translations.
The physical address of the current process's page table is stored in a special CPU register, the Page Table Register (PTR). On x86, this is register %cr3. This register allows the MMU to locate the active page table during address translation.
When a TLB miss occurs, the MMU's hardware logic takes over, fetching the necessary page table entries from memory. This process eliminates the overhead of a context switch to the kernel, significantly speeding up address translation. The MMU is designed to understand the page table structure, allowing it to efficiently navigate through multi-level page tables.
Virtualization
Virtualization essentially involves creating an illusion for a guest operating system, making it believe it's running directly on physical hardware. The goal is complete transparency, where the guest OS cannot distinguish its virtualized environment from a native one. This is achieved by emulating hardware behavior and maintaining consistency through shadow data structures.
Strawman
A basic virtualization approach utilizes a trap-and-emulate model. In this setup, the guest OS itself runs as a standard application within the host OS, executing in CPL 3. Consequently, any privileged instruction executed by the guest OS, such as I/O operations, triggers a trap, transferring control to the host OS. The host OS then emulates the requested operation on behalf of the guest.
However, I/O is just one aspect of virtualization. Memory management presents another challenge. The guest OS is allocated a limited memory space; being a process within the host, it really lives inside the host's virtual address space. This necessitates a two-tiered address translation:
- Applications within the guest use the guest OS's page tables to translate their virtual addresses into the guest's "physical" addresses.
- Those guest physical addresses must, in turn, be translated into the actual physical addresses of the machine.
While this strawman model provides a functional virtualization framework, it's inherently inefficient.
Full vs. OS Virtualization
Virtualization can be broadly categorized into:
Full virtualization
Full virtualization is complete virtualization of the actual hardware, allowing software environments, including a guest OS and its apps, to run unmodified. The guest OS, together with its applications, forms a virtual machine (VM) and remains unaware of its execution within a virtual environment. VMs operate on top of hypervisors, which can be categorized into two types:
- Type 1 hypervisors: Run directly on bare-metal hardware, such as ESX Server.
- Type 2 hypervisors: Run on top of another OS, such as VirtualBox.
Note: A hypervisor is also known as a Virtual Machine Monitor (VMM).
OS-level Virtualization
OS-level virtualization isolates a pool of resources within the operating system. This encompasses technologies such as containers (like Docker), jails (like FreeBSD jail), chroot, and virtual environments. The most popular among them today are containers.
For the purposes of this article, the term "virtualization" is used to denote full virtualization only.
Recommended Read: Paper Insights - Firecracker: Lightweight Virtualization for Serverless Applications for more on OS-level virtualization technique.
Classical Virtualization
VMM Characteristics
Popek and Goldberg defined three essential characteristics of a VMM for classical virtualization:
- Fidelity: The guest OS software must execute identically to its execution on the native hardware, requiring no modifications.
- Performance: A significant majority of guest OS instructions should execute directly on the CPU, minimizing traps.
- Safety: All hardware accesses must be mediated by the VMM.
Refining Strawman
Let's refine our basic virtualization model, still using the trap-and-emulate approach. Guest instructions execute in a deprivileged mode, and privileged instructions trigger traps, transferring control to the VMM for emulation.
Shadow Page Table
One key optimization involves the page table translation process. Instead of double translation, we introduce a shadow page table that directly maps the virtual address space of a guest process to machine (host) physical page numbers.
- Guest Page Table: VPN -> PPN
- Shadow Page Table: VPN -> MPN
The shadow page table is used for actual address translation during guest process execution, eliminating the overhead of double translations. The guest OS remains aware only of its own guest page table.
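For example (with hypothetical numbers), if the guest page table maps VPN 5 -> PPN 9, and the VMM has placed the guest's physical page 9 at machine page 42, the shadow page table holds the composed mapping VPN 5 -> MPN 42, which the hardware uses directly.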
Hidden Fault Handling
Whenever the guest OS modifies its page table entries (adding or removing mappings), the VMM may not immediately reflect these changes in the shadow page table. Accessing a page not present in the shadow page table results in a hidden fault. Upon a hidden fault, the VMM updates the shadow page table.
Tracing
Hidden faults effectively populate the shadow page table with missing entries. However, we must also address how to remove entries from the shadow page table when they are removed from the guest page table. Without this, a guest process might access invalid pages that have been reallocated, leading to potential security breaches.
The solution is to write-protect the guest page table. The guest page table is itself stored in physical pages known to the host, and those pages are marked read-only; any update to them traps to the host. This is how the VMM detects modifications to the guest page table and can then update the shadow page table accordingly. This page-protection method is known as tracing.
Performance
Despite its functionality, classical virtualization exhibits performance bottlenecks primarily arising from the trap-and-emulate mechanism and the overhead associated with tracing memory access operations.
Trap-and-Emulate is Non-Classical
- The guest OS can detect that it is not running in Ring 0 by reading the code segment register.
- Instructions like popf do not trap when executed outside Ring 0; attempted changes to privileged flags (such as the interrupt flag) are silently dropped, so the VMM never gets a chance to intervene.
These deviations violate the principles of fidelity and safety, undermining the transparency and controlled hardware access that classical virtualization aims to provide.
Software Virtualization
Software virtualization fundamentally relies on binary translation. Instead of direct CPU execution, guest instructions are intercepted and translated by the host into safe equivalents. This allows for features like virtualized flags (e.g., for popf) and a spoofed CS register.
Binary translation is not a novel concept; tools like Valgrind utilize it for dynamic bug detection.
However, this translation overhead can significantly impact performance if each guest instruction is translated individually (like an interpreter). To mitigate this, just-in-time (JIT) compilation is employed, similar to the JVM's JIT. This dynamic translation occurs at runtime, on demand, enabling optimizations.
Side Note: JIT Benefits
JIT compilation is a common compiler technique and offers performance enhancements. Notably, it can rearrange code blocks to improve instruction cache locality, reducing cache misses caused by frequent jumps in the original code.
Binary Translation
The software virtualization approach executes guest code in a less privileged level (e.g., CPL 1). This allows most guest instructions to execute natively without full translation, minimizing overhead.
The translation process operates on translation units (TUs), which are basic blocks of guest code. Each TU is translated into a compiled code fragment (CCF). TUs are typically limited in size (e.g., 12 instructions) and terminate with a jump or return instruction.
The authors demonstrate that translated instructions offer a performance advantage over the conventional trap-and-emulate approach. In a specific instance, a read timestamp counter (rdtsc) operation required 2030 cycles using trap-and-emulate, while the same operation completed in only 216 cycles with in-translation-cache (TC) emulation.
IDENT vs. NON-IDENT Instructions
- IDENT instructions: These instructions require no translation and can be executed directly. The majority of guest instructions fall into this category.
- NON-IDENT instructions: These instructions necessitate translation. Examples include:
- PC-relative addressing - This is because the translated code may not be in the same place as the original code.
- Control flow - Jump (e.g., jge, jmp) and return instructions.
- Privileged instructions - Such as popf.
- Non-privileged instructions - Such as load, store, etc. (adaptive).
Lazy Translation
As a consequence of just-in-time compilation, only executed code paths are translated. Unexecuted code remains untranslated.
Chaining Optimization
The translation process may alter instruction addresses, requiring adjustments to target addresses. If a jump target is determined to be the next sequential instruction, the jump is replaced with a no-op instruction, a technique known as chaining optimization.
Example
Let's break down each point:
- Line 4: The jump instruction's target address has been altered. While it was intended to point to a prime block, that block was never executed for the given input (49), hence its translation was skipped.
- Lines 5 & 10: The jump instruction jmp [fallthrAddr] was removed because the "fall-through" address was the very next instruction. This is chaining optimization.
- Line 14: The jump jmp [fallthrAddr3] was retained. However, since the prime block was never translated, the jump's target remains unresolved at this stage.
- Lines 16-19: The ret instruction is translated:
- Line 16: The pop instruction retrieves the return address from the stack into %r11.
- Line 17: The %rcx register is saved to a memory location. The address is calculated by adding a fixed displacement of 0xff39eb8 to %rip, interpreted as a relative address within the segment pointed to by the %gs register.
- Line 18: The lower 8 bits of the %r11 register (%r11b) are extracted and placed into the %ecx register, with the remaining bits of %ecx being set to zero.
- Line 19: The relative address 0xfc7dde0 within the %gs segment is probably a data structure (such as an array) used to look up where execution should continue for the return address. The index into the array comes from the %rcx register.
Running the Translator
Adaptive Binary Translation
Hardware Virtualization
Comparison
Turning to the paper's central objective: a comparative analysis of hardware and software virtualization techniques. At first glance, hardware virtualization might appear superior, particularly when VM exits are infrequent. However, the authors' experiments challenge this assumption, revealing several scenarios where binary translation in software virtualization outperforms hardware virtualization.
It's important to acknowledge a potential bias: as VMware developed binary translation, there's a natural inclination towards showcasing the strengths of software virtualization.
Experiment Results
The experimental design focused on scenarios triggering privileged operations, as compute-intensive benchmarks demonstrated near-native performance (Figure 2). Notably, the "mcf" application exhibited performance exceeding native execution, attributed by the authors to virtualization-induced improvements in TLB and cache access patterns.
Figure 3 illustrates that hardware VMMs occasionally outperform software VMMs. However, given the complexity of operating system interactions, attributing definitive conclusions is challenging. The authors' explanations for these variations appeared somewhat inconclusive.
In the fork/wait experiment, software virtualization completed 40,000 fork/wait operations in 36.9 seconds, while hardware virtualization required 106.4 seconds, representing a 4.4x latency increase.
Microbenchmark results further highlighted performance differences:
- syscall: Software virtualization incurred a significant performance penalty, adding approximately 2,000 cycles compared to hardware or native execution.
- in: Software virtualization excelled by translating the I/O port read instruction, resulting in rapid execution.
- cr8wr: Software virtualization again demonstrated superior performance by translating the privileged register write into a more efficient instruction sequence.
- call/ret: Binary translation introduced overhead, slowing down control flow due to the translation process.
- pgfault: Software virtualization outperformed hardware, as software-handled page faults proved less costly than VM run/exit cycles.
- divzero: Hardware virtualization exhibited superior performance, as the CPU-generated trap was handled more efficiently.
- ptemod: Software virtualization excelled in manipulating page table entries, as it avoided traps and translated the instruction into a more efficient variant.
Paper Review
To be entirely candid, my academic background in compilers and operating systems has not been my strongest area of study, and I found this particular paper to be exceptionally dense and difficult to navigate. In addition, my lack of practical experience with operating system kernels made it challenging to discern the precise mechanisms by which the various elements of an operating system interface with the VMM. It required a substantial amount of dedicated time to engage in thorough research and to fully comprehend the underlying concepts.
I believe that while this paper demands a significant time commitment, it presents a valuable opportunity to augment one's understanding of key concepts in compilers and operating systems.