
Paper Insights #3 - Memory Resource Management in VMware ESX Server

Released by VMware in 2002 and presented at OSDI '02, this paper details the memory management techniques employed by VMware ESX Server, a Type 1 hypervisor enabling full virtualization. I encountered this influential paper in Stanford's CS240 class. The paper discusses a number of beautiful techniques.

Paper Link

Recommended Read: Paper Insights - A Comparison of Software and Hardware Techniques for x86 Virtualization

Let's begin with some basic concepts on computer memory.

Process Memory

A Linux process operates within a complex memory space composed of several distinct regions. Let's examine each component individually.

Binary Region

A Linux process is built from various files, including the executable, object code, and shared libraries. These modules adhere to a common format known as the ELF (Executable and Linkable Format). On Windows, the equivalent format is Portable Executable.

The ELF format structures a module into key sections:

  • Header: Contains fundamental information about the file, such as its type, architecture, and entry point.
  • Program Header: Describes the segments of the program that should be loaded into memory during execution, including their permissions and locations.
  • Section Header: Provides a table detailing the various sections within the file, including their names, sizes, and offsets.
  • Sections: These contain the actual data and instructions of the program:
    • .text: The code segment, holding the executable instructions of the program.
    • .data: The data segment, storing the program's initialized global and static variables.
    • .rodata: The read-only data segment, typically used for storing string literals and constant values.
    • .bss: The Block Started by Symbol (BSS) segment, reserved for uninitialized global and static variables, which are set to zero by the kernel at program start.
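As a concrete illustration of the sections listed above, here is a small C sketch (with hypothetical names) annotated with the section each definition typically lands in; exact placement can vary with the compiler and flags.

#include <stdio.h>

const char *greeting = "hello";   /* the string literal lives in .rodata;
                                     the pointer itself is initialized data (.data) */
int counter = 42;                 /* initialized global   -> .data */
int scratch[1024];                /* uninitialized global -> .bss (zeroed at start) */

int main(void) {                  /* compiled instructions -> .text */
    printf("%s %d\n", greeting, counter);
    return scratch[0];
}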

Thread Stacks

Beyond the program's binary image, each thread within a process has its own stack, typically a few megabytes in size (for example, 8 MB by default with glibc on Linux, adjustable per thread). The stack is allocated when the thread is created and stores local variables and the execution state of function calls.
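As a small illustration, the sketch below creates a thread with an explicitly requested stack size using the POSIX threads API; the 2 MB figure is just an example.

#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg) {
    (void)arg;
    int local = 42;                       /* lives on this thread's stack */
    printf("local variable at %p\n", (void *)&local);
    return NULL;
}

int main(void) {
    pthread_attr_t attr;
    pthread_t tid;

    pthread_attr_init(&attr);
    /* Request a 2 MB stack instead of the platform default. */
    pthread_attr_setstacksize(&attr, 2 * 1024 * 1024);

    pthread_create(&tid, &attr, worker, NULL);
    pthread_join(tid, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}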

Heap Region

The heap is a large memory region dedicated to dynamic memory allocation. Objects created at runtime are allocated on the heap. Various memory allocators manage this space, with popular choices including the default PTMalloc, Google's TCMalloc, and Facebook's JEMalloc.

All allocators keep track of the available heap regions using various metadata.

PTMalloc

PTMalloc, the default memory allocator in the GNU C library (glibc), manages memory by dividing the heap into chunks of varying sizes. These chunks are organized into bins based on their size, allowing for efficient lookup.

When a program requests new memory for an object, PTMalloc searches these bins for a free chunk of the appropriate size. However, in a multi-threaded environment, multiple threads might try to allocate memory simultaneously. This can lead to contention, where threads compete for access to the bins, potentially slowing down memory allocation.

TCMalloc


TCMalloc organizes the heap into size classes backed by fixed-size runs of pages. Each thread maintains a lock-free, thread-local cache of free objects per size class, enabling allocation without contention. A central, lock-protected free list serves as a fallback when a thread exhausts its local cache, and a central page heap handles very large allocations.
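The sketch below is a highly simplified illustration of this fast-path/slow-path split, not TCMalloc's actual code: a thread-local free list is consulted without locking, and a lock-protected central list is the fallback.

#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct node { struct node *next; } node_t;

static __thread node_t *thread_cache;              /* per-thread, no lock needed */
static node_t *central_list;                       /* shared fallback            */
static pthread_mutex_t central_lock = PTHREAD_MUTEX_INITIALIZER;

void *alloc_object(size_t size) {
    if (thread_cache) {                            /* fast path: lock-free pop   */
        node_t *n = thread_cache;
        thread_cache = n->next;
        return n;
    }
    pthread_mutex_lock(&central_lock);             /* slow path: central list    */
    node_t *n = central_list;
    if (n) central_list = n->next;
    pthread_mutex_unlock(&central_lock);
    return n ? (void *)n : malloc(size);           /* last resort: grow the heap */
}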

JEMalloc

JEMalloc categorizes memory requests into three main sizes:

  • Small Objects: 8, 16, 32, 128 bytes, up to 512 bytes. Directly handled by the thread-local cache, allowing for fast allocation.
  • Large Objects: Ranging from 4 KB up to 4 MB. Leverages thread-local cache or arenas. Each arena is an independent memory pool, typically 4 MB in size, that manages its own set of memory chunks.
  • Huge Objects: Any allocation 4 MB or larger (e.g., 4 MB, 8 MB, 12 MB). These are allocated directly from the underlying operating system's free memory. JEMalloc uses a global red-black tree to track these huge allocations. 

MMap Memory

Not to be confused with Memory-Mapped I/O. 

mmap is a mechanism for mapping files from disk directly into a process's address space. This creates a memory region linked to the file. MMap regions can be:

  • Clean Region: Represents files that have only been read. These regions can be readily discarded from memory as their content is identical to the on-disk version and can be reloaded if needed.
  • Dirty Region: Contains data that has been written to within the mapped memory region. These changes are not immediately reflected on disk and require a sync operation to persist the modifications.
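As an illustration, the sketch below maps a hypothetical file with mmap(), dirties one page by writing to it, and then flushes the dirty data back to disk with msync(); error handling is omitted and the file is assumed to already exist and be non-empty.

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.bin", O_RDWR);            /* hypothetical file name        */
    struct stat st;
    fstat(fd, &st);

    /* Map the whole file; the mapped pages start out clean. */
    char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    memcpy(p, "hello", 5);                        /* the touched page is now dirty */
    msync(p, st.st_size, MS_SYNC);                /* write dirty pages back to disk */

    munmap(p, st.st_size);
    close(fd);
    return 0;
}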

Kernel Memory

Finally, each process is associated with a portion of kernel memory. This memory is managed by the operating system kernel on behalf of the process and often includes I/O buffers, such as network TCP/IP buffers, used for communication.

Paging

The system's memory, both primary (RAM) and secondary (disk), is organized into fixed-size blocks called pages. We can broadly categorize these pages based on their primary location and purpose:

  • Memory Pages: These are pages intended to reside in the main memory (RAM) for active use.
  • Disk Pages: These are pages primarily stored on secondary storage (disk), often corresponding to files or other persistent data.

Swapping

When the system's primary memory becomes low, a process called swapping occurs. During swapping, memory pages that are currently in RAM are moved to a dedicated area on the secondary storage called the swap space. This frees up physical RAM.

When the data on a swapped-out page is needed again, it is read back from the swap space into the main memory. This on-demand retrieval is known as demand paging.

Page Caching

Page caching is a distinct mechanism where disk pages (pages originating from files on disk) are read into main memory, either through mmap() as discussed above or through regular read()/write() system calls. The goal of page caching is to improve performance by keeping frequently accessed file data readily available in RAM, avoiding repeated, slower reads from disk.

Non-Anonymous v/s Anonymous Pages

Pages in memory can be further classified into two types:

  • Non-Anonymous Pages (File-Backed): These are the pages that constitute the page cache – they correspond to files on disk. Because their content is ultimately backed by a file, clean (unmodified) non-anonymous pages can be discarded from memory if space is needed, as their original state is preserved on disk. Dirty (modified) non-anonymous pages must be written back to their corresponding disk files before they can be evicted from RAM.

  • Anonymous Pages: All other pages in main memory that are not file-backed are considered anonymous. These pages typically hold data that doesn't have a direct persistent storage on disk, such as the heap, stack, and binary segments of processes. When the system runs out of memory, anonymous pages are the ones written to the swap space to free up RAM.

Private v/s Shared Memory

Processes in a system have a considerable amount of memory that is exclusively their own, allowing them unrestricted read and write access. However, a notable portion of memory is also shared across different processes. This shared memory primarily consists of commonly used:

  • ELF files, such as shared libraries.
  • Clean pages residing in the page cache.

Memory Accounting

Linux employs several metrics to track memory usage by processes.

Resident Set Size (RSS)

RSS represents the portion of a process's memory that is currently residing in physical RAM. It includes all memory pages belonging to the process that are resident and have not been swapped out to disk.

Includes:

  • All non-swapped binary, stack, and heap pages.
  • Disk pages currently held in the page cache.
  • Shared memory segments. Note: this can lead to double counting when multiple processes share the same memory.

Excludes:

Kernel memory used to support the process. This memory is managed from a central kernel memory pool and is not attributed to individual processes.

Proportional Set Size (PSS)

PSS is similar to RSS but accounts for shared memory more accurately. Instead of counting the entire shared memory segment for each process, PSS divides the size of each shared memory segment equally among all the processes sharing it. This provides a more realistic view of a process's "fair share" of memory usage.

Virtual Memory Size (VMS)

VMS encompasses the total virtual address space used by a process.

Includes:

All memory the process has access to, regardless of whether it's currently in RAM or on disk (swapped out). This includes the resident pages (counted in RSS) and the non-resident pages.

Excludes:

Kernel memory used to support the process.

Virtual Memory

Virtual memory provides programs with an abstraction of physical memory, creating the illusion of a vast, contiguous address space. A virtual address can map to any location: physical memory, secondary storage (like a hard drive), or even to a non-existent location, as the virtual address space may exceed the available physical resources. The virtual address space is also divided into pages of the same size as physical pages.

A page table is a data structure used for translating virtual page numbers (VPNs) to physical page numbers (PPNs) in main memory:

Page Table: VPN -> PPN

The Memory Management Unit (MMU) is a dedicated hardware component integrated into the CPU that plays a pivotal role in virtual memory management. It translates the virtual addresses issued by the CPU into physical addresses by walking the page table, with the translation lookaside buffer (TLB) caching recent translations to speed up the process.
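The toy sketch below shows the essence of the translation the MMU performs: split the virtual address into a VPN and an offset, look the VPN up in a (here, single-level) page table, and recombine. Real x86-64 hardware walks a multi-level radix tree and caches results in the TLB.

#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT 12                      /* 4 KB pages        */
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define NUM_VPNS   1024                    /* toy address space */

typedef struct {
    bool     present;
    uint64_t ppn;
} pte_t;

static pte_t page_table[NUM_VPNS];         /* VPN -> PPN        */

/* Returns the physical address, or -1 to signal a page fault. */
uint64_t translate(uint64_t vaddr) {
    uint64_t vpn    = vaddr >> PAGE_SHIFT;
    uint64_t offset = vaddr & (PAGE_SIZE - 1);
    if (vpn >= NUM_VPNS || !page_table[vpn].present)
        return (uint64_t)-1;               /* page fault        */
    return (page_table[vpn].ppn << PAGE_SHIFT) | offset;
}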

Memory Mapped I/O

Not to be confused with mmap.

Port-Mapped I/O was a common method for CPU communication with peripheral devices in earlier computing systems. This technique involved the CPU using specialized instructions to interact with dedicated I/O ports of the devices. For instance, reading data from a device might involve the following x86 assembly instructions:

mov $0x3F8, %dx
in %dx, %al

The instructions above load the I/O port address into %dx and then read a byte from that port into the %al register. Similarly, writing data to a device could be accomplished with instructions like:

mov $0x3FC, %dx
mov $0x01, %al
out %al, %dx

Port-Mapped I/O is less efficient. For example, transferring data from memory to a disk would require the CPU to read data from memory into its registers and then write that data out to the disk's I/O port. The CPU is directly involved in every step of the data transfer. The in and out instructions specifically target the I/O ports of the connected devices.

In contrast, Memory-Mapped I/O (MMIO) simplifies device interaction by mapping the physical registers and memory of peripheral devices into the kernel's virtual address space. 

Once this mapping is established, communication with these devices occurs through standard memory read and write operations. For example:

mov $0x01, %eax
mov %eax, 0xf0

Advantages:

  • A single address bus serves both memory and all I/O devices, eliminating the need for separate I/O ports and buses.
  • Devices can benefit from CPU and bus optimizations designed for memory access.

The kernel's virtual memory space allocated to each memory-mapped device can be logically divided into input regions (for data coming into the device), output regions (for data going out of the device), and control regions.
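In C (typically inside a driver), MMIO boils down to reads and writes through volatile pointers. The register layout and base address below are invented purely for illustration.

#include <stdint.h>

#define DEV_BASE   ((uintptr_t)0xF0000000u)                 /* hypothetical base */
#define REG_DATA   ((volatile uint32_t *)(DEV_BASE + 0x0))  /* input region      */
#define REG_STATUS ((volatile uint32_t *)(DEV_BASE + 0x4))  /* output region     */
#define REG_CTRL   ((volatile uint32_t *)(DEV_BASE + 0x8))  /* control region    */

void send_byte(uint8_t b) {
    while (*REG_STATUS & 0x1)        /* spin until the device is not busy */
        ;
    *REG_DATA = b;                   /* a plain store reaches the device  */
    *REG_CTRL = 0x1;                 /* kick off the operation            */
}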

Direct Memory Access (DMA)

Following the widespread adoption of Memory-Mapped I/O, DMA controllers became prevalent. DMA enables peripherals to directly access system RAM without constant CPU intervention. The CPU initiates a DMA transfer by providing the DMA controller with the physical address in RAM, the I/O device's input/output region, and the number of bytes to transfer. The DMA controller then handles the data transfer autonomously. It's important to note that traditional DMA controllers operate using physical addresses and do not inherently understand virtual addresses.
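The sketch below programs a hypothetical DMA controller through memory-mapped registers; the register layout is invented, but the three pieces of information handed over (physical source address, device region, byte count) follow the description above.

#include <stdint.h>

#define DMA_BASE  ((uintptr_t)0xF0010000u)                  /* hypothetical base    */
#define DMA_SRC   ((volatile uint32_t *)(DMA_BASE + 0x0))   /* physical RAM address */
#define DMA_DST   ((volatile uint32_t *)(DMA_BASE + 0x4))   /* device I/O region    */
#define DMA_LEN   ((volatile uint32_t *)(DMA_BASE + 0x8))   /* bytes to transfer    */
#define DMA_START ((volatile uint32_t *)(DMA_BASE + 0xC))

void dma_to_device(uint32_t phys_src, uint32_t dev_region, uint32_t nbytes) {
    *DMA_SRC   = phys_src;     /* must be a physical address, not a virtual one  */
    *DMA_DST   = dev_region;
    *DMA_LEN   = nbytes;
    *DMA_START = 1;            /* the controller copies the data without the CPU */
    /* Completion is typically signalled later via an interrupt (not shown). */
}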

Some older DMA controllers had limitations, such as only being able to address the first 4 GB of physical memory due to 32-bit addressing. In such cases, the CPU might need to re-copy data between this lower physical memory region and the actual memory locations used by user applications.

To overcome this limitation and enhance flexibility, Input/Output Memory Management Units (IOMMUs) were introduced. An IOMMU functions similarly to the CPU's MMU but performs address translation for I/O devices, allowing DMA controllers to work with virtual addresses.

Device Drivers

Device drivers are essentially software modules containing instructions that enable the operating system kernel to interact with specific I/O devices.

In the era of Port-Mapped I/O, device drivers often contained extensive low-level code to manage the intricacies of communicating through dedicated I/O ports. However, with the advent of Memory-Mapped I/O and DMA, device drivers have become significantly simpler. Their primary tasks involve:

  • Reading control and status bits from device registers (via MMIO).
  • Instructing the DMA controller to initiate data transfers between specified physical RAM addresses and the I/O device's input/output regions (address and number of bytes).

Device drivers are responsible for translating the virtual addresses used by applications into the physical addresses required by DMA controllers.

Note that in systems without a DMA controller, device drivers will directly perform data transfers to and from memory-mapped I/O regions using standard memory access instructions.

Also note that, user-space applications are generally restricted from directly accessing I/O devices. Instead, they interact with devices by reading and writing to buffers within the kernel's address space. The kernel then initiates the actual I/O operations to the device based on these buffer accesses.

Memory Management in ESX Server

ESX Server operates as a type-1 hypervisor (a.k.a. Virtual Machine Monitor), meaning it runs directly on the host hardware, controlling all hardware resources and mediating access for VMs. A critical function of this type-1 hypervisor is memory virtualization, which involves creating an abstract memory space for each VM.

Recap: Memory Virtualization

See Paper Insights - A Comparison of Software and Hardware Techniques for x86 Virtualization, where I discussed how virtualization works in detail.

Every process in a VM running on a type-1 hypervisor possesses its own distinct virtual address space. Within this space, the guest OS maintains a guest page table, which is responsible for translating VPNs into PPNs. From the perspective of the guest OS, these PPNs represent the actual physical addresses of memory.

Guest Page Table: VPN -> PPN

However, since the VM is running on top of a hypervisor, the PPNs understood by the guest OS are not the true physical addresses. They are, in fact, virtualized physical addresses. To access the actual hardware memory, these PPNs must be further translated into machine page numbers (MPNs), which correspond to the host's physical memory addresses.

To avoid the performance overhead of a double translation (VPN -> PPN -> MPN), the hypervisor maintains a shadow page table. The shadow page table directly maps the virtual address space of a guest process to machine (host) physical page numbers.

Shadow Page Table: VPN -> MPN

The guest OS remains oblivious to the existence and operation of the shadow page table; it continues to manage only its own guest page table.

In addition to the guest and shadow page tables, the hypervisor maintains a third data structure: the pmap. The pmap maps guest PPNs to MPNs.

pmap: PPN -> MPN

Hidden Page Fault

A crucial aspect of this system is how changes are handled. When the guest OS modifies its guest page table entries (e.g., adding or removing page mappings), the hypervisor may not immediately update the corresponding shadow page table entries. If a guest process attempts to access a page that is mapped in the guest page table but not yet reflected in the shadow page table, it results in a hidden fault. Upon detecting such a hidden fault, the hypervisor updates the shadow page table to include the necessary mapping.
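A rough sketch of how a hypervisor might service such a hidden fault, assuming simple array-based structures (all names here are invented): the missing shadow entry is filled in by composing the guest page table with the pmap.

#include <stdint.h>
#include <stdbool.h>

typedef struct { bool present; uint64_t ppn; } guest_pte_t;
typedef struct { bool present; uint64_t mpn; } pmap_entry_t;
typedef struct { bool present; uint64_t mpn; } shadow_pte_t;

extern guest_pte_t  guest_pt[];    /* VPN -> PPN, owned by the guest OS      */
extern pmap_entry_t pmap[];        /* PPN -> MPN, owned by the hypervisor    */
extern shadow_pte_t shadow_pt[];   /* VPN -> MPN, what the MMU actually uses */

/* Called when the hardware faults on a VPN missing from the shadow page table. */
bool handle_hidden_fault(uint64_t vpn) {
    if (!guest_pt[vpn].present)
        return false;                            /* a true guest page fault:
                                                    reflect it back to the guest OS */
    uint64_t ppn = guest_pt[vpn].ppn;
    shadow_pt[vpn].mpn     = pmap[ppn].mpn;      /* compose VPN -> PPN -> MPN      */
    shadow_pt[vpn].present = true;
    return true;                                 /* retry the faulting access      */
}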

With that background, let's turn our attention to the various strategies ESX Server uses for memory optimization.

1. Memory Overcommitment and Reclamation Strategies

Overcommitment is a widely adopted practice in cloud and cluster computing. It involves allocating more virtual resources to VMs than the physical resources available on the host machine. For example, a machine with 100 GB of RAM might allocate 50 GB to each of three VMs, resulting in a total commitment of 150 GB. This strategy is predicated on the assumption that not all VMs will simultaneously demand their full allocated memory capacity.

Furthermore, the memory limits assigned to each VM are typically soft limits. This means that VMs are not rigidly capped at their allocated memory unless the system experiences memory pressure and the VMs are exceeding their limits.

When the host machine's physical memory resources become constrained, pages from a VM's memory need to be reclaimed. This usually involves swapping these pages out to disk. In a bare-metal OS, the OS itself has knowledge of its memory usage and can intelligently select the "best" pages to swap out, often employing policies like LRU (Least Recently Used).

However, a significant challenge arises because ESX Server is a type-1 hypervisor. The VMs are opaque to the hypervisor; it cannot "see" inside a VM to determine which pages are truly useful or recently accessed by the guest OS and its applications. This lack of insight makes it difficult for the hypervisor to make informed decisions about which pages to reclaim.

Ballooning

Ballooning is a clever technique specifically designed to address the opacity issue. A balloon driver is installed in each guest OS. This balloon driver communicates with the hypervisor through an appropriate channel (e.g., inter-process communication or hypercalls).

When the hypervisor determines that memory needs to be reclaimed, it instructs the balloon driver within a guest VM to inflate, i.e., to allocate and pin pages inside the guest. The resulting memory pressure compels the guest OS to page out what it considers its least valuable pages to its own swap space. Crucially, the machine pages backing the pinned pages are then returned to the hypervisor, making that physical memory available for allocation to other VMs.


This technique introduces an element of paravirtualization. Paravirtualization occurs when the guest OS is aware of the hypervisor's presence. The balloon driver essentially paravirtualizes the guest setup.

Simply having the balloon driver pin pages is not enough. To ensure that the guest OS truly cannot use the reclaimed pages, the hypervisor takes an additional step: it invalidates the corresponding entries in the shadow page table. If the guest then attempts to access a reclaimed page, the resulting fault is intercepted by the hypervisor, which can back the access with a fresh page (e.g., a zeroed page) instead of the machine page that was reclaimed.
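The sketch below shows the inflate path of a balloon driver inside the guest, under heavy assumptions: guest_alloc_pinned_page(), guest_virt_to_ppn(), and hypervisor_release_page() are stand-ins for whatever allocation and hypercall/backdoor interfaces a real driver would use.

#include <stddef.h>
#include <stdint.h>

extern void    *guest_alloc_pinned_page(void);         /* assumed guest allocator */
extern uint64_t guest_virt_to_ppn(void *page);         /* assumed helper          */
extern void     hypervisor_release_page(uint64_t ppn); /* assumed hypercall       */

void balloon_inflate(size_t npages) {
    for (size_t i = 0; i < npages; i++) {
        /* Allocating pinned pages raises memory pressure inside the guest,
         * so the guest OS itself pages out whatever it considers cold. */
        void *page = guest_alloc_pinned_page();
        if (!page)
            break;
        /* Tell the hypervisor this PPN's backing machine page can be reused. */
        hypervisor_release_page(guest_virt_to_ppn(page));
    }
}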

Random Demand Paging

Even after ballooning has been employed, there might be scenarios where pages still need to be forced out of memory. For example, the balloon driver might be unavailable, disabled, or too slow; the hypervisor has limited control over what the guest does, and a guest can opt out of paravirtualization entirely (for example, for security reasons).

In such situations, ESX Server falls back to paging on its own, writing a VM's pages out to its swap space and bringing them back in on demand. However, ESX Server does not use an LRU policy to pick the victim pages.

The reason for avoiding LRU is the double-paging problem. If both the hypervisor and the guest OS use LRU, a page the hypervisor deems cold and swaps out is likely to soon be deemed cold by the guest OS as well. When the guest then tries to page it out to its own swap space, the hypervisor must first fault the page back in from hypervisor swap, only for the guest to immediately write it out again, wasting I/O and memory. To avoid this pathological interaction, ESX Server selects pages at random when it performs paging itself.

Evaluation

The paper's presentation of its evaluation is somewhat unconventional. Instead of a dedicated, consolidated section for evaluating all techniques, it discusses the evaluation of each technique immediately after its technical explanation.

Figure 2 in the paper provides a comparative analysis. On the x-axis, it represents different VM sizes. The gray bars illustrate performance when ballooning is actively reducing the VM's memory footprint to the indicated size. The black bars, in contrast, depict performance when a hard memory limit is set to that specific size.

The evaluation indicates that performance with ballooning remains notably close to raw (native) performance. A slight overhead observed is attributed primarily to the CPU usage involved in the ballooning process itself. Additionally, the paper mentions that there are some memory overheads associated with the larger data structures that ESX Server maintains for memory management.

2. Content-Based Page Sharing

The second technique is sharing pages across VMs when their contents are byte-for-byte identical. This commonly happens when multiple VMs run the same guest OS, applications, or shared library code.

The system regularly scans pages across different VMs, looking for matches. It hashes each page for an initial comparison, then performs a full comparison to prevent hash collisions from causing issues.

If a page isn't found to be shared, it gets a hint frame with the following fields:
  • hash: The hash of the page content.
  • MPN: The machine page number.
  • VM: The VM it belongs to.
  • PPN: The physical page number known to the VM.
A page with only a hint frame can still be modified freely by its VM. Once the scan finds another page with matching content (confirmed by a full comparison), the hint frame is converted to a shared frame with a reference count. Additionally, the page is marked copy-on-write (write-protected) in the shadow page tables of all VMs sharing it. As a result, if a VM writes to the page, the page is copied to a fresh machine page, the reference count is decremented, and that VM's shadow page table is updated to map the VPN to the new private MPN.
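A simplified sketch of the scanning logic described above; the frame table and the helpers declared extern (hashing, lookup, copy-on-write remapping) are placeholders rather than ESX Server's real interfaces.

#include <stdint.h>
#include <stdbool.h>

typedef struct frame {
    uint64_t hash;
    uint64_t mpn;
    int      refcount;    /* 0 => hint frame, >0 => shared frame */
    int      vm_id;       /* recorded for hint frames            */
    uint64_t ppn;         /* recorded for hint frames            */
} frame_t;

extern uint64_t hash_page(uint64_t mpn);
extern frame_t *lookup_frame(uint64_t hash);
extern void     insert_hint(uint64_t hash, int vm, uint64_t ppn, uint64_t mpn);
extern bool     pages_identical(uint64_t mpn_a, uint64_t mpn_b);
extern void     map_shared_cow(int vm, uint64_t ppn, uint64_t shared_mpn);
extern void     reclaim_mpn(uint64_t mpn);

void scan_page(int vm, uint64_t ppn, uint64_t mpn) {
    uint64_t h = hash_page(mpn);
    frame_t *f = lookup_frame(h);
    if (f == NULL) {                        /* first sighting: record a hint */
        insert_hint(h, vm, ppn, mpn);
        return;
    }
    if (!pages_identical(mpn, f->mpn))      /* guard against hash collisions */
        return;
    if (f->refcount == 0) {                 /* promote hint frame -> shared  */
        map_shared_cow(f->vm_id, f->ppn, f->mpn);
        f->refcount = 1;
    }
    map_shared_cow(vm, ppn, f->mpn);        /* write-protect in this VM too  */
    f->refcount++;
    reclaim_mpn(mpn);                       /* the duplicate copy is freed   */
}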


Evaluation

When running identical VMs, Figure 4 shows that about 60% of memory is shared. There is a gap between the shared and reclaimed lines because at least one copy of each shared page must remain resident. The absolute gap between shared and reclaimed stays roughly constant; the relative gap is large with two VMs but shrinks as the number of VMs sharing each page increases, demonstrating the effectiveness of sharing.

Even with one VM, there's a 5 MB memory saving from shared zero pages (pages with entirely zero content).

In a real-world scenario, page sharing reclaimed about one-third of all RAM in use. This is very valuable: cloud providers can pack more VMs onto the same hardware, at the cost of a slight CPU overhead for scanning and hashing pages.

Considerations

A question that comes up is how this impacts cache performance and interacts with huge pages. Contiguous virtual page numbers should ideally be assigned contiguous machine page numbers. However, with shared pages, the machine page numbers might be random. This could affect L2 and L3 cache performance.

3. Proportional Memory Distribution

This technique addresses how the ESX Server efficiently distributes memory among various VMs.

The Problem of Memory Distribution

A straightforward approach to memory management might involve simply allowing oversubscription and allocating memory proportionally to what each VM requests. This is typically achieved through share-based allocation, where each VM is assigned a number of shares, and memory is distributed accordingly.

The paper introduces a refined metric: the share-per-page ratio. This ratio essentially indicates how many shares a VM is willing to "burn" for each page of memory it's allocated. The core idea is to incentivize active memory utilization. If a page is actively used, fewer shares are paid for it. Conversely, if a page is not actively used (idle), it costs more shares. This mechanism encourages VMs to actively utilize the memory they've been allocated.

Once the adjusted share-per-page ratio (ρ) is computed for each VM, reclamation follows a min-funding policy: when memory is needed, it is revoked from the VM with the lowest ρ, i.e., the VM paying the fewest shares for its pages.

Calculating Share-Per-Page (ρ)

The share-per-page ratio (ρ) for a VM is calculated using the following formula:

ρ = S / (P * (f + k * (1 - f)))

S: Total number of shares assigned to the VM.
P: Total number of pages allocated to the VM without idle page adjustment.
f: The fraction of pages that are actively used.

Here, k represents a price constant for idle pages. You can think of k as a multiplier that increases the cost in shares for keeping idle pages. It's expressed as:

k = 1 / (1 − τ)

where τ is the tax rate.
  • If τ = 0, then k = 1. In this scenario, idle pages and used pages have the same price in shares.
  • If τ = 0.99, then k = 100. This means idle pages are 100x more expensive in terms of shares.
With a high tax rate, like τ = 0.99, a VM with a significant amount of idle memory ends up with a very low adjusted share-per-page ratio, making it the first candidate for reclamation and leading to a much lower RAM allocation. This can trigger ballooning much faster, reclaiming those idle pages.
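A small worked example of the formula (with made-up share and page counts) shows the effect: with τ = 0.75, a mostly idle VM ends up with a much lower ρ than a busy VM holding the same shares, so memory is revoked from it first.

#include <stdio.h>

/* rho = S / (P * (f + k * (1 - f))),  k = 1 / (1 - tau) */
static double rho(double S, double P, double f, double tau) {
    double k = 1.0 / (1.0 - tau);
    return S / (P * (f + k * (1.0 - f)));
}

int main(void) {
    double tau = 0.75;                               /* k = 4            */
    double busy = rho(1000.0, 256.0, 1.00, tau);     /* all pages active */
    double idle = rho(1000.0, 256.0, 0.20, tau);     /* 80% idle pages   */
    printf("busy VM: rho = %.3f shares/page\n", busy);   /* ~3.906 */
    printf("idle VM: rho = %.3f shares/page\n", idle);   /* ~1.149 */
    /* Memory is revoked from the VM with the lowest rho -- here the idle
     * VM -- until the memory pressure is relieved. */
    return 0;
}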

Determining the Fraction of Idle Pages (f)

The last crucial step is to determine f, the fraction of idle pages for a VM. This is done through statistical sampling: the system randomly picks N pages, removes their entries from the shadow page table, and then monitors for faults upon access.

The system maintains three estimations for f:
  • A slow exponentially weighted moving average of the ratio of faults to samples (t/n) over many samples.
  • A faster weighted average that adapts more quickly to changes.
  • Another version of the faster average that incorporates samples from the current period.
The system then takes the maximum of these three estimations. Taking the maximum incorporates hysteresis during downscaling: upscaling (increasing memory allocation) is rapid, while downscaling (decreasing memory allocation) is slower and more cautious.
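A sketch of the estimation step, with invented decay constants: each completed sampling period contributes the observed touch ratio t/n to a slow and a fast moving average, the still-open period contributes its current ratio, and the maximum of the three is used.

#include <stddef.h>

typedef struct {
    double slow;    /* slow exponentially weighted moving average */
    double fast;    /* faster-adapting moving average             */
} idle_est_t;

static double ewma(double old, double sample, double alpha) {
    return alpha * sample + (1.0 - alpha) * old;
}

/* t/n: touched vs. sampled pages in the period that just ended;
 * t_cur/n_cur: progress of the current, still-open period. */
double active_fraction(idle_est_t *e, int t, int n, int t_cur, int n_cur) {
    double sample = (double)t / (double)n;
    e->slow = ewma(e->slow, sample, 0.05);   /* decay constants are made up */
    e->fast = ewma(e->fast, sample, 0.30);
    double cur = n_cur ? (double)t_cur / (double)n_cur : 0.0;

    /* Take the max: react quickly when activity rises, slowly when it
     * falls, which biases against reclaiming memory too eagerly. */
    double f = e->slow;
    if (e->fast > f) f = e->fast;
    if (cur > f)     f = cur;
    return f;
}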

Evaluation

Figure 7 in the paper illustrates the impact of this technique:
  • No Taxes (τ = 0): When no taxes are applied, each VM is allocated a similar amount of memory, even if one VM is idle and not processing anything.
  • Tax (τ = 0.75, so k = 4): With a tax rate of 0.75, idle pages become four times more expensive than used pages. In this scenario, VM2, running at nearly 100% utilization, sees its allocation increase, while VM1, which holds idle pages, sees its allocation decrease because it is effectively paying a higher price per page. Given the same number of shares, VM1 exhausts its shares on fewer pages.
The evaluation shows that with a tax rate of 0.75, the ratio of memory allocation between a busy VM and an idle VM can become 2.5:1, instead of the 1:1 ratio observed without taxes. This highlights the effectiveness of proportional memory allocation in penalizing idle memory and reallocating it to actively used VMs.

4. Adhering to the User Limits

Beyond proportional memory distribution using shares, it's crucial to respect the min and max RAM limits set by the user for each VM.

Even if the calculated page allocation (based on shares and share-per-page ratio) is very small—for example, due to all pages being idle—the minimum RAM guarantee will still be honored. Conversely, the allocated memory will not exceed the maximum RAM limit.

The specified minimum for a VM actually includes a 32 MB overhead for various data structures like the pmap, shadow page table, and frame buffer. Since the ESX server only guarantees minimum memory availability, any memory between the maximum and minimum limits can be paged out to disk. This means a corresponding amount of swap space needs to be reserved for the difference (max - min).

Min Memory and Reclamation

It's worth shedding some light on how min memory works with reclamation. The min setting specifies the minimum amount of memory a VM gets. No pages below that will be reclaimed. Pages above the min limit may be reclaimed as follows:
  • First, ballooning will kick in to reclaim pages. The guest OS will be "fooled" into believing that it is using too much memory (almost close to its max), and it will itself trigger demand paging.
  • If ballooning doesn't work, then the hypervisor will itself page out the pages.

Page Sharing and Memory Accounting

ESX Server appears to follow the RSS definition here: shared pages are a transparent optimization, and the memory accounted to each VM still includes the pages it shares.

Memory Pressure States

The system operates under different memory pressure states, which dictate its response to memory availability:
  • High (6% free): Plenty of free memory; no constraints are applied.
  • Soft (4% free): Ballooning is initiated to reclaim pages and deflate VMs that are operating above their minimum memory limits. Again, note that only the minimum limit is strictly honored.
  • Hard (2% free): The system begins paging out memory to disk.
  • Low (1% free): In addition to paging, the system blocks the execution of VMs that are above their target allocations.
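A minimal sketch mapping free-memory fractions to the actions above; the thresholds match the listed percentages, but the real system adds hysteresis around state transitions, which is omitted here.

typedef enum {
    ACTION_NONE,        /* high: plenty of free memory   */
    ACTION_BALLOON,     /* soft: start ballooning        */
    ACTION_PAGE_OUT,    /* hard: hypervisor-level paging */
    ACTION_BLOCK        /* low: block offending VMs      */
} action_t;

action_t pick_action(double free_fraction) {
    if (free_fraction >= 0.06) return ACTION_NONE;      /* high state */
    if (free_fraction >= 0.04) return ACTION_BALLOON;   /* soft state */
    if (free_fraction >= 0.02) return ACTION_PAGE_OUT;  /* hard state */
    return ACTION_BLOCK;                                /* low state  */
}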

Evaluation

An evaluation was conducted using five VMs, each running Windows with an application. The maximum memory for these VMs was set to 256 MB×3 and 320 MB×2. The minimum was set to half of the maximum. The total maximum memory, including the 32 MB×5 overheads, amounted to 1462 MB. Memory overcommitment was at 60%, and shares were distributed proportionally to the maximum memory.

All techniques—ballooning, proportional share allocation, demand paging, and page sharing—were enabled. Figure 8 illustrates the complex interactions when these techniques are combined. Here's a high-level overview of the observations:
  • Startup: Upon boot, Windows accesses all pages to zero them out. This causes all VMs to initially access their maximum memory limits. Since many pages are zeroed, this also creates opportunities for page sharing, which helps keep overall memory usage in check.
  • Post-Startup: After the initial startup phase, ballooning activates to reclaim pages allocated during start-up. This is necessary because otherwise, VMs would continue to use up to their maximum allocated limits. Page sharing becomes less effective after startup as more pages contain non-zero data.
  • SQL Server Behavior: When Microsoft SQL Server is idle, it is still allocated its minimum memory limit. However, when it processes a large query, many of its pages become active. At this point, proportional share allocation ensures it receives more pages.

5. I/O Page Remapping

As discussed earlier, memory-mapped I/O operations occur in the lower memory regions of a machine, while an application's allocated machine pages might reside in higher memory ranges. This often necessitates the use of bounce buffers to copy I/O bytes from the lower, I/O-mapped regions to the application's higher memory ranges.

The ESX server employs a clever technique to avoid this extra copy. It tracks the guest pages that are frequently copied into these low, I/O-mapped memory regions. Once identified, the ESX server remaps those pages directly into the low memory regions. This direct mapping eliminates the need for bounce buffers and the associated copying overhead.

Naturally, not all VMs can be granted access to these lower memory regions.

Paper Review

This paper is a fascinating read, showcasing several innovative ideas and illustrating their practical implementation within the ESX server. It stands out as a great experience paper from VMware and is well worth the time of operating systems enthusiasts.
