Released by VMware in 2002 and presented at OSDI '02 (whose proceedings appeared in ACM SIGOPS Operating Systems Review), this paper by Carl Waldspurger details the memory management techniques employed by VMware ESX Server, a Type 1 hypervisor enabling full virtualization. I encountered this influential paper in Stanford's CS240 class. It is packed with beautiful techniques.
Recommended Read: Paper Insights - A Comparison of Software and Hardware Techniques for x86 Virtualization
Let's begin with some basic concepts on computer memory.
Process Memory
A Linux process operates within a complex memory space composed of several distinct regions. Let's examine each component individually.
Binary Region
A Linux process is built from various files, including the executable itself, object code, and shared libraries. These modules adhere to a common format known as ELF (Executable and Linkable Format). On Windows, the equivalent format is the Portable Executable (PE).
The ELF format structures a module into key sections:
- Header: Contains fundamental information about the file, such as its type, architecture, and entry point.
- Program Header: Describes the segments of the program that should be loaded into memory during execution, including their permissions and locations.
- Section Header: Provides a table detailing the various sections within the file, including their names, sizes, and offsets.
- Sections: These contain the actual data and instructions of the program:
  - .text: The code segment, holding the executable instructions of the program.
  - .data: The data segment, storing the program's initialized global and static variables.
  - .rodata: The read-only data segment, typically used for storing string literals and constant values.
  - .bss: The Block Started by Symbol (BSS) segment, reserved for uninitialized global and static variables, which are set to zero by the kernel at program start.
Thread Stacks
Beyond the program's binary image, each thread within a process has its own stack, typically a few megabytes in size (commonly 8 MB by default with glibc on Linux). The stack is allocated upon thread creation. It serves as storage for local variables and maintains the execution state (return addresses, saved registers) during function calls.
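As a quick illustration, the stack size can be chosen at thread creation time via the pthreads API; the 2 MB figure below is arbitrary:

```c
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg) {
    (void)arg;
    char buffer[4096]; // local variables live on this thread's stack
    snprintf(buffer, sizeof(buffer), "hello from a worker thread");
    puts(buffer);
    return NULL;
}

int main(void) {
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 2 * 1024 * 1024); // 2 MB stack

    pthread_t tid;
    pthread_create(&tid, &attr, worker, NULL);
    pthread_join(tid, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}
```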
Heap Region
The heap is a large memory region dedicated to dynamic memory allocation. Objects created at runtime are allocated on the heap. Various memory allocators manage this space, with popular choices including glibc's default PTMalloc, Google's TCMalloc, and Facebook's JEMalloc.
All allocators keep track of the available heap regions using various metadata.
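At its simplest, this metadata is a free list. Here is a toy sketch of first-fit allocation over a singly linked free list; none of the production allocators above is actually implemented this way:

```c
#include <stddef.h>

// Toy free-list metadata: each free chunk records its size and the
// next free chunk. Real allocators (PTMalloc, TCMalloc, JEMalloc)
// use far more sophisticated structures.
struct free_chunk {
    size_t size;             // usable bytes in this chunk
    struct free_chunk *next; // next free chunk, or NULL
};

static struct free_chunk *free_list = NULL;

// First-fit allocation: walk the list for the first chunk big enough.
static struct free_chunk *alloc_chunk(size_t size) {
    struct free_chunk **prev = &free_list;
    for (struct free_chunk *c = free_list; c != NULL; c = c->next) {
        if (c->size >= size) {
            *prev = c->next; // unlink the chunk from the free list
            return c;
        }
        prev = &c->next;
    }
    return NULL; // no fit; a real allocator would grow the heap (sbrk/mmap)
}
```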
PTMalloc
PTMalloc, glibc's allocator derived from Doug Lea's dlmalloc, splits the heap into chunks, each carrying a small header, and tracks free chunks in size-segregated bins (free lists). Multiple arenas reduce lock contention between threads.
TCMalloc
TCMalloc also organizes the heap into runs of pages (spans) carved into objects of fixed size classes. Each thread maintains a thread-local cache of free objects per size class, requiring no locks for efficient object allocation. A central, lock-protected free list serves as a fallback for threads that exhaust their thread-local cache. Additionally, a central page heap handles allocations of very large objects.
JEMalloc
JEMalloc categorizes memory requests into three main sizes:
- Small Objects: From 8 bytes up to 512 bytes (e.g., 8, 16, 32, 128). Handled directly by the thread-local cache, allowing for fast allocation.
- Large Objects: Ranging from 4 KB up to 4 MB. Leverages thread-local cache or arenas. Each arena is an independent memory pool, typically 4 MB in size, that manages its own set of memory chunks.
- Huge Objects: Any allocation 4 MB or larger (e.g., 4 MB, 8 MB, 12 MB). These are allocated directly from the underlying operating system's free memory. JEMalloc uses a global red-black tree to track these huge allocations.
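As a rough sketch of the routing logic, using the simplified boundaries from the list above (real JEMalloc size classes are finer-grained):

```c
#include <stddef.h>

enum size_class { SMALL, LARGE, HUGE };

// Toy classification following the simplified boundaries above.
static enum size_class classify(size_t n) {
    if (n <= 512)             return SMALL; // thread-local cache
    if (n < 4u * 1024 * 1024) return LARGE; // arena-managed
    return HUGE;                            // allocated straight from the OS
}
```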
MMap Memory
Not to be confused with Memory-Mapped I/O.
mmap is a mechanism for mapping files from disk directly into a process's address space. This creates a memory region linked to the file (a short usage example follows the list below). MMap regions can be:
- Clean Region: Represents files that have only been read. These regions can be readily discarded from memory as their content is identical to the on-disk version and can be reloaded if needed.
- Dirty Region: Contains data that has been written to within the mapped memory region. These changes are not immediately reflected on disk and require a sync operation to persist the modifications.
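Here is a minimal sketch of the clean-to-dirty lifecycle; the file name data.bin is just an example. Writing through the mapping dirties a page, and msync() persists it back to disk:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.bin", O_RDWR); // example file; must exist
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    // Map the file into the address space; pages start out clean.
    char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    p[0] = 'X';                    // the first page is now dirty
    msync(p, st.st_size, MS_SYNC); // write dirty pages back to disk

    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```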
Kernel Memory
Finally, each process is associated with a portion of kernel memory. This memory is managed by the operating system kernel on behalf of the process and often includes I/O buffers, such as network TCP/IP buffers, used for communication.
Paging
The system's memory, both primary (RAM) and secondary (disk), is organized into fixed-size blocks called pages. We can broadly categorize these pages based on their primary location and purpose:
- Memory Pages: These are pages intended to reside in the main memory (RAM) for active use.
- Disk Pages: These are pages primarily stored on secondary storage (disk), often corresponding to files or other persistent data.
Swapping
When the system's primary memory runs low, a mechanism called swapping kicks in. During swapping, memory pages that are currently in RAM are moved to a dedicated area on secondary storage called the swap space. This frees up physical RAM.
When the data on a swapped-out page is needed again, it is read back from the swap space into the main memory. This on-demand retrieval is known as demand paging.
Page Caching
Page caching is a distinct mechanism whereby disk pages (pages originating from files on disk) are read into main memory, often via mmap() discussed above or via ordinary read() calls. The goal of page caching is to improve performance by keeping frequently accessed file data readily available in RAM, avoiding the slower process of repeatedly reading from disk.
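On Linux, mincore() reports which pages of a mapping are currently resident in RAM, which makes the page cache observable. A sketch, again using an example file name:

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.bin", O_RDONLY); // example file
    struct stat st;
    fstat(fd, &st);

    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);

    long page = sysconf(_SC_PAGESIZE);
    size_t npages = (st.st_size + page - 1) / page;
    unsigned char *vec = malloc(npages);

    // vec[i] & 1 is set if page i of the file is in the page cache.
    mincore(p, st.st_size, vec);

    size_t resident = 0;
    for (size_t i = 0; i < npages; i++)
        resident += vec[i] & 1;
    printf("%zu of %zu pages resident\n", resident, npages);

    free(vec);
    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```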
Non-Anonymous v/s Anonymous Pages
Pages in memory can be further classified into two types:
- Non-Anonymous Pages (File-Backed): These are the pages that constitute the page cache – they correspond to files on disk. Because their content is ultimately backed by a file, clean (unmodified) non-anonymous pages can be discarded from memory if space is needed, as their original state is preserved on disk. Dirty (modified) non-anonymous pages must be written back to their corresponding disk files before they can be evicted from RAM.
- Anonymous Pages: All other pages in main memory, not backed by any file, are anonymous. These pages typically hold data that has no persistent home on disk, such as the heap and thread stacks of processes (program code, by contrast, is file-backed by the executable's ELF image). When the system runs out of memory, anonymous pages are the ones written to the swap space to free up RAM.
Private v/s Shared Memory
Processes in a system have a considerable amount of memory that is exclusively their own, allowing them unrestricted read and write access. However, a notable portion of memory is also shared across different processes. This shared memory primarily consists of commonly used:
- ELF files, such as shared libraries.
- Clean pages residing in the page cache.
Memory Accounting
Linux employs several metrics to track memory usage by processes.
Resident Set Size (RSS)
RSS represents the portion of a process's memory that is currently residing in physical RAM. It includes all memory pages belonging to the process that are resident and have not been swapped out to disk.
Includes:
- All non-swapped binary, stack, and heap pages.
- File-backed (page cache) pages that are currently mapped into the process.
- Shared memory segments. Note: this can lead to double counting when multiple processes share the same memory.
Excludes:
Kernel memory used to support the process. This memory is managed from a central kernel memory pool and is not attributed to individual processes.
Proportional Set Size (PSS)
PSS is similar to RSS but accounts for shared memory more accurately. Instead of counting the entire shared memory segment for each process, PSS divides the size of each shared memory segment equally among all the processes sharing it. This provides a more realistic view of a process's "fair share" of memory usage.
Virtual Memory Size (VMS)
VM Size encompasses the total virtual address space used by a process.
Includes:
All memory the process has access to, regardless of whether it's currently in RAM or on disk (swapped out). This includes the resident pages (counted in RSS) and the non-resident pages.
Excludes:
Kernel memory used to support the process.
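On Linux, these metrics are exposed under /proc: VmRSS and VmSize appear in /proc/<pid>/status, and Pss in /proc/<pid>/smaps_rollup. A quick sketch that prints the calling process's own numbers:

```c
#include <stdio.h>
#include <string.h>

// Print RSS and virtual size for the calling process by scanning
// /proc/self/status. PSS is available in /proc/self/smaps_rollup.
int main(void) {
    FILE *f = fopen("/proc/self/status", "r");
    if (!f) { perror("fopen"); return 1; }

    char line[256];
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "VmRSS:", 6) == 0 ||
            strncmp(line, "VmSize:", 7) == 0)
            fputs(line, stdout); // values are reported in kB
    }
    fclose(f);
    return 0;
}
```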
Virtual Memory
Virtual memory provides programs with an abstraction of physical memory, creating the illusion of a vast, contiguous address space. A virtual address can map to any location: physical memory, secondary storage (like a hard drive), or even a non-existent location, as the virtual address space may exceed the available physical resources. The virtual address space is likewise divided into pages of the same size as the physical pages.
A page table is a data structure used for translating virtual page numbers (VPNs) to physical page numbers (PPNs) in main memory:
Page Table: VPN -> PPN
The Memory Management Unit (MMU) is a dedicated hardware component integrated into the CPU that plays a pivotal role in virtual memory management. It translates the virtual addresses issued by the CPU into physical addresses, walking the page table and consulting the translation lookaside buffer (TLB), a cache of recent translations, to speed things up.
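A toy single-level translation in code (real x86-64 MMUs walk a four-level table and cache results in the TLB; the sketch assumes 4 KB pages and 32-bit virtual addresses):

```c
#include <stdint.h>

#define PAGE_SHIFT 12                      // 4 KB pages
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)

// Toy single-level page table for a 32-bit address space:
// page_table[vpn] holds the PPN. Real hardware uses a multi-level
// radix tree and caches recent translations in the TLB.
static uint32_t page_table[1u << 20];

static uint64_t translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_SHIFT; // virtual page number
    uint32_t offset = vaddr & PAGE_MASK;   // offset within the page
    uint64_t ppn    = page_table[vpn];     // Page Table: VPN -> PPN
    return (ppn << PAGE_SHIFT) | offset;   // physical address
}
```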
Memory Mapped I/O
Not to be confused with mmap.
Port-Mapped I/O was a common method for CPU communication with peripheral devices in earlier computing systems. This technique involved the CPU using specialized instructions to interact with dedicated I/O ports of the devices. For instance, reading data from a device might involve the following x86 assembly instructions:
mov $0x3F8, %dx   # load the I/O port address (COM1 data port) into %dx
in  %dx, %al      # read a byte from that port into %al
The instructions above load the I/O port address into %dx and then read a byte from the port addressed by %dx into the %al register. Similarly, writing data to a device could be accomplished with instructions like:
mov $0x3FC, %dx   # load the I/O port address into %dx
mov $0x01, %al    # the byte to send
out %al, %dx      # write the byte in %al to the port in %dx
Port-Mapped I/O is less efficient. For example, transferring data from memory to a disk would require the CPU to read data from memory into its registers and then write that data out to the disk's I/O port. The CPU is directly involved in every step of the data transfer. The in and out instructions specifically target the I/O ports of the connected devices.
In contrast, Memory-Mapped I/O (MMIO) simplifies device interaction by mapping the physical registers and memory of peripheral devices into the kernel's virtual address space.
Once this mapping is established, communication with these devices occurs through standard memory read and write operations. For example:
mov $0x01, %eax   # the value to write
mov %eax, 0xF0    # store to the device register mapped at address 0xF0
Advantages:
- A single address bus serves both memory and all I/O devices, eliminating the need for separate I/O ports and buses.
- Devices can benefit from CPU and bus optimizations designed for memory access.
The kernel's virtual memory space allocated to each memory-mapped device can be logically divided into input regions (for data coming into the device), output regions (for data going out of the device), and control regions (for configuring and commanding the device).
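In C, such regions are typically accessed through volatile pointers so the compiler neither reorders nor elides the device accesses. A sketch, with a register layout and base address invented purely for illustration:

```c
#include <stdint.h>

// Hypothetical device register block; the layout and base address
// are made up and depend entirely on the device.
struct device_regs {
    volatile uint32_t control; // control region: commands/configuration
    volatile uint32_t status;  // output region: device -> CPU state
    volatile uint32_t data;    // input region: data going to the device
};

#define DEVICE_BASE ((uintptr_t)0xF0000000u)

static void device_write_byte(uint8_t b) {
    struct device_regs *dev = (struct device_regs *)DEVICE_BASE;
    while (dev->status & 0x1)  // hypothetical "busy" bit
        ;                      // spin until the device is ready
    dev->data = b;             // an ordinary store performs the I/O
}
```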
Direct Memory Access (DMA)
Following the widespread adoption of Memory-Mapped I/O, DMA controllers became prevalent. DMA enables peripherals to directly access system RAM without constant CPU intervention. The CPU initiates a DMA transfer by providing the DMA controller with the physical address in RAM, the I/O device's input/output region, and the number of bytes to transfer. The DMA controller then handles the data transfer autonomously. It's important to note that traditional DMA controllers operate using physical addresses and do not inherently understand virtual addresses.
Some older DMA controllers had limitations, such as only being able to address the first 4 GB of physical memory due to 32-bit addressing. In such cases, the CPU might need to re-copy data between this lower physical memory region and the actual memory locations used by user applications.
To overcome this limitation and enhance flexibility, Input/Output Memory Management Units (IOMMUs) were introduced. An IOMMU functions similarly to the CPU's MMU but performs address translation for I/O devices, allowing DMA transfers to be described with virtual addresses.
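Programming a DMA transfer then amounts to filling in a few memory-mapped registers. A sketch, with a controller register layout invented purely for illustration:

```c
#include <stdint.h>

// Hypothetical DMA controller registers, exposed via MMIO; real
// controllers each define their own layout.
struct dma_regs {
    volatile uint32_t src_addr; // physical RAM address
    volatile uint32_t dst_addr; // device input/output region
    volatile uint32_t length;   // number of bytes to transfer
    volatile uint32_t control;  // bit 0 = start, bit 1 = done
};

static void dma_start(struct dma_regs *dma,
                      uint32_t src, uint32_t dst, uint32_t len) {
    dma->src_addr = src;  // note: a physical, not virtual, address
    dma->dst_addr = dst;
    dma->length   = len;
    dma->control  = 0x1;  // kick off the transfer
    // The CPU is now free; completion is signaled by an interrupt
    // or by polling the hypothetical "done" bit.
}
```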
Device Drivers
Device drivers are essentially software modules containing instructions that enable the operating system kernel to interact with specific I/O devices.
In the era of Port-Mapped I/O, device drivers often contained extensive low-level code to manage the intricacies of communicating through dedicated I/O ports. However, with the advent of Memory-Mapped I/O and DMA, device drivers have become significantly simpler. Their primary tasks involve:
- Reading control and status bits from device registers (via MMIO).
- Instructing the DMA controller to initiate data transfers between specified physical RAM addresses and the I/O device's input/output regions (address and number of bytes).
Device drivers are responsible for translating the virtual addresses used by applications into the physical addresses required by DMA controllers.
Note that in systems without a DMA controller, device drivers will directly perform data transfers to and from memory-mapped I/O regions using standard memory access instructions.
Also note that user-space applications are generally restricted from directly accessing I/O devices. Instead, they interact with devices by reading and writing buffers within the kernel's address space; the kernel then initiates the actual I/O operations to the device on their behalf.
Memory Management in ESX Server
Recap: Memory Virtualization
Hidden Page Fault
1. Memory Overcommitment and Reclamation Strategies
Ballooning
Random Demand Paging
Evaluation
2. Content-Based Page Sharing
ESX hashes the contents of guest pages to find duplicates and tracks candidates in a hash table (a struct sketch follows the list). Each entry records:
- hash: The hash of the page content.
- MPN: The machine page number.
- VM: The VM it belongs to.
- PPN: The physical page number known to the VM.
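A sketch of what such an entry might look like; the field names follow the list above but are illustrative, not ESX's actual definitions:

```c
#include <stdint.h>

// Illustrative page-sharing hash table entry, following the fields
// listed above; not the actual ESX data structure.
struct share_entry {
    uint64_t hash; // hash of the page's contents
    uint32_t mpn;  // machine page number backing the content
    uint32_t vm;   // owning VM (for not-yet-shared "hint" entries)
    uint32_t ppn;  // the VM's physical page number
    uint32_t refs; // reference count: how many PPNs map to this MPN
};
```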
Evaluation
Considerations
3. Proportional Memory Distribution
The Problem of Memory Distribution
Calculating Share-Per-Page (ρ)
- If τ = 0, then k = 1. In this scenario, idle pages and used pages have the same price in shares.
- If τ = 0.99, then k = 100. This means idle pages are 100x more expensive in terms of shares.
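The paper's formula ties these together: for a client holding P pages with S shares, the adjusted shares-per-page ratio is ρ = S / (P · (f + k·(1−f))), where f is the fraction of pages actively in use (note that the post's f below denotes the idle fraction, i.e., 1 minus the paper's f) and k = 1/(1−τ). When memory is scarce, ESX reclaims from the client with the lowest ρ. A small numeric sketch, with made-up VM numbers:

```c
#include <stdio.h>

// Adjusted shares-per-page from the paper:
//   rho = S / (P * (active + k * (1 - active))),  k = 1 / (1 - tau)
static double shares_per_page(double S, double P, double active, double tau) {
    double k = 1.0 / (1.0 - tau); // idle-page cost multiplier
    return S / (P * (active + k * (1.0 - active)));
}

int main(void) {
    // Two VMs with equal shares and allocations; VM1 is mostly idle.
    double tau = 0.75; // so k = 4
    printf("VM1 (20%% active): rho = %.4f\n",
           shares_per_page(1000, 256, 0.20, tau));
    printf("VM2 (95%% active): rho = %.4f\n",
           shares_per_page(1000, 256, 0.95, tau));
    // Memory is reclaimed from the VM with the lower rho (VM1 here).
    return 0;
}
```

With τ = 0.75, the mostly idle VM1 ends up with roughly a third of VM2's ρ and is therefore the reclamation target.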
Determining the Fraction of Idle Pages (f)
ESX estimates f by statistical sampling: each period, it invalidates a random sample of n pages and counts the number t that the guest touches (faults on). The estimate combines (a sketch follows the list):
- A slow exponentially weighted moving average of the ratio of faults to samples (t/n) over many samples.
- A faster weighted average that adapts more quickly to changes.
- Another version of the faster average that incorporates samples from the current period.
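A sketch of the estimation, with smoothing constants invented for illustration. Taking the maximum of the averages lets the activity estimate rise quickly when a VM wakes up while decaying slowly when it goes idle:

```c
// Illustrative idle-fraction estimation: each sampling period yields
// t touched pages out of n sampled. The smoothing constants here are
// made up; ESX takes the max of three such averages (a slow one, a
// fast one, and a fast one that includes the in-progress period).
struct idle_estimator {
    double slow; // long-horizon EWMA of t/n (fraction of active pages)
    double fast; // short-horizon EWMA of t/n
};

static double update(struct idle_estimator *e, double t, double n) {
    double sample = t / n; // fraction of sampled pages that were touched
    e->slow = 0.98 * e->slow + 0.02 * sample;
    e->fast = 0.80 * e->fast + 0.20 * sample;
    double active = (e->slow > e->fast) ? e->slow : e->fast; // the max
    return 1.0 - active; // the idle fraction f used by the tax
}
```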
Evaluation
- No Taxes (τ = 0): When no taxes are applied, each VM is allocated a similar amount of memory, even if one VM is idle and not processing anything.
- Tax (τ = 0.75, so k = 4): With a tax rate of 0.75, idle pages become four times more expensive than used pages. In this scenario, if VM2 has nearly 100% utilization, its memory limit is increased. Conversely, if VM1 has idle pages, its memory limit is decreased because it's paying a higher price per page. If both VM1 and VM2 are given the same number of shares, VM1 will exhaust its shares on fewer pages.
4. Adhering to the User Limits
Min Memory and Reclamation
- First, ballooning will kick in to reclaim pages. The guest OS will be "fooled" into believing that it is nearly out of memory (close to its maximum), so it will itself trigger demand paging.
- If ballooning doesn't work, then the hypervisor will itself page out the pages.
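Conceptually, the balloon is a small driver inside the guest that pins pages and hands their page numbers to the hypervisor over a private channel. A heavily simplified sketch; the hypercall and translation helpers below are hypothetical:

```c
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE 4096

// Hypothetical hypervisor interface; a real balloon driver uses a
// private channel defined by the hypervisor.
extern void hypercall_release_page(uint64_t ppn);
extern uint64_t virt_to_ppn(void *addr);

// Inflate the balloon by n pages: allocate and hold guest pages, then
// tell the hypervisor it may reclaim the machine pages backing them.
// The guest's own demand paging handles the resulting pressure.
static void balloon_inflate(size_t n) {
    for (size_t i = 0; i < n; i++) {
        void *page = aligned_alloc(PAGE_SIZE, PAGE_SIZE);
        if (!page)
            break; // guest is already under memory pressure
        hypercall_release_page(virt_to_ppn(page));
        // The page is intentionally never freed while the balloon
        // stays inflated.
    }
}
```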
Page Sharing and Memory Accounting
Memory Pressure States
- High (6% free): Plenty of free memory; no constraints are applied.
- Soft (4% free): Ballooning is initiated to reclaim pages and deflate VMs that are operating above their minimum memory limits. Again, note that only the minimum limit is strictly honored.
- Hard (2% free): The system begins paging out memory to disk.
- Low (1% free): Execution of VMs is blocked, preferably those that are above their maximum limits.
Evaluation
- Startup: Upon boot, Windows touches all of its pages to zero them out, so every VM initially consumes memory up to its maximum limit. Since many pages are zeroed, this also creates opportunities for page sharing, which helps keep overall memory usage in check.
- Post-Startup: After the initial startup phase, ballooning activates to reclaim pages allocated during start-up. This is necessary because otherwise, VMs would continue to use up to their maximum allocated limits. Page sharing becomes less effective after startup as more pages contain non-zero data.
- SQL Server Behavior: When Microsoft SQL Server is idle, it is still allocated its minimum memory limit. However, when it processes a large query, many of its pages become active. At this point, proportional share allocation ensures it receives more pages.