
Paper Insights #9 - CacheSack: Admission Optimization for Google Datacenter Flash Caches

This paper was presented by Google at Usenix ATC '22. Usenix ATC is recognized as one of the major conferences in the field of computer systems. The paper presents and resolves an interesting challenge that arises in distributed file systems. While machine learning models are commonly employed for such statistical problems, this paper takes a distinctly mathematical approach. That is not unexpected, as author Tzu-Wei Yang has a Ph.D. in Mathematics from Stanford.

Paper Link

Recommended Reads:

Let's start with some basic concepts.

Fractional Knapsack Problem

The problem involves a knapsack with a capacity of N and a set of items, each with a cost Ci and a size Si.

Goal: Minimize the total cost of filling the knapsack completely, with the key feature that items can be taken in fractions.

Solution:

  1. Calculate the cost-to-size ratio (Ci/Si) for each item.
  2. Sort the items in ascending order based on their cost-to-size ratio.
  3. Fill the knapsack by taking items in this sorted order, taking fractions of items as needed to completely fill the knapsack.
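
To make the greedy procedure concrete, here is a minimal Python sketch; the item list and capacity in the usage example are made up for illustration.

```python
def fill_knapsack_min_cost(items, capacity):
    """Greedily fill a knapsack of size `capacity` at minimum total cost.

    `items` is a list of (cost, size) pairs; fractions of items are allowed.
    """
    # Take items with the smallest cost-to-size ratio first.
    items = sorted(items, key=lambda cs: cs[0] / cs[1])
    total_cost, remaining = 0.0, capacity
    for cost, size in items:
        if remaining <= 0:
            break
        take = min(size, remaining)          # whole item, or just the fraction needed
        total_cost += cost * (take / size)   # cost scales with the fraction taken
        remaining -= take
    return total_cost

# Hypothetical example: three items, knapsack of size 10.
print(fill_knapsack_min_cost([(4, 2), (9, 3), (5, 5)], 10))  # -> 18.0
```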

Convex Hull

For a given set of points, the convex hull is the smallest convex polygon that encloses all the points.



The lower convex hull is the portion of the convex hull that, when projected onto the x-axis, casts the same shadow as the entire convex hull. Essentially, it's the "bottom" part of the hull.



    Graham's Scan Algorithm

    Graham's scan is a classic algorithm to compute the convex hull of a set of points:

    1. Initialization:

      • Find the point with the lowest y-coordinate (and leftmost if there are ties). This point is guaranteed to be part of the convex hull.
      • Push this lowest point onto a stack.
    2. Sorting:

      • Sort the remaining points in counterclockwise order around the initial lowest point. This sorting is based on the angle formed by the x-axis and the line connecting the initial point to each other point.
    3. Hull Construction:

      • Iterate through the sorted points:
        • For each point, push it onto the stack.
        • While the stack contains at least three points and the last three points on the stack form a non-left turn (i.e., a right turn or straight line), pop the second-to-last point from the stack.
      • Once the iteration is complete, the points remaining on the stack form the convex hull.

The non-left turn check ensures that the hull remains convex: a left turn means the boundary keeps turning counterclockwise, while a right turn (or collinear points) would create a concavity and is therefore removed.
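
Below is a compact Python sketch of Graham's scan, assuming points are given as (x, y) tuples; it is meant to be illustrative rather than a production-grade implementation (degenerate inputs such as all-collinear point sets are not handled carefully).

```python
from math import atan2

def cross(o, a, b):
    """Cross product of vectors OA and OB; positive means a left (counterclockwise) turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def graham_scan(points):
    """Return the convex hull of `points` in counterclockwise order."""
    # 1. Pivot: the point with the lowest y-coordinate (leftmost on ties).
    pivot = min(points, key=lambda p: (p[1], p[0]))
    # 2. Sort the remaining points by polar angle around the pivot (closer points first on ties).
    rest = sorted(
        (p for p in points if p != pivot),
        key=lambda p: (atan2(p[1] - pivot[1], p[0] - pivot[0]),
                       (p[0] - pivot[0]) ** 2 + (p[1] - pivot[1]) ** 2))
    # 3. Scan: pop the middle point while the last three make a non-left turn.
    hull = [pivot]
    for p in rest:
        hull.append(p)
        while len(hull) >= 3 and cross(hull[-3], hull[-2], hull[-1]) <= 0:
            hull.pop(-2)  # discard the second-to-last point
    return hull

print(graham_scan([(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]))  # -> [(0, 0), (2, 0), (2, 2), (0, 2)]
```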

    Flash Memory

    As discussed in a previous article, flash is the technology underpinning SSDs. I will go over Flash technology briefly in this article.

    SSD Organization

    SSDs organize data into pages, typically 2-4 KB (analogous to blocks in HDDs), which are grouped into blocks of 32-128 pages.


A key characteristic of flash memory (a form of EEPROM) is that a flash cell must be explicitly erased before it can be rewritten. This creates a disparity between write and erase operations: writes occur at the page level, while erasures can only be performed at the block level.

    Write Performance

SSDs employ a Flash Translation Layer (FTL) to map Logical Block Addresses (LBAs) from the operating system to Physical Block Addresses (PBAs). New writes are directed to free blocks. When free blocks are depleted, the SSD reclaims space by finding blocks whose data has been deleted (trimmed) and erasing them.

The FTL handles sequential writes effectively. Random writes, however, present a challenge: modifying even a single byte in an already-written block forces the block's contents to be copied, together with the updated byte, to a freshly erased block. The extra physical writes incurred this way are known as write amplification.
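
Write amplification is commonly quantified as the ratio of bytes physically written to flash to the bytes the host logically asked to write. A tiny illustrative sketch (the numbers are made up):

```python
def write_amplification_factor(physical_bytes_written, logical_bytes_written):
    """WAF = bytes physically written to flash / bytes logically written by the host."""
    return physical_bytes_written / logical_bytes_written

# Hypothetical example: updating 4 KB of data forces a 256 KB block to be rewritten.
print(write_amplification_factor(256 * 1024, 4 * 1024))  # -> 64.0
```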

    Read Performance

    SSDs perform reads in parallel and at significantly higher speeds than HDDs due to the absence of a physical read head.

    Life Span

Flash memory cells can endure only a finite number of write/erase cycles before wearing out; as a result, an SSD's lifespan is limited. The FTL distributes write operations evenly across all blocks (wear leveling), extending the drive's lifespan, but the drive will still wear out under sustained writes.

    Colossus Caching

    Note that the Google File System has been rebranded as Colossus.

Problems with SSDs:

• They have less capacity and are more costly compared to HDDs.
• They wear out after a certain number of writes.

Problem with HDDs:

• Throughput is limited due to the use of the spindle head.

    In the vast infrastructure of data centers like Google's, the sheer scale of storage demands makes an all-SSD approach economically impractical. Therefore, Colossus chunk servers (D servers) primarily rely on HDDs for their storage capacity.

    To mitigate HDD's read bottleneck, caching becomes essential. Caching frequently accessed data blocks allows for improved read speeds without the need to deploy a prohibitively large number of HDDs just to get good read performance. This strategy exemplifies how software effectively addresses hardware limitations in response to ever-changing industry needs.

The caching system for Colossus has multiple layers: the D Buffer Cache and the Flash Cache.


    D Buffer Cache

    The D Buffer Cache stores blocks directly in DRAM, offering fast access. However, due to DRAM's high cost, its capacity is inherently limited.

    Flash Cache

    The Flash Cache, on the other hand, uses SSDs to store recently accessed blocks.

Optimizing this layer presents significant challenges, particularly in selecting an appropriate eviction algorithm. Systems research has extensively explored the nuances of SSD caching. For instance:

    • The FIFO (First-In, First-Out) algorithm tends to produce sequential write patterns on SSDs, which are efficient.
    • Conversely, algorithms like LRU (Least Recently Used) can lead to random write patterns, resulting in substantial write amplification and reduced SSD lifespan.

    Colossus Flash Cache

    While the addition of a flash cache layer significantly enhances Colossus's performance, its limited capacity necessitates strategic caching decisions. 

    Total Cost of Ownership

    The primary optimization goal for this layer is minimizing the Total Cost of Ownership (TCO). In this context, TCO represents the overall expense incurred by Google for hosting the SSDs used for caching. A key factor influencing TCO is the lifespan of the SSDs, which is inversely proportional to the number of write operations. Therefore, the caching algorithm must intelligently select which data to cache, aiming to reduce writes and extend SSD longevity.

    Client Interaction

All client reads are routed through the cache service before reaching Colossus. This design allows the cache service to accurately track cache misses and efficiently bring the required data blocks from the underlying HDD storage into the SSD cache.

    By carefully managing the data stored within the limited SSD capacity, the Colossus Flash Cache strives to balance performance gains with cost efficiency, ultimately minimizing the overall TCO.

    Categorization & Admission Policies

    The flash cache system allows users to define categories for data, enabling fine-grained control over caching behavior. For example, databases like BigTable and Spanner have categories based on table name and locality group combinations.

Each category can be served by any of four admission policies, which determine when a block is added to the cache:

    • AdmitOnWrite: Blocks are inserted into the cache immediately upon being written or when a read request results in a cache miss. 
    • AdmitOnMiss: Blocks are cached only when a read request results in a cache miss.
    • AdmitOnSecondMiss: Blocks are cached after experiencing two consecutive cache misses.
    • NeverAdmit: Blocks are never cached, effectively bypassing the flash cache layer.
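
As a rough illustration of the policy semantics (this is my own sketch, not the Colossus implementation), the admission decision for a single access might look like this:

```python
from enum import Enum

class Policy(Enum):
    ADMIT_ON_WRITE = "AdmitOnWrite"
    ADMIT_ON_MISS = "AdmitOnMiss"
    ADMIT_ON_SECOND_MISS = "AdmitOnSecondMiss"
    NEVER_ADMIT = "NeverAdmit"

def should_admit(policy, is_write, is_miss, prior_misses):
    """Decide whether this access should insert the block into the flash cache.

    `prior_misses` is the number of misses already recorded for the block
    (e.g., in a ghost cache); all names here are illustrative.
    """
    if policy is Policy.NEVER_ADMIT:
        return False
    if policy is Policy.ADMIT_ON_WRITE:
        return is_write or is_miss            # admit on writes and on read misses
    if policy is Policy.ADMIT_ON_MISS:
        return is_miss
    if policy is Policy.ADMIT_ON_SECOND_MISS:
        return is_miss and prior_misses >= 1  # this is the block's second miss
    return False

# Hypothetical example: a read miss for a block that has already missed once.
print(should_admit(Policy.ADMIT_ON_SECOND_MISS, is_write=False, is_miss=True, prior_misses=1))  # -> True
```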

    CacheSack

The goal of CacheSack is to find an optimal assignment <category, policy> => size, i.e., the amount of cache space that each <category, policy> pair is assigned.

    The Fractional Knapsack

    CacheSack is essentially a fractional knapsack problem:

    • Knapsack: Represents the flash cache with a total space of N.
• Items: Represent the different user-defined categories. Each category's space can be split among its four possible admission policies (AdmitOnWrite, AdmitOnMiss, AdmitOnSecondMiss, NeverAdmit), each with an associated cost and space usage.

      This problem involves assigning cache space to categories and policies. Unlike the traditional knapsack problem:

      • The cost is not known beforehand, necessitating an online learning approach.
      • Cost can be computed at the category level only. However, the space allocated to a category must be distributed among its associated policies.

      Online Learning

      The solution begins with a random initial allocation of cache space across all <category, policy> pairs. Subsequently, the system monitors events for a 5-minute interval, using these events to calculate the cost. Based on this calculated cost, a knapsack optimization is performed to reallocate cache space to categories and policies.

      For example, if a category experiences frequent reads of recently written blocks, and its AdmitOnWrite policy has a small initial allocation, the optimization process will likely increase the size allocated to that category's AdmitOnWrite policy.
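
A minimal sketch of this control loop, assuming hypothetical `estimate_costs` and `solve_knapsack` callbacks (none of these names come from the paper):

```python
import random

MONITOR_INTERVAL_SECS = 5 * 60   # events are accumulated over 5-minute windows

def initial_allocation(categories, policies):
    """Random starting fractions for every <category, policy> pair, normalized per category."""
    alloc = {}
    for c in categories:
        weights = [random.random() for _ in policies]
        total = sum(weights)
        for p, w in zip(policies, weights):
            alloc[(c, p)] = w / total
    return alloc

def learning_step(window_events, cache_size, estimate_costs, solve_knapsack):
    """One round: turn the last window's events into costs, then re-solve the knapsack."""
    costs = estimate_costs(window_events)     # per <category, policy> cost estimates
    return solve_knapsack(costs, cache_size)  # new space allocation per <category, policy>
```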

      Category Cost Computation

      The cost function incorporates two key factors:

• HDD Reads: The number of disk reads, representing the HDD spindle time consumed.
      • SSD Writes: The volume of data written to SSD, contributing to SSD wear.

      The cost metric must balance these competing concerns.

      LRU Approximations

      CacheSack approximates flash cache behavior as an LRU cache. It employs a threshold D. This threshold determines whether a block access is considered a cache hit or miss:

• If the time elapsed since the last access, d, is less than or equal to D (d <= D), the access is assumed to be a cache hit.
• If the time elapsed since the last access, d, is greater than D (d > D), the access is considered a cache miss.

      The paper asserts that this LRU approximation is crucial and effective. However, the reasoning behind its efficacy was not intuitive for me!

      To maintain the last access time and, crucially, the last two access times for the AdmitOnSecondMiss policy, the system utilizes a ghost cache. It's important to note that this ghost cache is an approximation, as it cannot feasibly store every single entry. Instead, it maintains access time data for the most recent four hours. Any block that remains unaccessed for a period exceeding four hours has its access information lost. This means that if a block experiences a miss, and then, after a four-hour interval, experiences another miss, the system requires one additional miss to occur before that block becomes eligible for admission into the cache under the AdmitOnSecondMiss policy.
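
A hedged Python sketch of this bookkeeping (the data structures and the exact expiry mechanism are simplified guesses based on the description above):

```python
GHOST_RETENTION_SECS = 4 * 60 * 60  # access history is kept for roughly four hours

class GhostCache:
    """Tracks recent access times per block to support the LRU approximation."""

    def __init__(self):
        self.last_access = {}   # block_id -> timestamp of the most recent access
        self.prev_access = {}   # block_id -> timestamp of the access before that

    def record_access(self, block_id, now):
        self._expire(now)
        if block_id in self.last_access:
            self.prev_access[block_id] = self.last_access[block_id]
        self.last_access[block_id] = now

    def would_hit(self, block_id, now, threshold_d):
        """LRU approximation: a hit iff the block was accessed within the last D seconds."""
        last = self.last_access.get(block_id)
        return last is not None and (now - last) <= threshold_d

    def _expire(self, now):
        # Blocks unaccessed for more than four hours lose their history entirely.
        stale = [b for b, t in self.last_access.items() if now - t > GHOST_RETENTION_SECS]
        for b in stale:
            self.last_access.pop(b, None)
            self.prev_access.pop(b, None)
```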

      D Buffer Cache Simulator

Note that the LRU approximation above applies only to the flash cache. The authors use a black-box tool called the D buffer cache simulator to determine whether a flash cache miss would also have been a D buffer cache miss; only then is it counted toward the HDD read cost.

      Cost Computation

      Within each category, individual blocks are assigned to specific policies. This assignment can be probabilistic, proportional to the amount of cache space allocated to that policy.

      Once a block is assigned a policy, the cost associated with accessing that block is calculated according to the chosen policy's behavior, as follows:

      Disk Read Cost

      • AdmitOnWrite: 0 if the block was written or accessed within the last D minutes (LRU approximation); 0 if the block was found in D buffer cache (simulation); 1 otherwise.
      • AdmitOnMiss: 0 if the block was accessed within the last D minutes (LRU approximation); 0 if the block was found in D buffer cache (simulation); 1 otherwise.
      • AdmitOnSecondMiss: 0 if the block was accessed within the last D minutes (LRU approximation); 0 if the block was found in D buffer cache (simulation); 1 otherwise. 
      • NeverAdmit: The total number of read requests from the client.

      Bytes Written Cost

      • AdmitOnWrite: The total number of write operations from the client.
      • AdmitOnMiss/AdmitOnSecondMiss: 1 if a write operation occurred within the monitoring interval (5 minutes) as a result of miss/second miss.
      • NeverAdmit: 0 (blocks are never written to the cache).
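
To tie these rules together, here is a rough Python sketch of the per-access disk-read cost under the LRU approximation and the buffer cache simulation (the function and its arguments are my own illustration, not the paper's code):

```python
def disk_read_cost(policy, time_since_last_access, threshold_d, in_buffer_cache):
    """Approximate disk-read cost of a single read access under a given policy."""
    if policy == "NeverAdmit":
        return 1  # every client read counts
    # AdmitOnWrite / AdmitOnMiss / AdmitOnSecondMiss:
    if time_since_last_access is not None and time_since_last_access <= threshold_d:
        return 0  # assumed flash hit by the LRU approximation
    if in_buffer_cache:
        return 0  # absorbed by the D buffer cache (per the simulator)
    return 1      # otherwise, one HDD read

# Hypothetical example: a read 600 s after the last access, with D = 300 s.
print(disk_read_cost("AdmitOnMiss", time_since_last_access=600, threshold_d=300, in_buffer_cache=False))  # -> 1
```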

      Solving the Knapsack

As noted earlier, one difference between the regular knapsack and CacheSack is that four fractions need to be computed for each item (i.e., category). This makes the optimization a little more involved.

The mathematical derivation is provided in Appendix A. It is not entirely trivial, but I will walk through some of the trickier bits.

      Formal Cost Representation

      Essentially, we have the following cost representation:


      Here D is the modeled threshold described previously.

The total cost (VCP) for category C and policy P is calculated as the sum of the disk read cost (SCP) and the bytes-written cost (WCP).

Let αCP be the proportion of blocks in category C admitted under policy P. The total cost is then the weighted sum of the cost for each category and policy, and it is minimized via linear programming:

This cost minimization must happen under constraints that are fairly easy to understand: the second constraint ensures the fractions for each category sum to one, while the third limits the total cache usage to the cache size.
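
Based on the description above, the linear program can be written roughly as follows (this is my reconstruction; the paper's exact formulation is in Appendix A):

```latex
% Reconstructed linear program (my notation, following the blog's alpha, V, U, N):
\begin{aligned}
\min_{\alpha}\quad & \sum_{C}\sum_{P} \alpha_{CP}\, V_{CP} \\
\text{s.t.}\quad   & \alpha_{CP} \ge 0 \quad \text{for all } C, P, \\
                   & \sum_{P} \alpha_{CP} = 1 \quad \text{for each category } C, \\
                   & \sum_{C}\sum_{P} \alpha_{CP}\, U_{CP} \le N.
\end{aligned}
```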

Now, we need to define S and W for each category and policy. They are trivial for AdmitOnWrite and NeverAdmit, so let's walk through the cases of AdmitOnMiss and AdmitOnSecondMiss.

      Under the AdmitOnMiss policy:

• A disk read (S) occurs when the time since the last access exceeds the threshold D and the block is also missing from the buffer cache (B).
• The cache-byte-time usage (U) is the block's byte size multiplied by its residency time. The residency time is the total time the block spends in the cache, i.e., a sum over all accesses i of the minimum of di (the gap between consecutive access times ti and ti-1) and D.
• Lastly, the write cost (W) is the block's byte size multiplied by the number of times the block is (re)admitted, which happens whenever d, the time since the last access, exceeds D.

      These are then summed over all the blocks to get the corresponding value for a policy.

      Under the AdmitOnSecondMiss policy:
• A block will be in the cache if two consecutive accesses fall within the threshold D (i.e., di-1 <= D and di <= D). Otherwise, when max(di, di-1) > D and the block is not in the buffer cache (B), a disk read is required.

• For U, the residency contribution is di if the access was a cache hit, or D (the maximum) if the block was inserted at ti-1.
      • I will leave reasoning about W as an exercise to the readers.
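
To make the AdmitOnMiss case concrete, here is one way to write the per-block terms, based purely on the prose above (the indicator notation is mine; the paper's exact formulas live in Appendix A):

```latex
% Per-block terms under AdmitOnMiss, with access times t_1 < t_2 < \dots,
% gaps d_i = t_i - t_{i-1}, block byte size "size", and B = hit in the D buffer cache.
\begin{aligned}
S &= \sum_{i} \mathbf{1}\{d_i > D\} \cdot \mathbf{1}\{\text{not } B\} && \text{(HDD reads)} \\
U &= \text{size} \cdot \sum_{i} \min(d_i, D)                          && \text{(cache-byte-time usage)} \\
W &= \text{size} \cdot \sum_{i} \mathbf{1}\{d_i > D\}                 && \text{(bytes written on re-admission)}
\end{aligned}
```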

      Solving the Linear Program using Convex Hull

      The linear program is solved using a convex hull approach, which can be broken down into the following steps:

      • For each category, we calculate UP (usage) and VP (cost) for each policy P.
      • These values are then plotted as points (UP, VP) on a graph, with each category generating its own plot. Since there are four policies, each category's plot will contain four points.
      • For each category's plot, we construct the lower convex hull. This hull represents the set of efficient policy combinations for that category, minimizing cost for a given usage. We identify the lines (segments) that make up the lower convex hulls of all categories.
• These segments are then sorted by slope, steepest (most negative) first. The magnitude of a segment's slope, −ΔVP/ΔUP, is the cost-to-usage ratio, i.e., the cost reduction obtained per unit of additional usage.
      • We greedily assign fractions to the points on these lines, starting with the line having the steepest negative slope, until the total usage constraint (∑UCP) is met. Essentially, at each step, we are choosing the category and policy combination that offers the greatest cost reduction for each unit of usage.
      Example:
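
Here is a hedged Python sketch of these steps, assuming each category supplies four (usage, cost) points, one per policy; it illustrates the idea and is not the paper's implementation:

```python
def _cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def lower_hull(points):
    """Lower convex hull of (usage, cost) points, left to right (monotone chain)."""
    hull = []
    for p in sorted(points):
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull

def solve_cachesack(category_points, cache_size):
    """Greedy allocation over lower-hull segments.

    `category_points` maps a category name to its four (usage, cost) policy points.
    Returns the cache usage assigned to each category.
    """
    segments = []    # (slope, category, segment length in usage)
    assignment = {}  # category -> assigned usage
    budget = cache_size

    for cat, pts in category_points.items():
        hull = lower_hull(pts)
        assignment[cat] = hull[0][0]   # start every category at its lowest-usage hull point
        budget -= hull[0][0]
        for (u0, v0), (u1, v1) in zip(hull, hull[1:]):
            if u1 <= u0:
                continue                    # skip degenerate segments
            slope = (v1 - v0) / (u1 - u0)   # cost change per unit of extra usage
            if slope < 0:                   # only segments that reduce cost are worth buying
                segments.append((slope, cat, u1 - u0))

    segments.sort(key=lambda s: s[0])       # steepest cost reduction per unit of usage first
    for slope, cat, length in segments:
        if budget <= 0:
            break
        take = min(length, budget)          # the final segment may be taken fractionally
        assignment[cat] += take
        budget -= take
    return assignment

# Hypothetical example: two categories, each with four (usage, cost) policy points.
print(solve_cachesack({
    "bigtable": [(0, 10), (2, 6), (5, 3), (9, 2)],
    "spanner":  [(0, 8),  (1, 7), (4, 2), (6, 1)],
}, cache_size=8))  # -> {'bigtable': 4, 'spanner': 4}
```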

      Online Learning

      The above knapsack formulation is solved for all possible values of D. The one that obtains the minimum cost is chosen and the assignment corresponding to it is the final assignment.
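
A one-line sketch of this outer loop, assuming a hypothetical `solve_for_d` that returns a (total_cost, assignment) pair for a given threshold:

```python
def best_assignment(candidate_ds, solve_for_d):
    """Solve the allocation for every candidate threshold D and keep the cheapest result."""
    return min((solve_for_d(d) for d in candidate_ds), key=lambda result: result[0])
```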

      Evaluation

      It is worth noting that users are not given the ability to define their own space allocation for categories. This control is entirely managed by the flash cache, meaning that the allocation of cache space is determined solely by the system's automation. This lack of user-configurable tuning actually serves to incentivize users to adopt the cache more readily, as they are relieved of the burden of manual optimization.

The implementation of CacheSack has demonstrated a reduction in overall TCO. In production, CacheSack has consistently achieved lower cost than statically assigned policies, indicating its effectiveness at resource management.

      Simulation results provide further insight into CacheSack's adaptive behavior. Specifically:

      • For smaller cache sizes, the simulations suggest that AdmitOnSecondMiss should receive a larger portion of the available space. This strategy is good for minimizing the impact of transient read blocks within the cache. 
      • Conversely, for larger cache sizes, the simulations indicate that AdmitOnMiss and AdmitOnWrite should be prioritized.
      CacheSack's learning algorithm effectively navigates these varying scenarios by dynamically adjusting allocations based on the cost of each operation, thereby maintaining a high hit ratio across different cache configurations.

      Paper Review

      My interest in this paper stemmed from observing high spindle utilization in a Colossus-related project, prompting me to investigate caching inefficiencies. This led me to discover and explore this paper. It offers valuable insights into SSD caching, but its niche focus makes it relevant primarily to those directly working in this area.

      It's a fascinating read for anyone dealing with SSD caching, but its specialized nature might not appeal to a broader audience.
