This paper was presented by Google at USENIX ATC '22. USENIX ATC is recognized as one of the major conferences in the field of computer systems. The paper presents and resolves an interesting challenge that arises in distributed file systems. While machine learning models are commonly employed for statistical problem-solving, this paper takes a distinctly mathematical approach. This is not unexpected: author Tzu-Wei Yang has a Ph.D. in Mathematics from Stanford.
Recommended Reads:
- Paper Insights - F2FS: A New File System for Flash Storage, where I introduced how flash storage works.
- Paper Insights - The Google File System, where I introduced a distributed file system (GFS).
Fractional Knapsack Problem
The problem involves a knapsack with a capacity of N, and a set of items, each with a cost Ci and a size Si.
Goal: Minimize the total cost of filling the knapsack completely, with the key feature that items can be taken in fractions.
Solution:
- Calculate the cost-to-size ratio (Ci/Si) for each item.
- Sort the items in ascending order based on their cost-to-size ratio.
- Fill the knapsack by taking items in this sorted order, taking fractions of items as needed to completely fill the knapsack.
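A minimal sketch of this greedy solution in Python (the `Item` fields and the function name are mine, for illustration):

```python
from dataclasses import dataclass

@dataclass
class Item:
    cost: float  # Ci: cost of taking the whole item
    size: float  # Si: size of the whole item

def min_cost_fill(items: list[Item], capacity: float) -> float:
    """Fill a knapsack of `capacity` completely at minimum total cost,
    taking fractions of items as needed."""
    total_cost = 0.0
    remaining = capacity
    # Take items in ascending order of cost-to-size ratio.
    for item in sorted(items, key=lambda it: it.cost / it.size):
        if remaining <= 0:
            break
        take = min(item.size, remaining)  # take all of it, or just a fraction
        total_cost += item.cost * (take / item.size)
        remaining -= take
    if remaining > 0:
        raise ValueError("the items cannot fill the knapsack")
    return total_cost
```

For example, `min_cost_fill([Item(cost=4, size=2), Item(cost=3, size=3)], capacity=4)` takes all of the second item (ratio 1) and half of the first (ratio 2), for a total cost of 5.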
Convex Hull
For a given set of points, the convex hull is the smallest convex polygon that encloses all the points.
Graham's Scan Algorithm
Graham's scan is a classic algorithm to compute the convex hull of a set of points:
Initialization:
- Find the point with the lowest y-coordinate (and leftmost if there are ties). This point is guaranteed to be part of the convex hull.
- Push this lowest point onto a stack.
Sorting:
- Sort the remaining points in counterclockwise order around the initial lowest point. This sorting is based on the angle formed by the x-axis and the line connecting the initial point to each other point.
Hull Construction:
- Iterate through the sorted points:
- For each point, push it onto the stack.
- While the stack contains at least three points and the last three points on the stack form a non-left turn (i.e., a right turn or straight line), pop the second-to-last point from the stack.
- Once the iteration is complete, the points remaining on the stack form the convex hull.
The non-left-turn check ensures that the hull remains convex: along a counterclockwise traversal, every consecutive triple of hull points must form a left turn.
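Here is a compact Python sketch of Graham's scan following the steps above (the `cross` helper and the names are mine):

```python
import math

def cross(o, a, b):
    """Cross product of vectors OA and OB; positive means a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def graham_scan(points):
    # Initialization: the lowest (then leftmost) point is on the hull.
    start = min(points, key=lambda p: (p[1], p[0]))
    # Sorting: order the rest counterclockwise around `start` by polar angle,
    # breaking ties by distance from `start`.
    rest = sorted((p for p in points if p != start),
                  key=lambda p: (math.atan2(p[1] - start[1], p[0] - start[0]),
                                 (p[0] - start[0]) ** 2 + (p[1] - start[1]) ** 2))
    stack = [start]
    for p in rest:
        # Equivalent to pushing p, then popping the second-to-last point
        # while the last three points form a non-left turn.
        while len(stack) >= 2 and cross(stack[-2], stack[-1], p) <= 0:
            stack.pop()
        stack.append(p)
    return stack  # hull vertices in counterclockwise order
```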
Flash Memory
As discussed in a previous article, flash is the technology underpinning SSDs. I will briefly go over flash technology in this article.
SSD Organization
SSDs organize data into pages, typically 2-4 KB (analogous to blocks in HDDs), which are grouped into blocks of 32-128 pages.
A key characteristic of flash memory (which utilizes EEPROM) is that a flash cell must be erased explicitly before it can be re-written. With SSDs, there is a disparity between write and erase operations: writes occur at the page level, while erasures can be performed at the block level only.
Write Performance
SSDs employ a Flash Translation Layer (FTL) to map Logical Block Addresses (LBAs) from the operating system to Physical Block Addresses (PBAs). New writes are mapped to free blocks. When free blocks are depleted, the SSD reclaims space by finding blocks whose data has been deleted (trimmed), allowing those blocks to be erased.
The FTL handles sequential writes effectively. However, random writes present a challenge: modifying even a single byte in an already-written block triggers a copy of the whole block, with the updated byte, to a new block. The device thus writes far more data than the client requested, a phenomenon known as write amplification.
Read Performance
SSDs perform reads in parallel and at significantly higher speeds than HDDs due to the absence of a physical read head.
Life Span
Flash memory cells have a finite limit on write/erase cycles before wearing out; as a result, an SSD's lifespan is limited. SSDs use the FTL to distribute write operations evenly across all blocks (wear leveling), extending the drive's lifespan. However, the drive can still wear out under continuous writes.
Colossus Caching
Note that the Google File System has been rebranded as Colossus.
Problems with SSDs:
- They have less capacity and are more costly compared to HDDs.
- They wear out after a certain number of writes.
Problems with HDDs:
- Throughput is limited due to the mechanical spindle head.
In the vast infrastructure of data centers like Google's, the sheer scale of storage demands makes an all-SSD approach economically impractical. Therefore, Colossus chunk servers (D servers) primarily rely on HDDs for their storage capacity.
To mitigate HDD's read bottleneck, caching becomes essential. Caching frequently accessed data blocks allows for improved read speeds without the need to deploy a prohibitively large number of HDDs just to get good read performance. This strategy exemplifies how software effectively addresses hardware limitations in response to ever-changing industry needs.
The caching system for Colossus has multiple layers: the D Buffer Cache and the Flash Cache.
D Buffer Cache
The D Buffer Cache stores blocks directly in DRAM, offering fast access. However, due to DRAM's high cost, its capacity is inherently limited.
Flash Cache
The Flash Cache, on the other hand, uses SSDs to store recently accessed blocks.
Optimizing this layer presents significant challenges, particularly in selecting an appropriate eviction algorithm. System research has extensively explored the nuances of SSD caching. For instance:
- The FIFO (First-In, First-Out) algorithm tends to produce sequential write patterns on SSDs, which are efficient.
- Conversely, algorithms like LRU (Least Recently Used) can lead to random write patterns, resulting in substantial write amplification and reduced SSD lifespan.
Colossus Flash Cache
While the addition of a flash cache layer significantly enhances Colossus's performance, its limited capacity necessitates strategic caching decisions.
Total Cost of Ownership
The primary optimization goal for this layer is minimizing the Total Cost of Ownership (TCO). In this context, TCO represents the overall expense incurred by Google for hosting the SSDs used for caching. A key factor influencing TCO is the lifespan of the SSDs, which shrinks as the number of write operations grows. Therefore, the caching algorithm must intelligently select which data to cache, aiming to reduce writes and extend SSD longevity.
Client Interaction
All client reads are sent to the cache service before reaching Colossus. This design allows the cache service to accurately track cache misses and efficiently retrieve the required data blocks from the underlying HDD storage into the SSD cache.
By carefully managing the data stored within the limited SSD capacity, the Colossus Flash Cache strives to balance performance gains with cost efficiency, ultimately minimizing the overall TCO.
Categorization & Admission Policies
The flash cache system allows users to define categories for data, enabling fine-grained control over caching behavior. For example, databases like BigTable and Spanner have categories based on table name and locality group combinations.
Each category can have four admission policies for blocks, determining when a block is added to the cache:
- AdmitOnWrite: Blocks are inserted into the cache immediately upon being written or when a read request results in a cache miss.
- AdmitOnMiss: Blocks are cached only when a read request results in a cache miss.
- AdmitOnSecondMiss: Blocks are cached after experiencing two consecutive cache misses.
- NeverAdmit: Blocks are never cached, effectively bypassing the flash cache layer.
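A sketch of how these policies might gate admission decisions (the enum and function are hypothetical, not from the paper):

```python
from enum import Enum

class Policy(Enum):
    ADMIT_ON_WRITE = 1
    ADMIT_ON_MISS = 2
    ADMIT_ON_SECOND_MISS = 3
    NEVER_ADMIT = 4

def should_admit(policy: Policy, is_write: bool, prior_misses: int) -> bool:
    """Decide whether a block enters the flash cache on this event.

    is_write:     True for a client write, False for a read that missed.
    prior_misses: number of earlier recorded misses for this block.
    """
    if policy is Policy.NEVER_ADMIT:
        return False
    if policy is Policy.ADMIT_ON_WRITE:
        return True                # admitted on writes and on read misses
    if is_write:
        return False               # the remaining policies ignore writes
    if policy is Policy.ADMIT_ON_MISS:
        return True                # the first miss is enough
    return prior_misses >= 1       # AdmitOnSecondMiss: this is miss #2 or later
```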
CacheSack
The goal of CacheSack is to find an optimal assignment of <category, policy> => size, i.e., the amount of cache space that each <category, policy> pair is assigned.
The Fractional Knapsack
CacheSack is essentially a fractional knapsack problem:
- Knapsack: Represents the flash cache with a total space of N.
- Items: Represent the different user-defined categories. Each category's space is split across its four possible admission policies (AdmitOnWrite, AdmitOnMiss, AdmitOnSecondMiss, NeverAdmit).
This problem involves assigning cache space to categories and policies. Unlike the traditional knapsack problem:
- The cost is not known beforehand, necessitating an online learning approach.
- Cost can be computed at the category level only. However, the space allocated to a category must be distributed among its associated policies.
Online Learning
The solution begins with a random initial allocation of cache space across all <category, policy> pairs. Subsequently, the system monitors events for a 5-minute interval, using these events to calculate the cost. Based on this calculated cost, a knapsack optimization is performed to reallocate cache space to categories and policies.
For example, if a category experiences frequent reads of recently written blocks, and its AdmitOnWrite policy has a small initial allocation, the optimization process will likely increase the size allocated to that category's AdmitOnWrite policy.
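The control loop, as I understand it, looks roughly like this (a skeleton under my own naming; `solve_knapsack` stands in for the optimization described below):

```python
import random

POLICIES = ["AdmitOnWrite", "AdmitOnMiss", "AdmitOnSecondMiss", "NeverAdmit"]
INTERVAL_SECONDS = 5 * 60  # the paper's 5-minute monitoring interval

def random_initial_allocation(categories, cache_size):
    """Arbitrary starting split of cache space across <category, policy>."""
    pairs = [(c, p) for c in categories for p in POLICIES]
    weights = [random.random() for _ in pairs]
    scale = cache_size / sum(weights)
    return {pair: w * scale for pair, w in zip(pairs, weights)}

def control_loop(categories, cache_size, collect_events, compute_costs,
                 solve_knapsack, apply_allocation):
    """Skeleton of the online loop: observe, compute costs, re-optimize.

    The four callables are stand-ins: collect_events blocks for one interval
    and returns the observed reads/writes/misses; solve_knapsack is the
    fractional-knapsack optimization described later in this article.
    """
    allocation = random_initial_allocation(categories, cache_size)
    while True:
        events = collect_events(INTERVAL_SECONDS)
        costs = compute_costs(events, allocation)     # per <category, policy>
        allocation = solve_knapsack(costs, cache_size)
        apply_allocation(allocation)
```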
Category Cost Computation
The cost function incorporates two key factors:
- HDD Reads: The number of disk reads, representing the HDD spindle time consumed.
- SSD Writes: The volume of data written to SSD, contributing to SSD wear.
The cost metric must balance these competing concerns.
LRU Approximations
CacheSack approximates flash cache behavior as an LRU cache. It employs a threshold D. This threshold determines whether a block access is considered a cache hit or miss:
- If the time elapsed since the last access, d, is less than or equal to D (d <= D), the access is assumed to be a cache hit.
- If the time elapsed since the last access, d, is greater than D (d > D), the access is considered a cache miss.
The paper asserts that this LRU approximation is crucial and effective. However, the reasoning behind its efficacy was not intuitive for me!
To maintain the last access time and, crucially, the last two access times for the AdmitOnSecondMiss policy, the system utilizes a ghost cache. It's important to note that this ghost cache is an approximation, as it cannot feasibly store every single entry. Instead, it maintains access time data for the most recent four hours. Any block that remains unaccessed for a period exceeding four hours has its access information lost. This means that if a block experiences a miss, and then, after a four-hour interval, experiences another miss, the system requires one additional miss to occur before that block becomes eligible for admission into the cache under the AdmitOnSecondMiss policy.
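A toy ghost cache along these lines (my own sketch; the two-timestamp record and four-hour retention come from the description above, everything else is illustrative):

```python
THRESHOLD_D = 600.0           # the modeled LRU threshold D (illustrative value)
RETENTION_SECONDS = 4 * 3600  # access info older than four hours is dropped

class GhostCache:
    """Tracks the last two access times per block, with bounded retention."""

    def __init__(self):
        self._times = {}  # block_id -> (previous_access, last_access)

    def record_access(self, block_id: str, now: float) -> None:
        _, last = self._times.get(block_id, (None, None))
        self._times[block_id] = (last, now)

    def is_approx_hit(self, block_id: str, now: float) -> bool:
        """LRU approximation: a re-access within D counts as a cache hit."""
        _, last = self._times.get(block_id, (None, None))
        return last is not None and now - last <= THRESHOLD_D

    def expire(self, now: float) -> None:
        """Forget blocks unseen for over four hours. A later miss on such a
        block counts as its first miss again, so AdmitOnSecondMiss needs one
        extra miss before the block becomes eligible."""
        self._times = {b: (p, last) for b, (p, last) in self._times.items()
                       if now - last <= RETENTION_SECONDS}
```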
D Buffer Cache Simulator
CacheSack also simulates the D Buffer Cache: if a read would have been served from DRAM, it is not charged a disk read in the cost model below.
Cost Computation
Within each category, individual blocks are assigned to specific policies. This assignment can be probabilistic, proportional to the amount of cache space allocated to that policy.
Once a block is assigned a policy, the cost associated with accessing that block is calculated according to the chosen policy's behavior, as follows:
Disk Read Cost
- AdmitOnWrite: 0 if the block was written or accessed within the last D minutes (LRU approximation); 0 if the block was found in D buffer cache (simulation); 1 otherwise.
- AdmitOnMiss: 0 if the block was accessed within the last D minutes (LRU approximation); 0 if the block was found in D buffer cache (simulation); 1 otherwise.
- AdmitOnSecondMiss: 0 if the block was accessed within the last D minutes (LRU approximation); 0 if the block was found in D buffer cache (simulation); 1 otherwise.
- NeverAdmit: The total number of read requests from the client.
Bytes Written Cost
- AdmitOnWrite: The total number of write operations from the client.
- AdmitOnMiss/AdmitOnSecondMiss: 1 if a cache insertion occurred within the monitoring interval (5 minutes) as a result of a miss/second miss.
- NeverAdmit: 0 (blocks are never written to the cache).
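To make the rules above concrete, here is a per-block sketch over one monitoring interval (simplified and mine, not the paper's code; client writes under AdmitOnWrite are omitted for brevity):

```python
def interval_costs(policy, accesses, threshold_d, block_bytes):
    """Disk reads and flash bytes written for one block in one interval.

    accesses: one (seconds_since_previous_access, was_in_buffer_cache) tuple
              per client read; the gap is None on a first-ever access.
    """
    reads = written = 0
    prior_misses = 0
    for d, in_buffer in accesses:
        if policy == "NeverAdmit":
            reads += 1                 # never cached: every read hits disk
            continue
        if (d is not None and d <= threshold_d) or in_buffer:
            continue                   # LRU-approximated hit, or served by DRAM
        reads += 1                     # a miss: read from HDD
        prior_misses += 1
        admit = (policy in ("AdmitOnWrite", "AdmitOnMiss")
                 or (policy == "AdmitOnSecondMiss" and prior_misses >= 2))
        if admit:
            written += block_bytes     # inserted into the flash cache
    return reads, written
```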
Solving the Knapsack
As noted earlier, one difference between the regular knapsack and CacheSack is that there are 4 fractions that need to be computed for each item (i.e., category). This makes the algorithm a little more involved to optimize.
The mathematical derivation is provided in Appendix A. It is not entirely trivial, but I can walk through some of the complex bits.
Formal Cost Representation
Essentially, we have the following cost representation. The total cost $V_{CP}$ for category C and policy P is the sum of the disk read cost $S_{CP}$ and the bytes-written cost $W_{CP}$:

$$V_{CP} = S_{CP} + W_{CP}$$

Here D is the modeled LRU threshold described previously.

Let $\alpha_{CP}$ be the proportion of blocks of category C admitted under policy P. The total cost is then the weighted sum of the cost for each category and policy, which is minimized through linear programming:

$$\min_{\alpha} \sum_{C} \sum_{P} \alpha_{CP} V_{CP}$$

This cost minimization happens under the following constraints, which are fairly easy to understand: $\alpha_{CP} \ge 0$; for each category C, $\sum_{P} \alpha_{CP} = 1$ (the fractions sum to one); and $\sum_{C} \sum_{P} \alpha_{CP} U_{CP} \le N$ (the total cache usage cannot exceed the cache size), where $U_{CP}$ is the cache space used.
Now, we need to define S and W for each category and policy. They are trivial for AdmitOnWrite and NeverAdmit. Let's walk through the cases of AdmitOnMiss and AdmitOnSecondMiss.
Under the AdmitOnMiss policy:
- A disk read (S) occurs when the time since the last access exceeds the threshold D and the block is missing from the buffer cache (B).
- The cache-byte-time usage (U) is the block's byte size multiplied by its residency time. The residency time is the total time the block spends in the cache: a sum, over all accesses i, of min(di, D), where di = ti − ti−1 is the gap between consecutive access times.
- Lastly, the write cost (W) is the block's byte size multiplied by the number of cache insertions; an insertion (and a disk read) happens whenever di exceeds D.
Under the AdmitOnSecondMiss policy:
- A block will be in the cache only if two consecutive gaps are within the threshold D (i.e., di−1 <= D and di <= D). Otherwise, when max(di, di−1) > D and the block is not in the buffer cache (B), a disk read is required.
- And for U, the residency time for access i is di if it was a cache hit, or D (the maximum) if the block was inserted at ti−1.
- I will leave reasoning about W as an exercise to the readers.
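Putting the AdmitOnMiss formulas into code (my own rendering; `gaps` holds the di values, the time differences between consecutive accesses):

```python
def admit_on_miss_costs(gaps, in_buffer_flags, threshold_d, block_bytes):
    """S (disk reads), U (cache-byte-time), W (bytes written) for one block
    under AdmitOnMiss.

    gaps:            d_i = t_i - t_{i-1} for each access i.
    in_buffer_flags: whether each access was absorbed by the D buffer cache.
    """
    S = U = W = 0.0
    for d, in_buffer in zip(gaps, in_buffer_flags):
        miss = d > threshold_d                  # LRU approximation
        if miss and not in_buffer:
            S += 1                              # disk read required
        if miss:
            W += block_bytes                    # block (re)inserted into flash
        U += block_bytes * min(d, threshold_d)  # residency capped at D per gap
    return S, U, W
```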
Solving the Linear Program using Convex Hull
The linear program is solved using a convex hull approach, which can be broken down into the following steps:
- For each category, we calculate UP (usage) and VP (cost) for each policy P.
- These values are then plotted as points (UP, VP) on a graph, with each category generating its own plot. Since there are four policies, each category's plot will contain four points.
- For each category's plot, we construct the lower convex hull. This hull represents the set of efficient policy combinations for that category, minimizing cost for a given usage. We identify the lines (segments) that make up the lower convex hulls of all categories.
- These segments are then sorted in decreasing order of the negative slope (−ΔV/ΔU). A segment's slope, ΔV/ΔU, is the cost change per unit of usage, so sorting this way considers the largest cost reduction per unit of usage first.
- We greedily assign fractions to the points on these lines, starting with the line having the steepest negative slope, until the total usage constraint (∑UCP) is met. Essentially, at each step, we are choosing the category and policy combination that offers the greatest cost reduction for each unit of usage.
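A sketch of the whole procedure under these assumptions (my own code and data layout; `categories` maps each category to its four per-policy (U, V) points, and the paper's Appendix A gives the formal treatment):

```python
def lower_hull(points):
    """Lower convex hull of labeled (U, V, policy) points, left to right."""
    hull = []
    for p in sorted(points):
        # Pop while the last two hull points and p do not make a left turn,
        # keeping only the efficient (cost-minimizing) frontier.
        while len(hull) >= 2 and (
            (hull[-1][0] - hull[-2][0]) * (p[1] - hull[-2][1])
            - (hull[-1][1] - hull[-2][1]) * (p[0] - hull[-2][0]) <= 0):
            hull.pop()
        hull.append(p)
    return hull

def solve_cachesack(categories, capacity):
    """Greedy hull walk: start each category at its lowest-usage hull point,
    then take hull segments in order of steepest cost reduction per unit of
    usage until the cache capacity is spent."""
    hulls, segments = {}, []
    for cat, per_policy in categories.items():
        pts = [(u, v, p) for p, (u, v) in per_policy.items()]
        hull = lower_hull(pts)
        hulls[cat] = hull
        for i in range(len(hull) - 1):
            (u1, v1, _), (u2, v2, _) = hull[i], hull[i + 1]
            if u2 > u1:
                segments.append(((v2 - v1) / (u2 - u1), cat, i))
    segments.sort()  # most negative slope (largest saving per byte-time) first

    remaining = capacity - sum(h[0][0] for h in hulls.values())
    assignment = {c: {h[0][2]: 1.0} for c, h in hulls.items()}
    progress = {c: 0 for c in hulls}  # hull segment each category is on
    for slope, cat, i in segments:
        if slope >= 0 or remaining <= 0:
            break  # no further cost reduction, or the cache is full
        if progress[cat] != i:
            continue  # segments must be taken in order along each hull
        u1, _, p1 = hulls[cat][i]
        u2, _, p2 = hulls[cat][i + 1]
        frac = min(1.0, remaining / (u2 - u1))  # fractional on the last step
        assignment[cat] = {p1: 1.0 - frac, p2: frac}
        remaining -= frac * (u2 - u1)
        if frac == 1.0:
            progress[cat] = i + 1
    return assignment  # category -> {policy: fraction of that category}
```

Because the hull slopes within a category increase from left to right, the global slope order automatically visits each category's segments in order, which is what makes the greedy walk valid.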
Evaluation
It is worth noting that users are not given the ability to define their own space allocation for categories; this control is managed entirely by the flash cache's automation. This lack of user-configurable tuning actually incentivizes users to adopt the cache more readily, as they are relieved of the burden of manual optimization.
The implementation of CacheSack has demonstrated a reduction in the overall TCO. In production, CacheSack has consistently achieved lower cost than statically assigned policies, indicating its effectiveness in resource management.
Simulation results provide further insight into CacheSack's adaptive behavior. Specifically:
- For smaller cache sizes, the simulations suggest that AdmitOnSecondMiss should receive a larger portion of the available space. This strategy limits cache pollution from blocks that are read only transiently.
- Conversely, for larger cache sizes, the simulations indicate that AdmitOnMiss and AdmitOnWrite should be prioritized.