
Paper Insights #9 - CacheSack: Admission Optimization for Google Datacenter Flash Caches

This paper was presented by Google at Usenix ATC '22. Usenix ATC is recognized as one of the major conferences in the field of computer systems. The paper presents and resolves an interesting challenge that arises in distributed file systems. While machine learning models are commonly employed for such statistical problems, this paper takes a distinctly mathematical approach. That is not unexpected, as author Tzu-Wei Yang has a Ph.D. in Mathematics from Stanford.

Paper Link

Recommended Reads:

Let's start with some basic concepts.

Fractional Knapsack Problem

The problem involves a knapsack with a capacity of N and a set of items, each with a cost Ci and a size Si.

Goal: Minimize the total cost of filling the knapsack completely, with the key feature that items can be taken in fractions.

Solution:

  1. Calculate the cost-to-size ratio (Ci/Si) for each item.
  2. Sort the items in ascending order based on their cost-to-size ratio.
  3. Fill the knapsack by taking items in this sorted order, taking fractions of items as needed to completely fill the knapsack.
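
To make the greedy procedure concrete, here is a minimal Python sketch; the item list and capacity in the usage example are made up for illustration.

```python
def fill_knapsack_min_cost(items, capacity):
    """Greedily fill a knapsack of size `capacity` at minimum total cost.

    `items` is a list of (cost, size) pairs; fractions of items are allowed.
    """
    # Take items with the smallest cost-to-size ratio first.
    items = sorted(items, key=lambda cs: cs[0] / cs[1])
    total_cost, remaining = 0.0, capacity
    for cost, size in items:
        if remaining <= 0:
            break
        take = min(size, remaining)          # whole item, or just the fraction needed
        total_cost += cost * (take / size)   # cost scales with the fraction taken
        remaining -= take
    return total_cost

# Hypothetical example: three items, knapsack of size 10.
print(fill_knapsack_min_cost([(4, 2), (9, 3), (5, 5)], 10))  # -> 18.0
```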

Convex Hull

For a given set of points, the convex hull is the smallest convex polygon that encloses all the points.



The lower convex hull is the portion of the convex hull that, when projected onto the x-axis, casts the same shadow as the entire convex hull. Essentially, it's the "bottom" part of the hull.



    Graham's Scan Algorithm

    Graham's scan is a classic algorithm to compute the convex hull of a set of points:

    1. Initialization:

      • Find the point with the lowest y-coordinate (and leftmost if there are ties). This point is guaranteed to be part of the convex hull.
      • Push this lowest point onto a stack.
    2. Sorting:

      • Sort the remaining points in counterclockwise order around the initial lowest point. This sorting is based on the angle formed by the x-axis and the line connecting the initial point to each other point.
    3. Hull Construction:

      • Iterate through the sorted points:
        • For each point, push it onto the stack.
        • While the stack contains at least three points and the last three points on the stack form a non-left turn (i.e., a right turn or straight line), pop the second-to-last point from the stack.
      • Once the iteration is complete, the points remaining on the stack form the convex hull.

The non-left turn check ensures that the hull remains convex: a left turn means the boundary keeps turning counterclockwise, while a right turn (or collinear points) would create a concavity and is therefore removed.
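
Below is a compact Python sketch of Graham's scan, assuming points are given as (x, y) tuples; it is meant to be illustrative rather than a production-grade implementation (degenerate inputs such as all-collinear point sets are not handled carefully).

```python
from math import atan2

def cross(o, a, b):
    """Cross product of vectors OA and OB; positive means a left (counterclockwise) turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def graham_scan(points):
    """Return the convex hull of `points` in counterclockwise order."""
    # 1. Pivot: the point with the lowest y-coordinate (leftmost on ties).
    pivot = min(points, key=lambda p: (p[1], p[0]))
    # 2. Sort the remaining points by polar angle around the pivot (closer points first on ties).
    rest = sorted(
        (p for p in points if p != pivot),
        key=lambda p: (atan2(p[1] - pivot[1], p[0] - pivot[0]),
                       (p[0] - pivot[0]) ** 2 + (p[1] - pivot[1]) ** 2))
    # 3. Scan: pop the middle point while the last three make a non-left turn.
    hull = [pivot]
    for p in rest:
        hull.append(p)
        while len(hull) >= 3 and cross(hull[-3], hull[-2], hull[-1]) <= 0:
            hull.pop(-2)  # discard the second-to-last point
    return hull

print(graham_scan([(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]))  # -> [(0, 0), (2, 0), (2, 2), (0, 2)]
```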

    Flash Memory

    As discussed in a previous article, flash is the technology underpinning SSDs. I will go over Flash technology briefly in this article.

    SSD Organization

    SSDs organize data into pages, typically 2-4 KB (analogous to blocks in HDDs), which are grouped into blocks of 32-128 pages.


A key characteristic of flash memory (a form of EEPROM) is that a flash cell must be explicitly erased before it can be rewritten. This creates a disparity between write and erase operations: writes occur at the page level, while erasures can only be performed at the block level.

    Write Performance

SSDs employ a Flash Translation Layer (FTL) to map Logical Block Addresses (LBAs) from the operating system to Physical Block Addresses (PBAs). New writes are directed to free blocks. When free blocks are depleted, the SSD reclaims space by finding blocks whose data has been deleted (trimmed) and erasing them.

The FTL handles sequential writes effectively. Random writes, however, present a challenge: modifying even a single byte in an already-written block forces the block's contents to be copied, together with the updated byte, to a freshly erased block. The extra physical writes incurred this way are known as write amplification.
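
Write amplification is commonly quantified as the ratio of bytes physically written to flash to the bytes the host logically asked to write. A tiny illustrative sketch (the numbers are made up):

```python
def write_amplification_factor(physical_bytes_written, logical_bytes_written):
    """WAF = bytes physically written to flash / bytes logically written by the host."""
    return physical_bytes_written / logical_bytes_written

# Hypothetical example: updating 4 KB of data forces a 256 KB block to be rewritten.
print(write_amplification_factor(256 * 1024, 4 * 1024))  # -> 64.0
```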

    Read Performance

    SSDs perform reads in parallel and at significantly higher speeds than HDDs due to the absence of a physical read head.

    Life Span

Flash memory cells can endure only a finite number of write/erase cycles before wearing out; as a result, an SSD's lifespan is limited. The FTL distributes write operations evenly across all blocks (wear leveling), extending the drive's lifespan, but the drive will still wear out under sustained writes.

    Colossus Caching

    Note that the Google File System has been rebranded as Colossus.

Problems with SSDs:

• They have less capacity and are more costly compared to HDDs.
• They wear out after a certain number of writes.

Problem with HDDs:

• Throughput is limited due to the use of the spindle head.

    In the vast infrastructure of data centers like Google's, the sheer scale of storage demands makes an all-SSD approach economically impractical. Therefore, Colossus chunk servers (D servers) primarily rely on HDDs for their storage capacity.

    To mitigate HDD's read bottleneck, caching becomes essential. Caching frequently accessed data blocks allows for improved read speeds without the need to deploy a prohibitively large number of HDDs just to get good read performance. This strategy exemplifies how software effectively addresses hardware limitations in response to ever-changing industry needs.

The caching system for Colossus has multiple layers: the D Buffer Cache and the Flash Cache.


    D Buffer Cache

    The D Buffer Cache stores blocks directly in DRAM, offering fast access. However, due to DRAM's high cost, its capacity is inherently limited.

    Flash Cache

    The Flash Cache, on the other hand, uses SSDs to store recently accessed blocks.

Optimizing this layer presents significant challenges, particularly in selecting an appropriate eviction algorithm. Systems research has extensively explored the nuances of SSD caching. For instance:

    • The FIFO (First-In, First-Out) algorithm tends to produce sequential write patterns on SSDs, which are efficient.
    • Conversely, algorithms like LRU (Least Recently Used) can lead to random write patterns, resulting in substantial write amplification and reduced SSD lifespan.

    Colossus Flash Cache

    While the addition of a flash cache layer significantly enhances Colossus's performance, its limited capacity necessitates strategic caching decisions. 

    Total Cost of Ownership

    The primary optimization goal for this layer is minimizing the Total Cost of Ownership (TCO). In this context, TCO represents the overall expense incurred by Google for hosting the SSDs used for caching. A key factor influencing TCO is the lifespan of the SSDs, which is inversely proportional to the number of write operations. Therefore, the caching algorithm must intelligently select which data to cache, aiming to reduce writes and extend SSD longevity.

    Client Interaction

All client reads are routed through the cache service before reaching Colossus. This design allows the cache service to accurately track cache misses and efficiently bring the required data blocks from the underlying HDD storage into the SSD cache.

    By carefully managing the data stored within the limited SSD capacity, the Colossus Flash Cache strives to balance performance gains with cost efficiency, ultimately minimizing the overall TCO.

    Categorization & Admission Policies

    The flash cache system allows users to define categories for data, enabling fine-grained control over caching behavior. For example, databases like BigTable and Spanner have categories based on table name and locality group combinations.

Each category can be served by any of four admission policies, which determine when a block is added to the cache:

    • AdmitOnWrite: Blocks are inserted into the cache immediately upon being written or when a read request results in a cache miss. 
    • AdmitOnMiss: Blocks are cached only when a read request results in a cache miss.
    • AdmitOnSecondMiss: Blocks are cached after experiencing two consecutive cache misses.
    • NeverAdmit: Blocks are never cached, effectively bypassing the flash cache layer.
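
As a rough illustration of the policy semantics (this is my own sketch, not the Colossus implementation), the admission decision for a single access might look like this:

```python
from enum import Enum

class Policy(Enum):
    ADMIT_ON_WRITE = "AdmitOnWrite"
    ADMIT_ON_MISS = "AdmitOnMiss"
    ADMIT_ON_SECOND_MISS = "AdmitOnSecondMiss"
    NEVER_ADMIT = "NeverAdmit"

def should_admit(policy, is_write, is_miss, prior_misses):
    """Decide whether this access should insert the block into the flash cache.

    `prior_misses` is the number of misses already recorded for the block
    (e.g., in a ghost cache); all names here are illustrative.
    """
    if policy is Policy.NEVER_ADMIT:
        return False
    if policy is Policy.ADMIT_ON_WRITE:
        return is_write or is_miss            # admit on writes and on read misses
    if policy is Policy.ADMIT_ON_MISS:
        return is_miss
    if policy is Policy.ADMIT_ON_SECOND_MISS:
        return is_miss and prior_misses >= 1  # this is the block's second miss
    return False

# Hypothetical example: a read miss for a block that has already missed once.
print(should_admit(Policy.ADMIT_ON_SECOND_MISS, is_write=False, is_miss=True, prior_misses=1))  # -> True
```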

    CacheSack

The goal of CacheSack is to find an optimal assignment <category, policy> => size, i.e., the amount of cache space that each <category, policy> pair is assigned.

    The Fractional Knapsack

    CacheSack is essentially a fractional knapsack problem:

    • Knapsack: Represents the flash cache with a total space of N.
• Items: Represent the different user-defined categories. Each category's space can be split among its four possible admission policies (AdmitOnWrite, AdmitOnMiss, AdmitOnSecondMiss, NeverAdmit), each with an associated cost and space usage.

      This problem involves assigning cache space to categories and policies. Unlike the traditional knapsack problem:

      • The cost is not known beforehand, necessitating an online learning approach.
      • Cost can be computed at the category level only. However, the space allocated to a category must be distributed among its associated policies.

      Online Learning

      The solution begins with a random initial allocation of cache space across all <category, policy> pairs. Subsequently, the system monitors events for a 5-minute interval, using these events to calculate the cost. Based on this calculated cost, a knapsack optimization is performed to reallocate cache space to categories and policies.

      For example, if a category experiences frequent reads of recently written blocks, and its AdmitOnWrite policy has a small initial allocation, the optimization process will likely increase the size allocated to that category's AdmitOnWrite policy.
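
A minimal sketch of this control loop, assuming hypothetical `estimate_costs` and `solve_knapsack` callbacks (none of these names come from the paper):

```python
import random

MONITOR_INTERVAL_SECS = 5 * 60   # events are accumulated over 5-minute windows

def initial_allocation(categories, policies):
    """Random starting fractions for every <category, policy> pair, normalized per category."""
    alloc = {}
    for c in categories:
        weights = [random.random() for _ in policies]
        total = sum(weights)
        for p, w in zip(policies, weights):
            alloc[(c, p)] = w / total
    return alloc

def learning_step(window_events, cache_size, estimate_costs, solve_knapsack):
    """One round: turn the last window's events into costs, then re-solve the knapsack."""
    costs = estimate_costs(window_events)     # per <category, policy> cost estimates
    return solve_knapsack(costs, cache_size)  # new space allocation per <category, policy>
```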

      Category Cost Computation

      The cost function incorporates two key factors:

• HDD Reads: The number of disk reads, representing the HDD spindle time consumed.
      • SSD Writes: The volume of data written to SSD, contributing to SSD wear.

      The cost metric must balance these competing concerns.

      LRU Approximations

      CacheSack approximates flash cache behavior as an LRU cache. It employs a threshold D. This threshold determines whether a block access is considered a cache hit or miss:

• If the time elapsed since the last access, d, is less than or equal to D (d <= D), the access is assumed to be a cache hit.
• If the time elapsed since the last access, d, is greater than D (d > D), the access is considered a cache miss.

      The paper asserts that this LRU approximation is crucial and effective. However, the reasoning behind its efficacy was not intuitive for me!

      To maintain the last access time and, crucially, the last two access times for the AdmitOnSecondMiss policy, the system utilizes a ghost cache. It's important to note that this ghost cache is an approximation, as it cannot feasibly store every single entry. Instead, it maintains access time data for the most recent four hours. Any block that remains unaccessed for a period exceeding four hours has its access information lost. This means that if a block experiences a miss, and then, after a four-hour interval, experiences another miss, the system requires one additional miss to occur before that block becomes eligible for admission into the cache under the AdmitOnSecondMiss policy.
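
A hedged Python sketch of this bookkeeping (the data structures and the exact expiry mechanism are simplified guesses based on the description above):

```python
GHOST_RETENTION_SECS = 4 * 60 * 60  # access history is kept for roughly four hours

class GhostCache:
    """Tracks recent access times per block to support the LRU approximation."""

    def __init__(self):
        self.last_access = {}   # block_id -> timestamp of the most recent access
        self.prev_access = {}   # block_id -> timestamp of the access before that

    def record_access(self, block_id, now):
        self._expire(now)
        if block_id in self.last_access:
            self.prev_access[block_id] = self.last_access[block_id]
        self.last_access[block_id] = now

    def would_hit(self, block_id, now, threshold_d):
        """LRU approximation: a hit iff the block was accessed within the last D seconds."""
        last = self.last_access.get(block_id)
        return last is not None and (now - last) <= threshold_d

    def _expire(self, now):
        # Blocks unaccessed for more than four hours lose their history entirely.
        stale = [b for b, t in self.last_access.items() if now - t > GHOST_RETENTION_SECS]
        for b in stale:
            self.last_access.pop(b, None)
            self.prev_access.pop(b, None)
```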

      D Buffer Cache Simulator

Note that the LRU approximation above applies only to the flash cache. The authors use a black-box tool called the D buffer cache simulator to determine whether a flash cache miss would also have been a D buffer cache miss; only then is it counted toward the HDD read cost.

      Cost Computation

      Within each category, individual blocks are assigned to specific policies. This assignment can be probabilistic, proportional to the amount of cache space allocated to that policy.

      Once a block is assigned a policy, the cost associated with accessing that block is calculated according to the chosen policy's behavior, as follows:

      Disk Read Cost

      • AdmitOnWrite: 0 if the block was written or accessed within the last D minutes (LRU approximation); 0 if the block was found in D buffer cache (simulation); 1 otherwise.
      • AdmitOnMiss: 0 if the block was accessed within the last D minutes (LRU approximation); 0 if the block was found in D buffer cache (simulation); 1 otherwise.
      • AdmitOnSecondMiss: 0 if the block was accessed within the last D minutes (LRU approximation); 0 if the block was found in D buffer cache (simulation); 1 otherwise. 
      • NeverAdmit: The total number of read requests from the client.

      Bytes Written Cost

      • AdmitOnWrite: The total number of write operations from the client.
      • AdmitOnMiss/AdmitOnSecondMiss: 1 if a write operation occurred within the monitoring interval (5 minutes) as a result of miss/second miss.
      • NeverAdmit: 0 (blocks are never written to the cache).
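
To tie these rules together, here is a rough Python sketch of the per-access disk-read cost under the LRU approximation and the buffer cache simulation (the function and its arguments are my own illustration, not the paper's code):

```python
def disk_read_cost(policy, time_since_last_access, threshold_d, in_buffer_cache):
    """Approximate disk-read cost of a single read access under a given policy."""
    if policy == "NeverAdmit":
        return 1  # every client read counts
    # AdmitOnWrite / AdmitOnMiss / AdmitOnSecondMiss:
    if time_since_last_access is not None and time_since_last_access <= threshold_d:
        return 0  # assumed flash hit by the LRU approximation
    if in_buffer_cache:
        return 0  # absorbed by the D buffer cache (per the simulator)
    return 1      # otherwise, one HDD read

# Hypothetical example: a read 600 s after the last access, with D = 300 s.
print(disk_read_cost("AdmitOnMiss", time_since_last_access=600, threshold_d=300, in_buffer_cache=False))  # -> 1
```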

      Solving the Knapsack

As noted earlier, one difference between the regular knapsack and CacheSack is that four fractions need to be computed for each item (i.e., category). This makes the optimization a little more involved.

The mathematical derivation is provided in Appendix A. It is not entirely trivial, but I will walk through some of the trickier bits.

      Formal Cost Representation

      Essentially, we have the following cost representation:


      Here D is the modeled threshold described previously.

The total cost (VCP) for category C and policy P is calculated as the sum of the disk read cost (SCP) and the bytes-written cost (WCP).

Let αCP be the proportion of blocks in category C admitted under policy P. The total cost is then the weighted sum of the cost for each category and policy, and it is minimized via linear programming:

This cost minimization must happen under constraints that are fairly easy to understand: the second constraint ensures the fractions for each category sum to one, while the third limits the total cache usage to the cache size.
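
Based on the description above, the linear program can be written roughly as follows (this is my reconstruction; the paper's exact formulation is in Appendix A):

```latex
% Reconstructed linear program (my notation, following the blog's alpha, V, U, N):
\begin{aligned}
\min_{\alpha}\quad & \sum_{C}\sum_{P} \alpha_{CP}\, V_{CP} \\
\text{s.t.}\quad   & \alpha_{CP} \ge 0 \quad \text{for all } C, P, \\
                   & \sum_{P} \alpha_{CP} = 1 \quad \text{for each category } C, \\
                   & \sum_{C}\sum_{P} \alpha_{CP}\, U_{CP} \le N.
\end{aligned}
```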

Now, we need to define S and W for each category and policy. They are trivial for AdmitOnWrite and NeverAdmit, so let's walk through the cases of AdmitOnMiss and AdmitOnSecondMiss.

      Under the AdmitOnMiss policy:

• A disk read (S) occurs when the time since the last access exceeds the threshold D and the block is also missing from the buffer cache (B).
• The cache-byte-time usage (U) is the block's byte size multiplied by its residency time. The residency time is the total time the block spends in the cache, i.e., a sum over all accesses i of the minimum of di (the gap between consecutive access times ti and ti-1) and D.
• Lastly, the write cost (W) is the block's byte size multiplied by the number of times the block is (re)admitted, which happens whenever d, the time since the last access, exceeds D.

      These are then summed over all the blocks to get the corresponding value for a policy.

      Under the AdmitOnSecondMiss policy:
• A block will be in the cache if two consecutive accesses fall within the threshold D (i.e., di-1 <= D and di <= D). Otherwise, when max(di, di-1) > D and the block is not in the buffer cache (B), a disk read is required.

• For U, the residency contribution is di if the access was a cache hit, or D (the maximum) if the block was inserted at ti-1.
      • I will leave reasoning about W as an exercise to the readers.
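
To make the AdmitOnMiss case concrete, here is one way to write the per-block terms, based purely on the prose above (the indicator notation is mine; the paper's exact formulas live in Appendix A):

```latex
% Per-block terms under AdmitOnMiss, with access times t_1 < t_2 < \dots,
% gaps d_i = t_i - t_{i-1}, block byte size "size", and B = hit in the D buffer cache.
\begin{aligned}
S &= \sum_{i} \mathbf{1}\{d_i > D\} \cdot \mathbf{1}\{\text{not } B\} && \text{(HDD reads)} \\
U &= \text{size} \cdot \sum_{i} \min(d_i, D)                          && \text{(cache-byte-time usage)} \\
W &= \text{size} \cdot \sum_{i} \mathbf{1}\{d_i > D\}                 && \text{(bytes written on re-admission)}
\end{aligned}
```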

      Solving the Linear Program using Convex Hull

      The linear program is solved using a convex hull approach, which can be broken down into the following steps:

      • For each category, we calculate UP (usage) and VP (cost) for each policy P.
      • These values are then plotted as points (UP, VP) on a graph, with each category generating its own plot. Since there are four policies, each category's plot will contain four points.
      • For each category's plot, we construct the lower convex hull. This hull represents the set of efficient policy combinations for that category, minimizing cost for a given usage. We identify the lines (segments) that make up the lower convex hulls of all categories.
• These segments are then sorted by slope, steepest (most negative) first. The magnitude of a segment's slope, −ΔVP/ΔUP, is the cost-to-usage ratio, i.e., the cost reduction obtained per unit of additional usage.
      • We greedily assign fractions to the points on these lines, starting with the line having the steepest negative slope, until the total usage constraint (∑UCP) is met. Essentially, at each step, we are choosing the category and policy combination that offers the greatest cost reduction for each unit of usage.
      Example:
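
Here is a hedged Python sketch of these steps, assuming each category supplies four (usage, cost) points, one per policy; it illustrates the idea and is not the paper's implementation:

```python
def _cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def lower_hull(points):
    """Lower convex hull of (usage, cost) points, left to right (monotone chain)."""
    hull = []
    for p in sorted(points):
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull

def solve_cachesack(category_points, cache_size):
    """Greedy allocation over lower-hull segments.

    `category_points` maps a category name to its four (usage, cost) policy points.
    Returns the cache usage assigned to each category.
    """
    segments = []    # (slope, category, segment length in usage)
    assignment = {}  # category -> assigned usage
    budget = cache_size

    for cat, pts in category_points.items():
        hull = lower_hull(pts)
        assignment[cat] = hull[0][0]   # start every category at its lowest-usage hull point
        budget -= hull[0][0]
        for (u0, v0), (u1, v1) in zip(hull, hull[1:]):
            if u1 <= u0:
                continue                    # skip degenerate segments
            slope = (v1 - v0) / (u1 - u0)   # cost change per unit of extra usage
            if slope < 0:                   # only segments that reduce cost are worth buying
                segments.append((slope, cat, u1 - u0))

    segments.sort(key=lambda s: s[0])       # steepest cost reduction per unit of usage first
    for slope, cat, length in segments:
        if budget <= 0:
            break
        take = min(length, budget)          # the final segment may be taken fractionally
        assignment[cat] += take
        budget -= take
    return assignment

# Hypothetical example: two categories, each with four (usage, cost) policy points.
print(solve_cachesack({
    "bigtable": [(0, 10), (2, 6), (5, 3), (9, 2)],
    "spanner":  [(0, 8),  (1, 7), (4, 2), (6, 1)],
}, cache_size=8))  # -> {'bigtable': 4, 'spanner': 4}
```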

      Online Learning

      The above knapsack formulation is solved for all possible values of D. The one that obtains the minimum cost is chosen and the assignment corresponding to it is the final assignment.
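
A one-line sketch of this outer loop, assuming a hypothetical `solve_for_d` that returns a (total_cost, assignment) pair for a given threshold:

```python
def best_assignment(candidate_ds, solve_for_d):
    """Solve the allocation for every candidate threshold D and keep the cheapest result."""
    return min((solve_for_d(d) for d in candidate_ds), key=lambda result: result[0])
```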

      Evaluation

      It is worth noting that users are not given the ability to define their own space allocation for categories. This control is entirely managed by the flash cache, meaning that the allocation of cache space is determined solely by the system's automation. This lack of user-configurable tuning actually serves to incentivize users to adopt the cache more readily, as they are relieved of the burden of manual optimization.

The implementation of CacheSack has demonstrated a reduction in overall TCO. In production, CacheSack has consistently achieved lower cost than statically assigned policies, indicating its effectiveness at resource management.

      Simulation results provide further insight into CacheSack's adaptive behavior. Specifically:

      • For smaller cache sizes, the simulations suggest that AdmitOnSecondMiss should receive a larger portion of the available space. This strategy is good for minimizing the impact of transient read blocks within the cache. 
      • Conversely, for larger cache sizes, the simulations indicate that AdmitOnMiss and AdmitOnWrite should be prioritized.
      CacheSack's learning algorithm effectively navigates these varying scenarios by dynamically adjusting allocations based on the cost of each operation, thereby maintaining a high hit ratio across different cache configurations.

      Paper Review

      My interest in this paper stemmed from observing high spindle utilization in a Colossus-related project, prompting me to investigate caching inefficiencies. This led me to discover and explore this paper. It offers valuable insights into SSD caching, but its niche focus makes it relevant primarily to those directly working in this area.

      It's a fascinating read for anyone dealing with SSD caching, but its specialized nature might not appeal to a broader audience.
