This paper from Google was presented at EuroSys 2020. It is heavy on statistics, but it covers one of the most important concepts in cluster/cloud computing - scaling - and those concepts are well worth exploring in system design.
Recommended Read: Paper Insights - Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center where I introduced cluster computing and resource allocation.
Borg
Borg is a cluster orchestrator developed by Google for managing its clusters. The design of Kubernetes was significantly influenced by Borg.
Jobs and Tasks
A Borg cluster consists of roughly 100 to 10,000 physical machines connected through a high-speed network fabric. Users submit jobs for execution on these machines, and these jobs are categorized into several types:
- Services: These are long-duration jobs that frequently constitute components of larger systems, such as those employing a microservices architecture. A service is generally replicated, and an individual instance is called a task.
- Batch: These jobs are typically composed of short-lived data-processing tasks.
A task executes as a containerized process on the cluster machines.
Given Google's extensive infrastructure for internet-scale services, most jobs running within Borg are services. Services occupy the majority of machine capacity, with batch jobs utilizing the remaining available capacity.
Scaling
Every job running in a cluster/cloud requires resources. The most critical of these are:
- CPU
- RAM
Note that with the rise of data-intensive and compute-intensive applications (such as ML training), memory bandwidth is also becoming a crucial resource. Similarly, network bandwidth is critical for many network-intensive applications. However, this paper focuses solely on CPU and RAM, assuming an abundance of network and memory bandwidth. These assumptions will likely evolve with changes in the computing landscape.
When users submit jobs to a cluster/cloud, they need to specify resource requirements. However, these specifications are often inaccurate, either overestimating or underestimating the actual needs of the job. For instance, if a service experiences a 5x increase in user requests compared to the expected volume, its resources will need to scale up. Conversely, if the service receives 10x fewer requests than anticipated, resources should be scaled down.
Types
Resource scaling for a job can be categorized into two primary types:
- Vertical Scaling: Increasing or decreasing the resources allocated to each task.
- Horizontal Scaling: Adding or removing tasks.
Which to prefer?
Numerous online articles discuss the optimal scenarios for each approach. However, based on my experience with large-scale distributed systems, I can confidently state that the decision is context-dependent. It hinges on numerous factors, and there is no universally ideal solution.
Cluster orchestrators often favor vertical scaling as it tends to keep the number of tasks lower, which can expedite the computation of optimal resource assignments. On the other hand, I've encountered applications that had to scale out horizontally because the NICs on individual machines couldn't handle the required network capacity.
Therefore, my answer to the question of preference is: experiment to determine what yields the best results for your specific use case.
Limits & Oversubscription
Even with resource scaling, establishing limits is crucial. For example, users should define upper bounds for the RAM and CPU that each task can consume. A user might set a task limit to <2 CPU, 4 GB RAM>.
These limits are often based on the worst-case observed scenario, which may not accurately reflect a task's actual, typical resource needs. The average resource utilization can be significantly lower than the defined limits. Many tasks exhibit diurnal usage patterns, with high usage during specific hours of a day and low usage at other times.
To address this discrepancy, cluster/cloud providers employ oversubscription on machines. Consider a machine with a capacity of 256 GB RAM and 64 CPU, hosting tasks with the following limits:
- 20 CPU, 100 GB RAM
- 40 CPU, 50 GB RAM
- 40 CPU, 100 GB RAM
All these tasks can be scheduled on the same machine, with CPU resources being oversubscribed under the assumption that not all tasks will simultaneously utilize their maximum allocated CPU.
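To make the arithmetic concrete, here is a minimal Python sketch that checks how far these limits oversubscribe the example machine:

```
# Minimal sketch of the oversubscription arithmetic from the example above.
machine = {"cpu": 64, "ram_gb": 256}

task_limits = [
    {"cpu": 20, "ram_gb": 100},
    {"cpu": 40, "ram_gb": 50},
    {"cpu": 40, "ram_gb": 100},
]

for resource, capacity in machine.items():
    total_limit = sum(t[resource] for t in task_limits)
    ratio = total_limit / capacity
    state = "oversubscribed" if ratio > 1 else "fits within capacity"
    print(f"{resource}: sum of limits = {total_limit}, capacity = {capacity}, "
          f"ratio = {ratio:.2f} ({state})")

# Output:
#   cpu: sum of limits = 100, capacity = 64, ratio = 1.56 (oversubscribed)
#   ram_gb: sum of limits = 250, capacity = 256, ratio = 0.98 (fits within capacity)
```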
Oversubscription is vital; without it, machines often remain underutilized.
Furthermore, these limits are often soft limits. For example, if the first task uses 50 CPU while other tasks are idle, this is permissible (soft limiting). However, soft limiting transitions to hard limiting when there's resource contention. If other tasks begin to use their allocated 40 CPU, the first task will experience preemption to prevent it from exceeding its 20 CPU limit. The presence of priorities further complicates this dynamic.
Enforceable vs. Non-enforceable Resources
CPU is an enforceable resource. If a task exceeds its hard CPU limit, it will be descheduled to enforce adherence to the limit. In contrast, RAM limits are non-enforceable. Cluster machines typically have limited swap space (due to diskless architecture choices), so if a task surpasses a certain hard RAM limit, it must be Out-Of-Memory (OOM) terminated.
Chargeback Model
Users consuming resources on a cluster/cloud are typically charged for their usage, as operating the infrastructure isn't free. There are real costs associated with acquiring machines, setting up networks, and maintaining real estate, among other expenses.
A straightforward chargeback model could involve charging users based on the resource limits they set. However, this approach might not lead to competitive pricing. Charging users based on their actual resource usage might seem more equitable. Yet, even usage-based charging presents challenges, as a user's average usage can be significantly lower than their peak demand. For example, a user might average 1 GB of RAM usage but occasionally spike to 100 GB. Ultimately, cluster/cloud providers must provision for these peak demands. Therefore, charging solely on average usage would be unfavorable for service providers.
A more effective approach could be to charge users based on their usage plus a portion of the oversubscription cost, distributed proportionally among the tasks.
In the example above, if the actual average CPU usages are 10, 20, and 25 CPU, then:
- Base charge for user 1: 10
- Base charge for user 2: 20
- Base charge for user 3: 25
- Slack: 64 - (10 + 20 + 25) = 9
- Oversubscription tax for user 1: (20 / (20 + 40 + 40)) * 9 = 1.8
- Oversubscription tax for user 2: (40 / (20 + 40 + 40)) * 9 = 3.6
- Oversubscription tax for user 3: (40 / (20 + 40 + 40)) * 9 = 3.6
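A small Python sketch of this usage-plus-tax model, reproducing the numbers above (the slack is split in proportion to each task's CPU limit):

```
# Hedged sketch of the usage + proportional oversubscription-tax chargeback model.
machine_cpu = 64
tasks = [
    {"user": 1, "cpu_limit": 20, "avg_cpu_usage": 10},
    {"user": 2, "cpu_limit": 40, "avg_cpu_usage": 20},
    {"user": 3, "cpu_limit": 40, "avg_cpu_usage": 25},
]

slack = machine_cpu - sum(t["avg_cpu_usage"] for t in tasks)   # 64 - 55 = 9
total_limit = sum(t["cpu_limit"] for t in tasks)               # 100

for t in tasks:
    tax = slack * t["cpu_limit"] / total_limit
    print(f"user {t['user']}: base = {t['avg_cpu_usage']}, "
          f"oversubscription tax = {tax:.1f}, total = {t['avg_cpu_usage'] + tax:.1f}")

# user 1: base = 10, oversubscription tax = 1.8, total = 11.8
# user 2: base = 20, oversubscription tax = 3.6, total = 23.6
# user 3: base = 25, oversubscription tax = 3.6, total = 28.6
```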
Autopilot
Autopilot is designed to automate the scaling of resources for jobs running on Borg.
The primary goal of Autopilot is to automate two key aspects of resources for a job:
- Setting resource limits for individual tasks (vertical autoscaling) and
- Adjusting the number of running tasks (horizontal autoscaling).
Architecture
Autopilot's architecture is based on a triple closed-loop control system. One loop manages horizontal scaling, while the other two independently control per-task CPU and memory resources.
Each job is treated as an isolated entity, meaning there is no learning or information sharing across different jobs.
The data flow within this closed-loop system is as follows:
- Resource usage data of a task is continuously logged and exported in the form of time series. For instance, the CPU utilization of a specific task over time.
- The resource usage time series for all tasks of a job are fed into recommender systems. These recommenders analyze the data and generate suggested job resource limits, which are then passed to the Autopilot service.
- The Autopilot service takes these recommended job limits and communicates them to an actuator component. The actuator then translates these job limits into task limits and task count, and sends them to the Borgmaster.
- Finally, the Borgmaster processes these updates and takes appropriate action, such as adjusting the resource allocation of the affected tasks or adjusting the number of tasks.
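A rough Python sketch of one iteration of this closed loop; the class and method names (UsageStore, recommenders, actuator, Borgmaster wrappers) are placeholders of mine, not Autopilot's actual components:

```
# Hypothetical names throughout; this only mirrors the data flow described above.
class AutopilotLoop:
    def __init__(self, usage_store, recommenders, actuator, borgmaster):
        self.usage_store = usage_store      # exported per-task usage time series
        self.recommenders = recommenders    # moving-window and/or ML recommenders
        self.actuator = actuator            # turns job limits into task limits/counts
        self.borgmaster = borgmaster        # applies the changes to the cluster

    def run_once(self, job):
        # 1. Read the per-task resource usage time series for the job.
        series = self.usage_store.read(job)
        # 2. Each recommender proposes job-level resource limits.
        recommendations = [r.recommend(series) for r in self.recommenders]
        # 3. The Autopilot service picks a recommendation and hands it to the
        #    actuator, which translates it into per-task limits and a task count.
        plan = self.actuator.plan(job, recommendations)
        # 4. The Borgmaster applies the plan (resizes tasks, adds/removes tasks).
        self.borgmaster.apply(plan)
```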
Per-task Vertical Autoscaling
Autopilot offers two classes of recommenders for setting per-task limits:
- Moving Window Recommenders
- Machine Learning (ML) Recommenders
Both operate on the same input signal. Let:
- r_i[τ] represent the resource usage value exported by task i at time τ.
- s_i[t] represent the distribution of all data points of r_i within the time window [t - 5 minutes, t]. This distribution is essentially a histogram composed of discrete buckets.
Let's illustrate the mathematical formulation with an example. Consider a job comprising two tasks.
The first task exhibits the following CPU usage over a 5-minute window:
- 10 CPU for 2.5 minutes (150 seconds)
- 20 CPU for 2.5 minutes (150 seconds)
The second task shows the following CPU usage over the same 5-minute window:
- 20 CPU for 2.5 minutes (150 seconds)
- 30 CPU for 2.5 minutes (150 seconds)
Assume the CPU usage bucket boundaries are defined as {0, 10, 20, 30, 40}, resulting in the following buckets in a distribution: [0, 10), [10, 20), [20, 30), [30, 40). Then, weighting each bucket by the fraction of the window spent in it:
- s_1[t] = {[0, 10): 0, [10, 20): 0.5, [20, 30): 0.5, [30, 40): 0}
- s_2[t] = {[0, 10): 0, [10, 20): 0, [20, 30): 0.5, [30, 40): 0.5}
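A small Python sketch that reproduces these two distributions (I assume one usage sample per second, so 150 samples per usage level):

```
import numpy as np

# One CPU-usage sample per second over the 5-minute window (an assumption for
# illustration): task 1 runs at 10 CPU for 150 s, then 20 CPU for 150 s, etc.
r1 = np.array([10] * 150 + [20] * 150)
r2 = np.array([20] * 150 + [30] * 150)

bucket_boundaries = [0, 10, 20, 30, 40]
bucket_labels = ["[0,10)", "[10,20)", "[20,30)", "[30,40)"]

for name, r in [("s_1[t]", r1), ("s_2[t]", r2)]:
    counts, _ = np.histogram(r, bins=bucket_boundaries)
    fractions = (counts / counts.sum()).tolist()
    print(name, dict(zip(bucket_labels, fractions)))

# s_1[t] {'[0,10)': 0.0, '[10,20)': 0.5, '[20,30)': 0.5, '[30,40)': 0.0}
# s_2[t] {'[0,10)': 0.0, '[10,20)': 0.0, '[20,30)': 0.5, '[30,40)': 0.5}
```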
Moving Window Recommenders
Peak
Weighted Average
jth Percentile of Adjusted Usage
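In the paper, these statistics are computed over the recent per-window distributions with exponentially decaying weights, so newer windows count more: the peak is the largest recent usage (a conservative choice for memory of OOM-sensitive jobs), the weighted average smooths CPU usage, and the jth percentile is taken over a histogram adjusted by both time decay and load. Below is a Python sketch of my reading of these three statistics; the half-life, bucket handling, and input format are simplifications of mine, not the paper's exact formulation:

```
import numpy as np

def moving_window_stats(windows, bucket_values, half_life_windows=12, j=95):
    """Sketch of the three moving-window statistics (simplified).

    windows           : list of histograms, oldest first; windows[k][b] is the
                        number of samples in bucket b during 5-minute window k.
    bucket_values     : representative usage value of each bucket.
    half_life_windows : exponential-decay half life, in windows (assumed value).
    j                 : percentile for the load-adjusted statistic.
    """
    windows = np.asarray(windows, dtype=float)
    bucket_values = np.asarray(bucket_values, dtype=float)
    ages = np.arange(len(windows))[::-1]          # 0 for the newest window
    w = 0.5 ** (ages / half_life_windows)         # exponentially decaying weights

    # Peak: the largest bucket value seen in any window of the lookback.
    peak = bucket_values[(windows > 0).any(axis=0)].max()

    # Weighted average: decay-weighted mean of each window's mean usage.
    window_means = windows @ bucket_values / windows.sum(axis=1)
    weighted_avg = (w * window_means).sum() / w.sum()

    # jth percentile of the time- and load-adjusted histogram: each bucket is
    # weighted by its usage value as well as the time decay of its window.
    adjusted = (w[:, None] * windows * bucket_values).sum(axis=0)
    cdf = np.cumsum(adjusted) / adjusted.sum()
    percentile = bucket_values[np.searchsorted(cdf, j / 100.0)]

    return peak, weighted_avg, percentile
```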
ML Recommenders
Step 1: Calculate Overrun Cost
Step 2: Calculate Underrun Cost
Step 3: Determine the Best Limit for Each Model
Step 4: Apply Safety Margin
Step 5: Calculate the Cost of Each Model's Recommendation
Step 6: Choose the Best Model and Set the Limit
Model and Hyperparameters
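My understanding of the ML recommender, in outline: each model is a pair of hyperparameters (a decay rate and a safety margin); for every candidate limit, the model accumulates decay-weighted overrun and underrun costs plus a penalty for moving the limit, picks the cheapest limit, and adds its safety margin; an arbiter then selects the best-performing model. The Python sketch below compresses steps 1-6 under assumptions of mine: the cost weights, the candidate-limit grid, and the simplified arbiter (which just compares instantaneous costs rather than historical model performance) are placeholders, not the paper's tuned values.

```
import numpy as np

def ml_recommend(usage_history, candidate_limits, previous_limit, models,
                 w_overrun=10.0, w_underrun=1.0, w_delta=0.1):
    """Sketch of the ML recommender's steps 1-6 under simplified assumptions.

    usage_history    : per-window usage samples (newest last).
    candidate_limits : grid of limits to score.
    previous_limit   : the currently applied limit (changing it has a cost).
    models           : list of (decay_rate, safety_margin) hyperparameter pairs.
    The cost weights (w_overrun, w_underrun, w_delta) are illustrative only.
    """
    usage = np.asarray(usage_history, dtype=float)
    limits = np.asarray(candidate_limits, dtype=float)
    ages = np.arange(len(usage))[::-1]             # 0 for the newest sample

    best_limit, best_cost = None, np.inf
    for decay, margin in models:
        w = decay ** ages                          # per-model time decay

        # Steps 1 & 2: overrun cost (usage above the limit) and underrun cost
        # (limit above the usage), decay-weighted over the history.
        overrun = np.clip(usage[None, :] - limits[:, None], 0, None)
        underrun = np.clip(limits[:, None] - usage[None, :], 0, None)
        cost = (w * (w_overrun * overrun + w_underrun * underrun)).sum(axis=1)

        # Penalty for moving the limit away from the currently applied one.
        cost += w_delta * np.abs(limits - previous_limit)

        # Steps 3 & 4: the model's best limit, plus its safety margin.
        model_limit = limits[np.argmin(cost)] * (1 + margin)

        # Steps 5 & 6 (simplified): keep the model whose best limit is cheapest
        # right now; the paper instead scores each model's past recommendations.
        if cost.min() < best_cost:
            best_limit, best_cost = model_limit, cost.min()

    return best_limit
```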
Horizontal Autoscaling
Horizontal autoscaling in Autopilot is triggered by the following metrics:
- CPU Utilization: Similar to vertical autoscaling, users can define a lookback window (T). Within this window, a chosen percentile or the maximum CPU usage is considered the required usage. Users also specify their desired average utilization. Based on the ratio of the required usage to the average utilization per task, Autopilot determines the necessary number of tasks for the job (a sketch of this computation follows the list).
- Target Size (User-Defined Function): Alternatively, users can provide their own function to calculate the desired number of tasks. This function can take various input metrics, such as QPS, into account.
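For the CPU-utilization trigger, the computation reduces to dividing the required usage by the CPU each task should carry at the desired utilization. A minimal Python sketch, with hypothetical parameter names and an assumed per-task limit:

```
import math

def desired_task_count(cpu_usage_window, per_task_cpu_limit,
                       target_avg_utilization=0.6, percentile=95,
                       min_tasks=1, max_tasks=1000):
    """Sketch of CPU-driven horizontal autoscaling.

    cpu_usage_window       : job-level CPU usage samples over the lookback window T.
    per_task_cpu_limit     : CPU limit of a single task.
    target_avg_utilization : desired average utilization of each task's limit.
    """
    samples = sorted(cpu_usage_window)
    # Required usage: a chosen percentile (or the max) over the lookback window.
    idx = min(len(samples) - 1, int(len(samples) * percentile / 100))
    required_cpu = samples[idx]

    # Each task should run at roughly target_avg_utilization of its limit.
    cpu_per_task = per_task_cpu_limit * target_avg_utilization
    tasks = math.ceil(required_cpu / cpu_per_task)
    return max(min_tasks, min(max_tasks, tasks))

# Example: a p95 usage of 90 CPU, with 2-CPU tasks targeted at 60% utilization,
# yields ceil(90 / 1.2) = 75 tasks.
```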
Strategies
It's important to note that horizontal scaling carries inherent costs. Allocating new machines for additional tasks takes time. Furthermore, during the scaling process, fluctuations in the triggering metrics (like CPU utilization or QPS) can occur, potentially leading to instability and further unnecessary scaling actions. To mitigate these issues, Autopilot employs several strategies:
- Deferred Downscaling: Upscaling (adding tasks) is performed immediately when needed. However, downscaling (removing tasks) is delayed for a specific period. Once new tasks are started, they remain operational for a stabilization period. This is beneficial because a temporary load spike might recur shortly after it subsides, in which case the scaled-up capacity would already be in place. The trade-off is potentially higher resource consumption during the deferral period.
- Slow Decay: During downscaling, the number of tasks is reduced gradually rather than abruptly. This helps to avoid sudden drops in capacity and potential disruptions.
- Defer Small Changes: Horizontal scaling can have disruptive side effects on certain types of applications. For example, a distributed cache using consistent hashing to distribute data might experience significant re-sharding and temporary availability impacts even with the addition or removal of a single task. Therefore, small scaling changes are often deferred to avoid these disruptions.
- Limited Growth: Even during upscaling events, the maximum rate of task growth is limited. This prevents the rapid creation of a large number of new tasks, which could themselves become unavailable during their startup phase and exacerbate instability.
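These strategies compose into a clamp on the raw task-count recommendation. A minimal Python sketch with placeholder values for the stabilization period, growth cap, and decay step (none of these constants come from the paper):

```
import time

class ScalingSmoother:
    """Sketch of the stabilization strategies applied to a raw task-count target."""

    def __init__(self, downscale_delay_s=600, max_growth_per_step=10,
                 min_change=2, decay_step=1):
        self.downscale_delay_s = downscale_delay_s       # deferred downscaling
        self.max_growth_per_step = max_growth_per_step   # limited growth
        self.min_change = min_change                     # defer small changes
        self.decay_step = decay_step                     # slow decay
        self.last_upscale_ts = 0.0

    def next_task_count(self, current, target, now=None):
        now = time.time() if now is None else now

        # Defer small changes: ignore adjustments below the threshold.
        if abs(target - current) < self.min_change:
            return current

        if target > current:
            # Upscale immediately, but cap the growth rate (limited growth).
            self.last_upscale_ts = now
            return min(target, current + self.max_growth_per_step)

        # Deferred downscaling: hold capacity for a stabilization period.
        if now - self.last_upscale_ts < self.downscale_delay_s:
            return current

        # Slow decay: shrink gradually rather than jumping to the target.
        return max(target, current - self.decay_step)
```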
Evaluation
The study evaluates the system's performance directly using real-world workloads. To track this, the authors employ the following metrics:
- Footprint: The sum of average task limits for a job, normalized to a number of machines (number of machines = RAM bytes / memory per machine).
- Relative slack: Defined as (limit - usage) / limit.
- Absolute slack: The difference between limit and usage, normalized to the number of machines.
- Relative OOM rate: The ratio of the number of OOM events to the total number of tasks.
- Number of job-days without OOMs.
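For reference, the slack metrics are simple differences and ratios; a quick Python sketch (the per-machine memory constant used for normalization is an assumed value):

```
MEMORY_PER_MACHINE_GB = 256  # assumed normalization constant

def relative_slack(limit, usage):
    # Fraction of the limit that goes unused.
    return (limit - usage) / limit

def absolute_slack_machines(limit_gb, usage_gb):
    # Unused capacity expressed in (fractional) machines.
    return (limit_gb - usage_gb) / MEMORY_PER_MACHINE_GB

# Example: a job with a 512 GB aggregate limit using 384 GB has
# relative slack of 0.25 and absolute slack of 0.5 machines.
```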
The results show:
- Non-autopiloted jobs exhibit a relative slack ranging from 46% to 60%, while autopiloted jobs show a tighter range of 23% to 31%, with ML recommenders demonstrating superior performance.
- Autopilot significantly reduces absolute slack to approximately 500 machines compared to 12,000 machines for non-autopiloted jobs, resulting in estimated savings of tens of millions of USD for Google. Figure 3(c) shows that Autopilot is indeed responsible for the smaller footprints observed in autopiloted jobs (rather than sampling errors).
- Migrating 500 previously non-autopiloted jobs led to substantial performance improvements, saving 1708 machines over a four-month period.
- OOM events are infrequent in autopiloted jobs, with 99.5% experiencing no OOMs. Figure 5 is somewhat difficult to interpret. The y-axis is the percentage of days when no OOMs were observed. It indicates that all algorithms achieved zero OOMs on at least 97% of days, with the moving window approach exhibiting the fewest days with OOMs.
- Figure 6 reveals an inverse relationship between memory slack and relative OOM rate. Moving window recommenders excel at managing OOMs while maintaining low slack, whereas ML recommenders tend to reduce limits too aggressively, leading to more OOMs.
- Recommendations demonstrate stability, with no changes occurring on 70% of job-days and only 6 to 7 changes observed at the 99th percentile (Figure 7).
- Long-running jobs exhibit lower slack compared to new jobs, suggesting that Autopilot cautiously adjusts limits for new jobs to mitigate the risk of OOMs.