This paper from Google was presented at EuroSys 2020. It is heavy on statistics, but it covers one of the most important concepts in cloud computing - autoscaling - which makes it well worth exploring for system design.
Recommended Read: Paper Insights - Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center where I introduced cluster computing and resource allocation.
Borg
Borg, a cluster manager developed at Google, is a prominent example of a cluster orchestrator. Notably, the design of Kubernetes was significantly influenced by Borg.
Jobs and Tasks
A Borg cluster consists of roughly 100 to 10,000 physical machines connected through a high-speed network fabric. Users submit jobs for execution on these machines, and these jobs are categorized into several types:
- Services: These are long-duration jobs that frequently constitute components of larger systems, such as those employing a microservices architecture.
- Batch: These jobs are typically short-duration data-processing tasks.
Jobs are generally replicated, and their individual instances are termed tasks. A task executes as a containerized process on the cluster machines.
Given Google's extensive infrastructure for internet-scale services, most jobs running within Borg are services. Services occupy the majority of machine capacity, with batch jobs utilizing the remaining available capacity.
Scaling
Every job running in a cluster requires resources. The most critical of these are:
- CPU
- RAM
It's important to note that with the rise of data-intensive and compute-intensive applications (such as ML training), memory bandwidth is also becoming a crucial resource. Similarly, network bandwidth is critical for many network-intensive applications. However, this paper focuses solely on CPU and RAM, assuming an abundance of network and memory bandwidth. These assumptions will likely evolve with changes in the computing landscape.
When users submit jobs to clusters, they need to specify resource requirements. However, these specifications can often be inaccurate, either overestimating or underestimating the actual needs of the job. For instance, if a service experiences a 5x increase in user requests compared to the expected volume, its resources will need to scale up. Conversely, if the service receives 10x fewer requests than anticipated, resources should be scaled down.
Types
Resource scaling can be categorized into two primary types:
- Vertical Scaling: Increasing or decreasing the resources allocated to each task.
- Horizontal Scaling: Adding or removing tasks.
Which to prefer?
Numerous online articles discuss the optimal scenarios for each approach. However, based on my experience with large-scale distributed systems, I can confidently state that the decision is context-dependent. It hinges on numerous factors, and there is no universally ideal solution.
Cluster orchestrators often favor vertical scaling as it tends to keep the number of tasks lower, which can expedite the computation of optimal resource assignments. On the other hand, I've encountered applications that had to scale out horizontally because the NICs on individual machines couldn't handle the required network capacity.
Therefore, my answer to the question of preference is: experiment to determine what yields the best results for your specific use case.
Limits
Even with resource scaling, establishing limits is crucial. For example, users should define upper bounds for the RAM and CPU that each task can consume. A user might set a task limit to <2 CPU, 4 GB RAM>.
These limits are often based on the worst-case observed scenario, which may not accurately reflect a task's actual, typical resource needs. The average resource utilization can be significantly lower than the defined limits. Many tasks exhibit diurnal usage patterns, with high usage during specific hours and low usage at other times.
To address this discrepancy, cloud computing providers employ oversubscription on machines. Consider a machine with 64 CPUs and 256 GB RAM, hosting tasks with the following limits:
- 20 CPU, 100 GB RAM
- 40 CPU, 50 GB RAM
- 40 CPU, 100 GB RAM
All these tasks can be scheduled on the same machine. The RAM limits sum to 250 GB, which fits within the 256 GB capacity, but the CPU limits sum to 100, exceeding the machine's 64 CPUs. CPU is therefore oversubscribed, under the assumption that not all tasks will simultaneously utilize their maximum allocated CPU.
Oversubscription is vital; without it, machines often remain underutilized.
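To make the packing decision concrete, here is a minimal sketch of an admission check under oversubscription. The 1.6x CPU oversubscription ratio and the `TaskLimits`/`fits` names are illustrative assumptions of mine, not Borg parameters:

```python
# A minimal sketch of oversubscribed scheduling: RAM limits must fit
# within physical capacity, while CPU limits may exceed capacity up to a
# configurable ratio. The 1.6x ratio is illustrative, not from the paper.

from dataclasses import dataclass

@dataclass
class TaskLimits:
    cpu: float      # CPU limit (cores)
    ram_gb: float   # RAM limit (GB)

MACHINE_CPU = 64
MACHINE_RAM_GB = 256
CPU_OVERSUB_RATIO = 1.6  # allow CPU limits up to 1.6x physical capacity

def fits(tasks: list[TaskLimits]) -> bool:
    total_cpu = sum(t.cpu for t in tasks)
    total_ram = sum(t.ram_gb for t in tasks)
    # RAM is non-compressible: never oversubscribe it here.
    if total_ram > MACHINE_RAM_GB:
        return False
    # CPU is compressible: tolerate limits above physical capacity.
    return total_cpu <= MACHINE_CPU * CPU_OVERSUB_RATIO

tasks = [TaskLimits(20, 100), TaskLimits(40, 50), TaskLimits(40, 100)]
print(fits(tasks))  # True: 250 GB <= 256 GB, and 100 CPU <= 64 * 1.6
```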
Furthermore, these limits are often soft limits. For example, if the first task uses 50 CPU while the other tasks are idle, this is permissible (soft limiting). However, soft limiting transitions to hard limiting when there's resource contention: if the other tasks begin to use their allocated 40 CPUs each, the first task will be throttled back down to its 20 CPU limit. The presence of priorities further complicates this dynamic.
Enforceable vs. Non-enforceable Resources
CPU is an enforceable resource. If a task exceeds its hard CPU limit, it will be descheduled to enforce adherence to the limit. In contrast, RAM limits are non-enforceable. Cluster machines typically have limited swap space (due to diskless architecture choices), so if a task surpasses a certain hard RAM limit, it must be Out-Of-Memory (OOM) terminated.
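As a hedged illustration of this distinction (not how Borg is actually implemented), consider how a container runtime might set these limits with Linux cgroup v2: `cpu.max` can throttle a runaway task, while `memory.max` can only cap memory by triggering the OOM killer. The cgroup path below is hypothetical:

```python
# Sketch of limit enforcement via Linux cgroup v2 (illustrative only).
# Exceeding cpu.max deschedules the task's threads until the next period;
# exceeding memory.max invokes the kernel OOM killer - already-allocated
# memory cannot be "throttled" away.

from pathlib import Path

CGROUP = Path("/sys/fs/cgroup/task42")  # hypothetical cgroup for one task

def apply_limits(cpus: float, ram_bytes: int) -> None:
    period_us = 100_000  # standard 100 ms CFS period
    quota_us = int(cpus * period_us)
    # CPU: enforceable - threads past the quota are descheduled.
    (CGROUP / "cpu.max").write_text(f"{quota_us} {period_us}")
    # RAM: non-enforceable - crossing this limit OOM-kills the task.
    (CGROUP / "memory.max").write_text(str(ram_bytes))

apply_limits(cpus=2, ram_bytes=4 * 2**30)  # the <2 CPU, 4 GB RAM> example
```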
Chargeback Model
Users consuming resources on a cloud platform are typically charged for their usage. Similarly, users consuming resources on a cluster need to be charged back, since operating the cluster isn't free: there are real costs associated with acquiring machines, setting up networks, and maintaining real estate, among other expenses.
A straightforward chargeback model could involve charging users based on the resource limits they set. However, this approach might not lead to competitive pricing. Charging users based on their actual resource usage might seem more equitable. Yet, even usage-based charging presents challenges, as a user's average usage can be significantly lower than their peak demand. For example, a user might average 1 GB of RAM usage but occasionally spike to 100 GB. Ultimately, cloud providers must provision for these peak demands. Therefore, charging solely on average usage would be unfavorable for service providers.
A more effective approach could be to charge users based on their usage plus a portion of the oversubscription cost, distributed among the tasks in proportion to their limits.
In the example above, if the actual average CPU usages are 10, 20, and 25, then:
- Base charge for user 1: 10
- Base charge for user 2: 20
- Base charge for user 3: 25
- Slack: 64 - (10 + 20 + 25) = 9
- Oversubscription tax for user 1: (20 / (20 + 40 + 40)) * 9 = 1.8
- Oversubscription tax for user 2: (40 / (20 + 40 + 40)) * 9 = 3.6
- Oversubscription tax for user 3: (40 / (20 + 40 + 40)) * 9 = 3.6
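A minimal sketch of this chargeback computation, assuming the tax is split in proportion to each task's CPU limit (the function and variable names are my own):

```python
# Chargeback sketch: each user pays their average usage plus a share of
# the machine's slack (unused capacity), proportional to their limit.
# This mirrors the worked example above; it is not Borg's billing code.

MACHINE_CPU = 64

def chargeback(usages: list[float], limits: list[float]) -> list[float]:
    slack = MACHINE_CPU - sum(usages)          # 64 - 55 = 9
    total_limit = sum(limits)                  # 20 + 40 + 40 = 100
    return [use + (lim / total_limit) * slack  # base charge + tax
            for use, lim in zip(usages, limits)]

print(chargeback(usages=[10, 20, 25], limits=[20, 40, 40]))
# [11.8, 23.6, 28.6]
```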
Autopilot
Autopilot is designed to automate the scaling of jobs running on Borg, Google's internal cluster management system.
The primary goal of Autopilot is to automate two key aspects of resource management: setting resource limits for individual tasks (vertical scaling) and adjusting the number of running tasks (horizontal scaling). Automating the setting of task limits is especially important because these limits directly determine the cost charged back to users. Therefore, optimizing this process can lead to substantial cost reductions for users.
Architecture
Autopilot's architecture is based on a triple closed-loop control system. One loop manages horizontal scaling, while the other two independently control per-task CPU and memory resources. Each job is treated as an isolated entity, meaning there is no learning or information sharing across different jobs.
The data flow within this closed-loop system is as follows:
- Resource usage data is continuously logged and exported in the form of time series. For instance, the CPU utilization of a specific task over time.
- These time series are fed into recommender systems. These recommenders analyze the data and generate suggested resource limits, which are then passed to the Autopilot service.
- The Autopilot service takes these recommended job limits and communicates them to an actuator component. The actuator then translates these job limits into specific task limits and sends them to the Borgmaster.
- Finally, the Borgmaster processes these updates and takes appropriate action, such as adjusting the resource allocation of the affected tasks.
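To make the data flow concrete, here is a toy sketch of a single loop iteration. Everything in it (the peak-based policy, the names) is an illustrative assumption; the paper does not specify component interfaces at this level:

```python
# A toy rendering of one iteration of Autopilot's closed loop. All names
# (PeakRecommender, control_loop_iteration) are invented for illustration.

class PeakRecommender:
    """Simplest possible policy: recommend the recent peak usage."""
    def recommend(self, usage_series: list[float]) -> float:
        return max(usage_series)

def control_loop_iteration(usage_series: list[float], num_tasks: int,
                           recommender: PeakRecommender) -> list[float]:
    # 1. The recommender turns the logged time series into a job limit.
    job_limit = recommender.recommend(usage_series)
    # 2. The actuator translates the job limit into per-task limits,
    #    which the Borgmaster then applies to the running tasks.
    return [job_limit] * num_tasks

print(control_loop_iteration([10.0, 20.0, 15.0], num_tasks=2,
                             recommender=PeakRecommender()))  # [20.0, 20.0]
```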
Per-task Vertical Autoscaling
Autopilot supports two classes of recommenders for setting per-task limits:
- Moving Window Recommenders
- Machine Learning (ML) Recommenders

Both consume the same input signal. Let:
- ri[τ] represent the resource usage value exported by task i at time τ.
- si[t] represent a distribution of all the data points of ri within the time window [t - 5 minutes, t]. This distribution is essentially a histogram composed of discrete buckets.
Let's illustrate the mathematical formulation with an example. Consider a job comprising two tasks.
The first task exhibits the following CPU usage over a 5-minute window:
- 10 CPU for 2.5 minutes (150 seconds)
- 20 CPU for 2.5 minutes (150 seconds)
The second task shows the following CPU usage over the same 5-minute window:
- 20 CPU for 2.5 minutes (150 seconds)
- 30 CPU for 2.5 minutes (150 seconds)
Assume the CPU usage bucket boundaries are defined as {0, 10, 20, 30, 40}, resulting in the following buckets in a distribution: [0, 10), [10, 20), [20, 30), [30, 40).
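With these buckets, each task's 5-minute distribution follows directly from the definitions: task 1 spends half the window at 10 CPU (bucket [10, 20)) and half at 20 CPU (bucket [20, 30)), so s1[t] assigns weight 0.5 to each of those buckets; similarly, s2[t] assigns weight 0.5 to [20, 30) and 0.5 to [30, 40). Here is a minimal sketch of this histogram construction (the 1 Hz sampling rate and helper names are assumptions of mine):

```python
# Build the per-task usage distribution si[t]: a histogram of usage
# samples over the last 5-minute window, following the definitions above.

import bisect

BOUNDARIES = [0, 10, 20, 30, 40]  # bucket boundaries from the example

def histogram(samples: list[float]) -> list[float]:
    """Return per-bucket fractions for buckets [0,10), [10,20), ..."""
    counts = [0] * (len(BOUNDARIES) - 1)
    for x in samples:
        # bisect_right - 1 maps 10 -> bucket [10, 20), 20 -> [20, 30), ...
        counts[bisect.bisect_right(BOUNDARIES, x) - 1] += 1
    return [c / len(samples) for c in counts]

# 5-minute window sampled once per second: 150 s at each usage level.
task1 = [10] * 150 + [20] * 150
task2 = [20] * 150 + [30] * 150

print(histogram(task1))  # [0.0, 0.5, 0.5, 0.0]
print(histogram(task2))  # [0.0, 0.0, 0.5, 0.5]
```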