This paper, published by Google and presented at OSDI 2010, is quite challenging due to its highly mathematical nature. I was fortunate to work with one of the co-authors, Florentina Popovici. She's truly the best.
Paper Link
Paper Structure (Re-write)
This paper covers two independent main topics:
- The first section examines the availability and Mean Time To Failure (MTTF) of individual hardware components like disks and nodes. It also delves into failure bursts and domain-related failures (such as rack failures), which are significant sources of correlated failures.
- The second, entirely independent section discusses how failures are handled and how their impact is modeled. It builds:
- Probabilities of stripe state transitions that account for all possible failure burst sizes, verified against a cell simulation.
- A Markov model that incorporates the calculated probabilities, where each failure event is independent but can impact multiple nodes and chunks, thus reflecting correlated failures.
- A simpler model with constant failure and recovery rates that treats chunk failures as independent, yielding less accurate results.
Part 1: Availability
In a typical data center cluster, thousands to tens of thousands of machines host millions of nodes, which are essentially processes running on these machines. Often, a group of these nodes is dedicated to a specific distributed system, like storage. For example, some nodes might be part of a distributed file system, storing large volumes of data.
This paper specifically examines the availability of distributed storage systems. Storage nodes in a distributed system can become unavailable for a variety of reasons, each with different implications for recovery time and data integrity.
- Node Restarts: These are the quickest to recover from, typically taking only tens of seconds. It's often just a matter of restarting a process.
- Planned Reboots: For maintenance, repairs, or upgrades, machines (running the node) may undergo planned reboots. These can take anywhere from several minutes to a few hours.
- Unplanned Reboots: Unexpected events like kernel crashes also lead to reboots. Similar to planned reboots, these can cause downtime ranging from several minutes to hours.
- Hardware Corruption: Data can become corrupted even on seemingly healthy hardware. Studies have shown a small but present rate of checksum mismatches, indicating corruption in approximately 1 in 10^6 to 10^7 data blocks.
- Hardware Failures: Physical hardware failures are a significant concern. The annual replacement rate for disks can range from 2% to 4%. When considering an entire fleet of storage nodes, the overall failure rate can be between 3.9% and 8.3% annually. These failures often necessitate more extensive recovery procedures, potentially involving data reconstruction from replicated or erasure-coded chunks.
As Figure 3 shows, node restarts and planned reboots are quite common occurrences; unplanned reboots, while less frequent, also happen.
To quantify availability, two key metrics are used:
- Average Availability: This is calculated as the total uptime divided by the sum of total uptime and total downtime.
Average Availability = ∑_i uptime_i / (∑_i uptime_i + ∑_i downtime_i)
- Mean Time To Failure (MTTF): This metric represents the average time a system operates correctly before a failure occurs.
MTTF = Total Uptime / Number of Failures
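As a quick illustration (with made-up numbers, not figures from the paper), here is how the two metrics are computed from per-node uptime and downtime records:

```python
# Hypothetical uptime/downtime in hours for three nodes, plus observed failure count.
uptimes   = [700.0, 650.0, 720.0]
downtimes = [  2.0,  10.0,   0.5]
failures  = 4                       # total failures observed across these nodes

availability = sum(uptimes) / (sum(uptimes) + sum(downtimes))
mttf = sum(uptimes) / failures      # average uptime accumulated per failure
print(f"average availability = {availability:.4f}, MTTF = {mttf:.1f} hours")
```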
A closer look at the causes of storage node unavailability, as shown in Figure 4, reveals that planned reboots are the most frequent, followed by node restarts.
Table 2 provides insights into the MTTF for various components:
- A disk has an MTTF of 10-50 years.
- A node has a significantly shorter MTTF of 4.3 months.
- A rack has an MTTF of 10.2 years.
Data Replication
In a distributed storage system, files are divided into stripes, and each stripe is stored as a set of chunks spread across storage nodes to ensure data availability and fault tolerance. There are two popular methods for encoding a stripe's chunks: replicated encoding and erasure encoding.
Replicated Encoding
Replicated encoding is where file chunks are simply duplicated, with each copy stored on a separate storage node. A single node can hold chunks from various stripes.

Common replication factors include r=2 or r=3. The storage space required directly increases with the replication factor; for example, r=3 means you're using three times the space of the original data.
Erasure Encoding
An original file or message is divided into k data blocks. From these k blocks, m additional parity blocks are computed. The total number of blocks then becomes n = k + m.
- k: Number of original data blocks.
- m: Number of generated parity blocks.
- n: Total number of blocks (n = k + m).
The key advantage is that to retrieve the original data, you only need any k of the n total blocks. This means you can lose up to m blocks and still recover your data.

Reed-Solomon codes are the most widely used erasure codes, offering the maximum possible fault tolerance for a given amount of redundancy.
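To make the "any k of n" idea concrete, here is a toy example with a single XOR parity block (k = 4, m = 1). This is not Reed-Solomon, which handles m > 1 parity blocks, but it shows the recovery property in its simplest form:

```python
from functools import reduce

def xor_blocks(blocks):
    # XOR all blocks together byte by byte (blocks must be equal length).
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]   # k = 4 equal-sized data blocks
parity = xor_blocks(data)                     # m = 1 parity block, n = 5 total

# Lose one data block and rebuild it from the surviving k = 4 blocks.
survivors = [data[0], data[1], data[3], parity]    # data[2] is lost
recovered = xor_blocks(survivors)
assert recovered == data[2]
print("recovered:", recovered)
```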
Correlated Failures
Correlated failures occur when a single event or root cause leads to a large number of nodes failing simultaneously or in quick succession. These are particularly impactful in distributed storage environments because they can wipe out entire sections of data.
For example, when a top-of-rack (ToR) switch goes down, every storage node connected to that switch within the rack becomes unreachable, making all data stored on those nodes unavailable. Similarly, a power outage affecting a specific area or data center can cause widespread, correlated failures.
Failure Burst
A failure burst refers to multiple component failures grouped together within a short timeframe. The authors defined a failure burst as a cluster of failures where the time gap between any two successive failures is less than 120 seconds. While individual failures within a burst must be close (within 120 seconds), the total duration of a burst itself can exceed 120 seconds.
As illustrated in Figure 6, when analyzing the percentage of node failures for different clustering window sizes (with a minimum burst size of 10 nodes), the curve tends to flatten out after 120 seconds. This suggests that a 120-second window is sufficient to capture most clustered failure events.
With this 120-second burst size, the likelihood of a purely random (non-correlated) failure being included in a burst is relatively low, at about 8%. The probability of a random failure being part of a burst involving at least 10 nodes is a minuscule 0.068%. These low probabilities indicate that failure bursts are effective at pinpointing large, correlated failures rather than just isolated random events.
Figure 7 visualizes different types of failure bursts. The steep curve represents scenarios like power outages, where many nodes fail rapidly in quick succession. The slower, more even-paced curve often points to controlled activities such as rolling reboots or upgrade processes.
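The burst-grouping rule itself is easy to state in code. Below is a minimal sketch (hypothetical timestamps, not data from the paper) that clusters failure times using the 120-second gap:

```python
def group_into_bursts(failure_times, window=120):
    """Group failure timestamps (in seconds) into bursts: a failure joins the
    current burst if it occurs less than `window` seconds after the previous
    failure; otherwise it starts a new burst."""
    bursts = []
    for t in sorted(failure_times):
        if bursts and t - bursts[-1][-1] < window:
            bursts[-1].append(t)
        else:
            bursts.append([t])
    return bursts

# Two failures 30 s apart form one burst; a failure 10 minutes later starts another.
print(group_into_bursts([0, 30, 630]))   # -> [[0, 30], [630]]
```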
Rack Affinity
A data cluster consists of machines organized into racks. Within each rack, all machines are interconnected in a mesh topology. For communication across racks, machines connect via top-of-the-rack (ToR) switches. These ToR switches are often single points of failure and a significant source of outages, highlighting the importance of quantifying their failure rate.
The Rack Score (R) quantifies the concentration of failures of N nodes within a data center rack. It's calculated as:
R = ∑_i k_i · (k_i − 1) / 2
where k_i is the number of failures on rack i. R will always be an integer. A larger R value indicates a higher concentration of failures within a single rack.
Rack Affinity for a given rack score R is a probability metric defined as:
Rack Affinity = P(rack score of N randomly chosen nodes < R) + 0.5 · P(rack score of N randomly chosen nodes = R)
In simpler terms, it measures how likely a random failure of the same size (N nodes) is to produce a rack score below the observed R, counting ties with half weight.
Higher Rack Affinity (closer to 1):
- Implies that a random failure of the same size would almost certainly produce a lower rack score than the one observed.
- This means the observed R value is large for a given number of failed nodes (N).
- Indicates that the failure burst is highly concentrated within one or a few racks.
Lower Rack Affinity (closer to 0.5):
- Suggests that the observed R value is relatively small for a given N.
- Indicates that the failure burst is more spread out across multiple racks.
This probability (rack affinity) can be efficiently calculated using dynamic programming over the total number of failed nodes (N) and the observed rack score (R).
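To make these definitions concrete, here is a small sketch that computes the rack score and estimates rack affinity by random sampling. The cell layout and burst below are hypothetical, and the paper computes the probability exactly with dynamic programming rather than by sampling:

```python
import random
from collections import Counter

def rack_score(rack_ids):
    """Rack score R = sum over racks of k_i * (k_i - 1) / 2,
    where k_i is the number of failed nodes on rack i."""
    return sum(k * (k - 1) // 2 for k in Counter(rack_ids).values())

def rack_affinity(failed_node_racks, all_node_racks, trials=100_000):
    """Monte Carlo estimate of rack affinity for an observed burst.

    failed_node_racks: rack id of each failed node in the observed burst.
    all_node_racks:    rack id of every node in the cell (hypothetical layout).
    """
    n = len(failed_node_racks)
    observed = rack_score(failed_node_racks)
    below = equal = 0
    for _ in range(trials):
        sample = rack_score(random.sample(all_node_racks, n))
        below += sample < observed
        equal += sample == observed
    return (below + 0.5 * equal) / trials

# Hypothetical cell of 10 racks x 20 nodes; a burst of 6 failures, 5 on one rack.
cell = [rack for rack in range(10) for _ in range(20)]
burst = [0, 0, 0, 0, 0, 7]
print(rack_score(burst), rack_affinity(burst, cell))
```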
Part 2: Handling Failures
To safeguard against data unavailability caused by failures, data replication across different nodes is crucial. When a node fails, a recovery process begins to restore the data chunks stored on it. These chunks are recovered from their existing copies on other nodes. The recovery queues prioritize data stripes that have lost the most chunks. The speed of this recovery is constrained by the bandwidth of individual disks, nodes, and even entire racks.
Rack-Aware Placement Policy
As previously mentioned, rack failures are a common cause of storage node loss. To mitigate this, a rack-aware placement policy places the chunks of each stripe on nodes in different racks.
This policy allows us to define the probability that k chunks out of n are affected by a failure burst as:
P = Number of ways to place a stripe of size n with k chunks on failed nodes /
Total number of ways to place a stripe of size n in a cell
This probability can be calculated using dynamic programming.
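Below is a sketch of that dynamic program, under the simplifying assumption that a rack-aware placement picks n distinct racks uniformly and one node uniformly within each chosen rack (the paper's exact formulation may differ):

```python
from collections import defaultdict

def stripe_hit_distribution(racks, n):
    """Exact distribution of the number of failed chunks in a stripe of size n,
    assuming one chunk per rack on n distinct racks, all placements equally likely.

    `racks` is a list of (failed_nodes, total_nodes) pairs, one per rack.
    Returns {k: P(exactly k chunks of the stripe land on failed nodes)}.
    """
    # dp[(j, t)] = number of ways to pick j racks so far, one node per picked
    # rack, such that t of the picked nodes are failed.
    dp = defaultdict(int)
    dp[(0, 0)] = 1
    for failed, total in racks:
        nxt = defaultdict(int)
        for (j, t), ways in dp.items():
            nxt[(j, t)] += ways                              # stripe skips this rack
            if j < n:
                nxt[(j + 1, t + 1)] += ways * failed         # chunk on a failed node
                nxt[(j + 1, t)] += ways * (total - failed)   # chunk on a healthy node
        dp = nxt
    total_ways = sum(ways for (j, _), ways in dp.items() if j == n)
    return {k: ways / total_ways for (j, k), ways in dp.items() if j == n}

# Hypothetical cell: 5 racks of 10 nodes; one rack lost 4 nodes, another lost 2.
racks = [(4, 10), (2, 10), (0, 10), (0, 10), (0, 10)]
print(stripe_hit_distribution(racks, n=3))
```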
Figure 10 illustrates the expected MTTF of a data stripe under different failure burst sizes:
- Small: Up to 0.1% of nodes impacted.
- Medium: 0.1% to 1% of nodes impacted.
- Large: 1% to 10% of nodes impacted.
The data shows that as the failure burst size increases, the stripe's MTTF decreases by orders of magnitude. A more redundant encoding, such as RS(20, 10), provides greater resilience to failures; conversely, with a replication factor of R=1, the MTTF is considerably smaller.
Cell Simulation
The authors also conducted a simulation of node failures to assess their impact on stripe availability (measured as the number of stripes unavailable for at least 15 minutes). The results, displayed in Figure 11, demonstrate that the combinatorial calculations from the previous section closely match the actual observed unavailability.
Markov Chains
A stochastic process is a mathematical concept that describes a sequence of random variables within a probability space, often interpreted as evolving over time.
A Markov chain is a specific type of stochastic process. Its key characteristic is that the probability of any future event depends only on its most recent state, not on the sequence of events that led to it.
Consider a two-state (A and B) Markov process as an example:
States: We can represent the current state as a vector. For instance, [ 1 0 ] means there's a 100% probability of being in state A. An initial state could also be probabilistic, like [ 0.25 0.75 ].
Transitions: The movement between states is governed by a transition matrix (T). For example:
T = [[ 0.6 0.7 ]
[ 0.4 0.3 ]]
In this matrix, each column represents the probabilities of transitioning from a specific state, and these probabilities must sum to 1. For instance, from state A (first column), there's a 60% chance of staying in A and a 40% chance of moving to B.
Predicting Future States: With the current state vector S_i, one can predict the next state S_{i+1} using the formula:
S_{i+1} = T × S_i
To find the state after n steps (S_n) from an initial state (S_0):
S_n = T^n × S_0
Interestingly, the state vector as n approaches infinity (S∞) converges to an eigenvector of the transition matrix T. This eigenvector represents the steady-state probabilities of being in each state after a very long time, regardless of the initial starting point.
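Here is a small numerical check of these properties using the two-state transition matrix above (numpy used for convenience):

```python
import numpy as np

# Column-stochastic transition matrix from the example above:
# column A = [0.6, 0.4], column B = [0.7, 0.3].
T = np.array([[0.6, 0.7],
              [0.4, 0.3]])

s = np.array([1.0, 0.0])      # start with 100% probability of being in state A
for _ in range(50):           # repeatedly apply S_{i+1} = T × S_i
    s = T @ s
print("State distribution after 50 steps:", s)

# The limit is the eigenvector of T with eigenvalue 1, normalized to sum to 1.
vals, vecs = np.linalg.eig(T)
steady = vecs[:, np.argmax(np.isclose(vals, 1.0))].real
steady /= steady.sum()
print("Steady state from the eigenvector:", steady)   # ~[0.636, 0.364]
```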
Markov Model for Stripe Availability
The failure probabilities calculated in the previous section considered failure bursts of various sizes but didn't account for correlations between individual failures. In contrast, the Markov model introduces correlated failure events, like those affecting multiple disks or machines.
The Markov model makes a key assumption: individual correlated failure events are independent. For example, a single disk failure might impact multiple data chunks (thereby making it a correlated event), however, the model assumes this failure is independent of any other disk failures. This assumption places the model's robustness in a specific range:
- It's weaker than a "perfect" model where failure events are interdependent. Consider a scenario where one disk fails, and others from the same manufacturing batch follow shortly after their warranty expires - the Markov model doesn't fully capture this dependency.
- It's stronger than a model where every single chunk failure is considered independent. We'll explore that specific case in more detail later.
In this Markov model, a stripe transitions into an unavailable state once fewer than r chunks remain available.
Let's define the probabilities and rates of these transitions:
P_{i,j} represents the probability of transitioning from state i to state j due to a random failure. This failure results in the loss of i − j chunks.
If λ is the overall failure rate across all possible independent failures (like node and disk failures), then λ_{i,j} is the rate of transition for stripes from state i to state j. Since each failure causes this transition with probability P_{i,j} and all failures are considered independent, we can express this as:
λ_{i,j} = λ · P_{i,j}
The rate of transitioning from any available state i to the unavailable state r − 1 (meaning the stripe becomes unavailable) is λ times the sum of probabilities of all transitions from i that leave r − 1 or fewer chunks. This is expressed as:
λ_{i,r−1} = λ · ∑_{j=0}^{r−1} P_{i,j}
That is, we sum the probabilities of going from state i to every state with r − 1 or fewer remaining chunks (j = r − 1, r − 2, …, 0), since any of these transitions leaves the stripe unavailable.
The authors also model recovery as a serial process operating at a constant per-chunk rate ρ. Using both the failure rate λ and this recovery rate ρ, we can calculate the mean time to failure for the system. This calculation essentially involves determining the expected time for a random walk over these states to reach the r − 1 state, signifying a failure.
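For intuition, here is a minimal sketch of such an MTTF calculation as an expected hitting time in a continuous-time Markov chain. The values of λ, ρ, and the transition probabilities P_{i,j} below are invented purely to make the example runnable; they are not the paper's parameters, and the paper's model contains more detail:

```python
import numpy as np

# States: number of available chunks, from r (barely available) up to n.
# The "unavailable" state (fewer than r chunks) is absorbing.
n, r = 3, 1          # e.g. 3-way replication; stripe unavailable below 1 chunk
lam = 1 / 100.0      # hypothetical per-stripe failure event rate (per day)
rho = 1.0            # hypothetical per-chunk serial recovery rate (per day)

# P[i][j]: probability that a failure event takes a stripe from i to j available
# chunks (j <= i). Hypothetical numbers chosen only to make the example concrete.
P = {3: {2: 0.9, 1: 0.08, 0: 0.02},
     2: {1: 0.95, 0: 0.05},
     1: {0: 1.0}}

states = list(range(r, n + 1))            # transient (available) states
idx = {s: k for k, s in enumerate(states)}
Q = np.zeros((len(states), len(states)))  # generator restricted to transient states

for i in states:
    # Failure transitions i -> j at rate lam * P[i][j]
    for j, p in P[i].items():
        rate = lam * p
        Q[idx[i], idx[i]] -= rate
        if j >= r:                        # j < r is the absorbing unavailable state
            Q[idx[i], idx[j]] += rate
    # Serial recovery i -> i+1 at rate rho
    if i < n:
        Q[idx[i], idx[i]] -= rho
        Q[idx[i], idx[i + 1]] += rho

# Expected hitting times of the unavailable state: solve (-Q) t = 1
t = np.linalg.solve(-Q, np.ones(len(states)))
print(f"Stripe MTTF starting from {n} available chunks: {t[idx[n]]:.1f} days")
```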
Findings
The authors employed the Markov model to calculate the MTTF of data stripes under various independent correlated failure events, yielding the following observations:
Model Accuracy in Failure Bursts
The model effectively captures the impact of failure bursts. For instance, in a scenario where dozens of nodes failed, the actual MTTF was 1.76e+6 days, while the model predicted 5e+6 days, putting the prediction within the right order of magnitude.
Rack Failures
The model successfully differentiates between failure bursts that threaten availability (by spanning multiple racks) and those that do not. In an event involving multiple rack failures, the model predicted an MTTF of 5.77e+8 days, compared to an actual MTTF of 29.52e+8 days.
Recovery Rate Impact
The authors also theoretically examined a scenario where chunks fail independently. In this case, the transition rate from state i to i - 1 is iλ, where λ is the failure rate of a single chunk. This is because any of the i available chunks can fail independently. The transition rate from state i to i + 1 is ρ.
Interestingly, the stripe MTTF derived from this independent-failure model (the derivation is not included in the paper) shows that reducing the recovery time, i.e., increasing the recovery rate, can significantly improve the MTTF. Specifically, if the recovery time is reduced by a factor of μ, the stripe MTTF increases by a factor of μ^(s−r). For example, with 3-way replication (s = 3, r = 1), halving the recovery time multiplies the stripe MTTF by 2^2 = 4.
Impact of Correlated Failures on Unavailability and Replication
The presence or absence of correlated failures profoundly impacts unavailability and the effectiveness of replication:
- Without correlated failures: A 10% reduction in recovery time leads to a 19% decrease in unavailability. Adding replication also greatly enhances availability.
- With correlated failures: A 90% reduction in recovery time only results in a 6% decrease in unavailability. Furthermore, replication does not increase the MTTF as significantly when correlated failures are present.
Extension to Multiple Cells
<TODO>
Paper Review
This paper, while conceptually straightforward, is mathematically dense. The main challenge is that many mathematical derivations are omitted, presumably due to space constraints, which makes it difficult to fully grasp the intricacies of the models. Nevertheless, it's an excellent resource for understanding how cloud storage systems achieve data availability.