This paper, published by Google and presented at the USENIX Symposium on Operating Systems Design and Implementation (OSDI) in 2010, is quite challenging due to its highly mathematical nature. I was fortunate to work with one of the co-authors, Florentina Popovici.
1. Paper Structure
This paper covers two independent main topics:
- The first section examines the availability and Mean Time To Failure (MTTF) of individual hardware components like disks and nodes. It also delves into failure bursts and domain-related failures (such as rack failures), which are significant sources of correlated failures.
- The second section discusses how the system's availability is modeled and how failures are handled. It builds:
  - Stripe state-transition probabilities that account for all possible failure burst sizes, verified against a cell simulation.
  - A Markov model that incorporates these probabilities; each failure event is independent, but a single event can impact multiple nodes and chunks, thereby capturing correlated failures.
  - A non-Markov model with constant failure and recovery rates that treats chunk failures as independent, yielding simpler but less accurate estimates.
2. Part 1: Availability
In a typical data center cluster, thousands to tens of thousands of machines host millions of nodes, which are essentially processes running on these machines. Often, a group of these nodes is dedicated to a specific distributed system, like storage. For example, some nodes might be part of a distributed file system, storing large volumes of data.
This paper specifically examines the availability of distributed storage systems. Storage nodes in a distributed system can become unavailable for a variety of reasons, each with different implications for recovery time and data integrity:
- Node Restarts: These are the quickest to recover from, typically taking only tens of seconds. It's often just a matter of restarting a process.
- Planned Reboots: For maintenance, repairs, or upgrades, machines (running the node) may undergo planned reboots. These can take anywhere from several minutes to a few hours.
- Unplanned Reboots: Unexpected events like kernel crashes also lead to reboots. Similar to planned reboots, these can cause downtime ranging from several minutes to hours.
- Hardware Corruption: Data can become corrupted even on seemingly healthy hardware. Studies have shown a small but present rate of checksum mismatches, indicating corruption in 1 in 10⁶ to 10⁷ data blocks in Google’s data centers.
- Hardware Failures: Physical hardware failures are a significant concern. The annual replacement rate for disks ranges from 2% to 4%. When considering an entire fleet of storage nodes, the overall failure rate is between 3.9% and 8.3% annually. These failures necessitate more extensive recovery procedures, potentially involving data reconstruction from replicated or erasure-coded chunks.
To quantify availability, two key metrics are used:
- Average Availability: This is calculated as the total uptime divided by the sum of total uptime and total downtime.
Average Availability = Σᵢ Uptimeᵢ / (Σᵢ Uptimeᵢ + Σᵢ Downtimeᵢ)
- Mean Time To Failure (MTTF): This metric represents the average time a system operates correctly before a failure occurs.
MTTF = Total Uptime / Number of Failures
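To make the two metrics concrete, here is a minimal Python sketch that computes both from per-node uptime, downtime, and failure counts; the node records and field names are made up for illustration:

```python
# Minimal sketch: average availability and MTTF from per-node statistics.
# The records below are illustrative, not data from the paper.
nodes = [
    {"uptime_h": 8000.0, "downtime_h": 20.0, "failures": 3},
    {"uptime_h": 7900.0, "downtime_h": 120.0, "failures": 5},
]

total_uptime = sum(n["uptime_h"] for n in nodes)
total_downtime = sum(n["downtime_h"] for n in nodes)
total_failures = sum(n["failures"] for n in nodes)

availability = total_uptime / (total_uptime + total_downtime)
mttf_hours = total_uptime / total_failures

print(f"average availability: {availability:.4f}")
print(f"MTTF: {mttf_hours:.1f} hours")
```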
A closer look at the causes of storage node unavailability, as shown in Figure 4, reveals that planned reboots are the most frequent, followed by node restarts.
Table 2 provides insights into the MTTF for various components:
- A disk has an MTTF of 10-50 years.
- A node has a significantly shorter MTTF of 4.3 months.
- A rack has an MTTF of 10.2 years.
2.1. Data Replication
In a distributed storage system, files are broken down into stripes, and each stripe is stored as a set of chunks with added redundancy to ensure data availability and fault tolerance. There are two popular schemes for generating a stripe's chunks: replicated encoding and erasure encoding.
2.1.1. Replicated Encoding
Replicated encoding is where file chunks are simply duplicated, with each copy stored on a separate storage node, as shown in illustration 1. A single node can hold chunks from various stripes.
Illustration 1: Replicated encoding.
Common replication factors include r=2 or r=3. The storage space required directly increases with the replication factor.
2.1.2. Erasure Encoding
A file is divided into k data blocks. From these k blocks, m additional parity blocks are computed. The total number of blocks then becomes n = k + m, as shown in illustration 2.
- k: Number of original data blocks.
- m: Number of generated parity blocks.
- n: Total number of blocks (n = k + m).
The key advantage is that to retrieve the original data, you only need any k of the n total blocks. This means you can lose up to m blocks and still recover your data.
Illustration 2: Erasure encoding.
Reed-Solomon codes are the most widely used erasure codes, offering the maximum possible fault tolerance for a given amount of redundancy.
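To make the "any k of n" property concrete, here is a toy Python sketch for the simplest case m = 1, using a single XOR parity block. This is not Reed-Solomon itself (which handles arbitrary m), but it shows how any one lost block can be rebuilt from the k surviving blocks; the block contents are made up for illustration:

```python
# Toy k-of-n illustration for m = 1: one XOR parity block can rebuild
# any single lost data block. Real systems use Reed-Solomon codes,
# which generalize this idea to m > 1 parity blocks.
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

k = 4
data_blocks = [bytes([i] * 8) for i in range(1, k + 1)]  # hypothetical data
parity = xor_blocks(data_blocks)                         # the m = 1 parity block

# Lose one data block, then reconstruct it from the k surviving blocks.
lost_index = 2
survivors = [b for i, b in enumerate(data_blocks) if i != lost_index] + [parity]
recovered = xor_blocks(survivors)
assert recovered == data_blocks[lost_index]
print("recovered block", lost_index, ":", recovered.hex())
```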
2.2. Correlated Failures
2.2.1. Failure Burst
2.2.2. Rack Affinity
Higher Rack Affinity (closer to 1):
- Implies a greater probability of observing the given rack score or a lower one.
- This means the observed R value is large for a given number of failed nodes (N).
- Indicates that the failure burst is highly concentrated within one or a few racks.
Lower Rack Affinity (closer to 0.5):
- Suggests that the observed R value is relatively small for a given N.
- Hence, it indicates that the failure burst is more spread out across multiple racks.
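A rough sketch of how such a score could be computed: assume a rack score equal to the sum over racks of kᵢ(kᵢ-1)/2, where kᵢ is the number of failed nodes in rack i (i.e., the number of same-rack failure pairs, which rewards concentration), and estimate the affinity by Monte Carlo as the probability that a random burst of the same size scores lower, plus half the probability of a tie. The exact score formula is my assumption; the cell layout, burst, and function names below are hypothetical:

```python
import random
from collections import Counter

def rack_score(failed_nodes, node_rack):
    """Sum over racks of k * (k - 1) / 2, where k is the failures in that rack."""
    counts = Counter(node_rack[n] for n in failed_nodes)
    return sum(k * (k - 1) // 2 for k in counts.values())

def rack_affinity(failed_nodes, node_rack, trials=10_000):
    """Monte Carlo estimate: P(random burst scores lower) + 0.5 * P(tie)."""
    observed = rack_score(failed_nodes, node_rack)
    all_nodes = list(node_rack)
    lower = ties = 0
    for _ in range(trials):
        s = rack_score(random.sample(all_nodes, len(failed_nodes)), node_rack)
        lower += s < observed
        ties += s == observed
    return (lower + 0.5 * ties) / trials

# Hypothetical cell: 10 racks with 20 nodes each; a burst concentrated in one rack.
node_rack = {f"n{r}-{i}": f"rack{r}" for r in range(10) for i in range(20)}
burst = [f"n3-{i}" for i in range(6)]  # 6 failures, all in rack3
print(f"rack affinity ~ {rack_affinity(burst, node_rack):.3f}")  # close to 1
```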
3. Part 2: Handling Failures
3.1. Rack-Aware Placement Policy
- Small: Up to 0.1% of nodes impacted.
- Medium: 0.1% to 1% of nodes impacted.
- Large: 1% to 10% of nodes impacted.
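The idea behind rack-aware placement is to put the chunks of a stripe on nodes in different racks, so that a rack failure, or a burst concentrated in one rack, can take out at most one chunk of any given stripe. A minimal sketch of such a policy, with a hypothetical cell layout and function name:

```python
import random

def rack_aware_placement(racks, num_chunks):
    """Place each chunk of a stripe on a node in a distinct rack (illustrative)."""
    if num_chunks > len(racks):
        raise ValueError("need at least as many racks as chunks in the stripe")
    chosen_racks = random.sample(list(racks), num_chunks)  # distinct racks
    return [random.choice(racks[r]) for r in chosen_racks]

# Hypothetical cell: 5 racks with 4 nodes each; place a 3-chunk stripe.
racks = {f"rack{r}": [f"n{r}-{i}" for i in range(4)] for r in range(5)}
print(rack_aware_placement(racks, num_chunks=3))
```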
3.1.1. Cell Simulation
3.2. Markov Chains
A stochastic process is a mathematical concept that describes a sequence of random variables within a probability space, often interpreted as evolving over time.
A Markov chain is a specific type of stochastic process. Its key characteristic is that the probability of any future event depends only on its most recent state, not on the sequence of events that led to it.
Consider a two-state (A and B) Markov process as an example, as shown in illustration 4.
Illustration 4: A two-state Markov process.
States: We can represent the current state as a vector. For instance, [ 1 0 ] means there's a 100% probability of being in state A. An initial state could also be probabilistic, like [ 0.25 0.75 ].
Transitions: The movement between states is governed by a transition matrix (T). For example:
T = [[ 0.6  0.7 ]
     [ 0.4  0.3 ]]
In this matrix, each column represents the probabilities of transitioning from a specific state, and these probabilities must sum to 1. For instance, from state A (first column), there's a 60% chance of staying in A and a 40% chance of moving to B.
Predicting Future States: Given the current state vector (Sᵢ), one can compute the next state (Sᵢ₊₁) using the formula:
Sᵢ₊₁ = T × Sᵢ
To find the state after n steps (Sₙ) from an initial state (S₀):
Sₙ = Tⁿ × S₀
Interestingly, as n approaches infinity, the state vector (S∞) converges to the eigenvector of the transition matrix T with eigenvalue 1 (normalized so its entries sum to 1). This eigenvector represents the steady-state probabilities of being in each state after a very long time, regardless of the initial starting point.
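A short numeric check of these properties using NumPy, with the column-stochastic matrix T from above and states ordered [A, B]; the script is illustrative, not from the paper:

```python
import numpy as np

# Column-stochastic transition matrix: column j holds P(next state | current = j).
T = np.array([[0.6, 0.7],
              [0.4, 0.3]])
s0 = np.array([1.0, 0.0])  # start with 100% probability of being in state A

# State after n steps: S_n = T^n x S_0
s10 = np.linalg.matrix_power(T, 10) @ s0
print("state after 10 steps:", s10)

# Steady state: the eigenvector of T with eigenvalue 1, normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(T)
steady = eigvecs[:, np.argmin(np.abs(eigvals - 1.0))].real.copy()
steady /= steady.sum()
print("steady state:", steady)  # roughly [0.636, 0.364] for this T
```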
3.2.1. Markov Model for Stripe Availability
The paper models the availability of a single stripe with a Markov chain whose state is the number of that stripe's chunks currently available; failures move the chain toward fewer available chunks, and recoveries move it back. In terms of how it treats correlated failures, this model sits between two extremes (a minimal sketch of such a chain follows the list):
- It's weaker than a "perfect" model where failure events are interdependent. Consider a scenario where one disk fails and others from the same manufacturing batch follow shortly after their warranty expires; the Markov model doesn't fully capture this dependency.
- It's stronger than a model where every single chunk failure is considered independent. We'll explore that specific case in more detail later.
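To show what such a stripe-level chain looks like in the simplest setting, the sketch below uses states equal to the number of available chunks, with state 0 (data loss) absorbing, and assumes R = 3 replication with independent exponential chunk failures (rate λ per available chunk) and recoveries (rate ρ per missing chunk). The independence assumption makes this closer to the weaker baseline model just mentioned; the paper's full model instead folds the correlated burst-size probabilities into the transitions. The rates are made up for illustration:

```python
import numpy as np

R = 3              # replication factor
lam = 1.0 / 400.0  # per-chunk failure rate (1/days); illustrative value only
rho = 1.0 / 0.5    # per-missing-chunk recovery rate (1/days); illustrative only

# States 1..R = number of available chunks; state 0 (data loss) is absorbing.
# The expected times E[s] to reach state 0 satisfy, for each transient state s:
#   (down_s + up_s) * E[s] - down_s * E[s-1] - up_s * E[s+1] = 1, with E[0] = 0.
A = np.zeros((R, R))
b = np.ones(R)
for s in range(1, R + 1):
    down = s * lam      # one of the s available chunks fails
    up = (R - s) * rho  # one missing chunk gets re-replicated
    A[s - 1, s - 1] = down + up
    if s - 1 >= 1:
        A[s - 1, s - 2] -= down  # drop to state s-1
    if s + 1 <= R:
        A[s - 1, s] -= up        # recover to state s+1
E = np.linalg.solve(A, b)
print(f"stripe MTTF starting from {R} available chunks: {E[R - 1]:.3g} days")
```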
3.2.2. Findings
Model Accuracy in Failure Bursts
Rack Failures
Recovery Rate Impact
Impact of Correlated Failures on Unavailability and Replication
- Without correlated failures: A 10% reduction in recovery time leads to a 19% decrease in unavailability, and increasing replication also greatly enhances availability.
- With correlated failures: A 90% reduction in recovery time only results in a 6% decrease in unavailability. Furthermore, replication does not increase the MTTF as significantly when correlated failures are present.