
Paper Insights #36 - Orca: A Distributed Serving System for Transformer-Based Generative Models

This recent paper, presented at USENIX OSDI '22 by researchers from Seoul National University, offers an excellent introduction to modern ML serving systems.

Paper Link

Let's begin with some basic concepts.

Hardware Accelerators

Modern computing systems rely on hardware accelerators to speed up computationally intensive tasks. These specialized components execute certain classes of operations far more efficiently than a general-purpose CPU working through its general instruction set.

Accelerators typically connect to the system bus via a standard PCIe interface, where they receive data and instructions from the CPU.

Common examples of hardware accelerators include:
  • GPUs (Graphics Processing Units): Originally designed for SIMD (Single Instruction, Multiple Data) processing, making them ideal for image manipulation. More recently, GPUs have become crucial for training and inference in AI and machine learning applications.
  • FPGAs (Field-Programmable Gate Arrays): These integrated circuits offer post-manufacturing reconfigurability. They consist of a grid of reconfigurable logic blocks and programmable interconnects, enabling users to implement custom digital circuits tailored to specific applications.
  • Cryptographic Accelerators: Co-processors specifically engineered to handle computationally demanding cryptographic operations.

Graphics Processing Units (GPUs)

GPUs were initially designed for rendering graphics but have since evolved into powerful parallel processors. CPUs typically use a Single Instruction, Single Data (SISD) approach, whereas GPUs excel at Single Instruction, Multiple Data (SIMD) computations. This means they can apply the same operation to many data points simultaneously, which is ideal for tasks like image processing, where identical computations are performed on numerous pixels.

The Rise of General-Purpose GPUs

A significant shift occurred in 2007 with the introduction of NVIDIA's TESLA architecture. This marked the advent of the first General-Purpose GPUs (GPGPUs), specifically designed for non-graphics computations. Unlike consumer-oriented GeForce cards, TESLA GPUs were built for high-performance SIMD operations. This innovation paved the way for subsequent generations of powerful GPU architectures like the V100 (2017), A100 (2020), and H100 (2022), which are now widely used in data centers and for scientific computing. One can program TESLA and later GPUs using CUDA.

Architecture


A GPU's power comes from its highly parallel architecture. GPUs are composed of many Streaming Multiprocessors (SMs), often referred to as cores. For example, a V100 GPU has 80 SMs. Each SM, in turn, contains multiple subcores; the V100, for instance, has four subcores per SM.


Each subcore is equipped with specialized units to handle various computations:

  • Fetch/decode units for instructions.
  • Arithmetic units for integer and floating-point operations.
  • Tensor units for matrix computations, crucial for deep learning.
  • Load/store units for memory access.
  • A set of warp slots; each warp holds the register state of a group of CUDA threads that execute in lockstep.

A CUDA thread's state is stored in registers organized into warps. These threads execute a single, common instruction simultaneously on different data, embodying the SIMD principle.


For example, in the V100, each warp can store the state of 32 CUDA threads. There are 16 such warps per subcore. This allows a V100 subcore to execute 512 threads in parallel; an entire SM can execute 2,048 threads concurrently, and the entire GPU can execute 163,840 threads in parallel.

Memory Model

GPUs feature a memory hierarchy, loosely analogous to a NUMA (Non-Uniform Memory Access) architecture, with three levels:

  • Global memory: Accessible by all subcores.
  • Shared memory: Accessible only within a specific SM (core).
  • Local memory: Accessible only by a single thread.

CUDA

CUDA (Compute Unified Device Architecture) is NVIDIA's programming platform, exposed as a set of C/C++ extensions, for developing GPU kernels. A kernel is a routine compiled specifically to run on an accelerator such as a GPU.

Consider a simple loop like 

for (int i = 0; i < n; i++)
    a[i] = 2 * a[i];

The body of the loop is a prime example of a kernel, as it involves the same operation applied to multiple data points.

When a program runs, it typically starts on the CPU. To utilize the GPU, the CPU first copies the necessary data to the GPU's global memory. Then, it initiates the creation of threads on the GPU.

For very large datasets (e.g., billions of data items), the data is divided into blocks. Each block is mapped to an SM core in a round-robin fashion, with the context stored in warp registers. This allows for efficient distribution of work across the numerous parallel processing units of the GPU. While the entire dataset resides in global memory, individual threads can copy portions to their respective shared or local memory for faster access during computation.
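To make the block-to-SM mapping concrete, here is a toy sketch in plain Python (not CUDA) of how a large array could be chunked into fixed-size blocks that are assigned to SMs in round-robin order; NUM_SMS and BLOCK_SIZE are illustrative values, not hardware constants.

# Toy illustration (plain Python, not CUDA): chunk a large array into
# fixed-size blocks and assign each block to an SM in round-robin order.
NUM_SMS = 80          # e.g., a V100 has 80 SMs
BLOCK_SIZE = 1024     # elements (one thread each) per block; illustrative

def assign_blocks(n_elements):
    """Return a list of (sm_id, start, end) block assignments."""
    assignments = []
    num_blocks = (n_elements + BLOCK_SIZE - 1) // BLOCK_SIZE
    for block_id in range(num_blocks):
        sm_id = block_id % NUM_SMS                  # round-robin mapping
        start = block_id * BLOCK_SIZE
        end = min(start + BLOCK_SIZE, n_elements)
        assignments.append((sm_id, start, end))
    return assignments

# Each block would then run the kernel body (e.g., a[i] = 2 * a[i]),
# one thread per element, reading from global memory.
print(assign_blocks(5000)[:3])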

Synchronization Primitives

To manage parallel execution and data consistency, CUDA provides synchronization primitives:

  • Barriers: Used to synchronize threads within a thread block. For instance, one thread might copy data from global to SM memory, and the other threads in that block will wait until the copy is complete.
  • Atomics: Used to synchronize access to shared and global variables, ensuring that concurrent writes or reads don't lead to data corruption.

Specifications

GPUs have grown increasingly powerful over time.

Feature             V100 (2017)     H100 (2022)
CUDA Cores          5,120           7,296
Transistors         21.1 billion    80 billion
Process Node        12 nm           4 nm
Power               250 W           350 W
FP32 Performance    14.1 TFLOPS     25 TFLOPS
Memory Capacity     16 GB           80 GB
Memory Bandwidth    900 GB/s        1,280 GB/s


Note: The listed memory bandwidth is the bandwidth between the GPU and its own on-package memory (HBM). Bandwidth between system memory and GPU memory depends on the PCIe interface speed (e.g., a PCIe Gen 5 x16 link offers roughly 64 GB/s in each direction).

GPU <-> GPU Communication

NVIDIA Collective Communications Library (NCCL), pronounced "nickel", is a software library that facilitates communication between GPUs. NCCL is the GPU counterpart to MPI (Message Passing Interface), which is widely used for communication in traditional High-Performance Computing (HPC) systems over regular networks.

However, NCCL isn't primarily designed for communication over a regular network (such as Ethernet). Instead, it's optimized for dedicated GPU interconnects, specifically NVLink, which directly connects GPUs within a machine in a mesh-like topology for fast data exchange. CPU <-> GPU communication happens over PCIe, whereas GPU <-> GPU communication happens over NVLink.


ML Serving

Once a machine learning model is trained, the real work of ML serving (or model serving) begins. This isn't about the training process itself; instead, it's an infrastructure challenge focused on taking that trained model, packaging it up, and deploying it as an API. The goal is to make ML models readily available to generate predictions at scale.

Every ML serving system operates with two distinct layers:

  • The Serving Layer (Frontend): Responsible for receiving incoming prediction requests from other systems and returning the responses to them.
  • The Execution Layer (Backend): Stores the actual machine learning model and runs the inputs through it to generate outputs, which are then passed back to the serving layer.

For cutting-edge ML models like Transformers (a widely used modern architecture), specific systems have emerged to handle each layer. For the serving layer, popular choices include Triton and TensorFlow Serving. For the execution layer, FasterTransformer and LightSeq are frequently used.

Transformers

Disclaimers:
  • I'm not an ML expert, and my knowledge of Transformers is limited to what's essential for grasping the underlying training and serving system architectures. This section will offer a concise overview of only the most crucial aspects.
  • There are many variations of Transformers used for a variety of applications, such as language processing, image processing, etc. Here, we are going to explore a simple Transformer for language processing only.
  • The section assumes that readers have a basic understanding of neural networks.

A Transformer takes one or more input tokens and produces a single output token. Mathematically, a token is represented as an embedding, which is an H-dimensional vector, where H (often called the embedding size or hidden state size) is the total number of features.

At its core, a Transformer processes information through a series of interconnected layers. 


Inside each layer, there are several interconnected neural networks. All these layers transform one or more H-dimensional input vectors into another H-dimensional vector, primarily by applying the learned weights to them and then performing some normalization.
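As a rough mental model only, here is a sketch of such a layer in NumPy, with random arrays standing in for the learned weights (this is not the actual Transformer layer definition): it maps an H-dimensional vector to another H-dimensional vector by applying weights and then normalizing.

import numpy as np

H = 8                                   # toy embedding size
rng = np.random.default_rng(0)
W = rng.standard_normal((H, H))         # "learned" weights (random stand-ins)
b = rng.standard_normal(H)              # "learned" bias

def layer(x):
    """Map an H-dimensional vector to another H-dimensional vector."""
    y = x @ W + b                               # apply the weights
    return (y - y.mean()) / (y.std() + 1e-6)    # simple normalization

x = rng.standard_normal(H)
print(layer(x).shape)                   # (8,) -- still H-dimensional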

The Attention Layer

The Attention layer is a critical component within Transformer-based neural networks. For text (token-based) data, a single-headed attention layer operates with three key elements:
  • Query (Q): What is being looked at? In the context of text processing, this represents the current token that the attention mechanism is focusing on.
  • Keys (K): What are the elements to pay attention to? These are all the tokens in the sequence that the query might attend to.
  • Values (V): What are the values associated with the keys?
Let's use the sentence: "a quick brown fox jump over a lazy dog" to illustrate. When the attention layer processes the word "jump", it's essentially asking: "How relevant is 'jump' to every other word in this sentence?" It then evaluates this relationship with respect to all the tokens: ["a", "quick", "brown", "fox", "jump", "over", "a", "lazy", "dog"].

Transformers in practice use multi-headed attention, which runs several attention heads in parallel, each with its own learned projections; we'll focus on the single-headed version here for simplicity.

The Mathematics Behind Attention

At the heart of the attention mechanism are three core matrices:
  • Query (Q): A 1 x H dimensional vector (or matrix). This is essentially the input to the attention layer, representing the token whose relationships with others we're trying to understand.
  • Key (K): An L x H dimensional matrix, where L is the total number of tokens.
  • Value (V): An L x H dimensional matrix.
For each of the L tokens being evaluated against the query, its individual key and value are H-dimensional vectors. These are typically generated by multiplying the token's embedding with weight parameters (Wk and Wv). These parameters are learned during the model's training phase. Because there are L tokens to consider, the keys and values form LxH dimensional matrices.

While both keys and values are derived from tokens, they serve distinct purposes. The key is used to calculate how relevant a token is to the query (its "attention weight"). The value, on the other hand, represents the features or content associated with that token, which will be combined based on those attention weights.

The primary goal is to compute 

softmax(QKᵀ)V

Let's break down this calculation:
  • QKᵀ: This matrix multiplication results in a 1 x L dimensional matrix, which quantifies how much "attention" the query should pay to each of the L tokens.
  • softmax(...): Applying the softmax function to this 1 x L matrix normalizes these attention scores, ensuring they sum to 1.
  • softmax(QKᵀ)V: Finally, multiplying these normalized attention scores by the value matrix (V) yields a 1 x H dimensional matrix (i.e., a vector). This resulting vector is the output of the attention layer for the given query.
Essentially, the attention layer takes an H-dimensional input vector and produces an H-dimensional output vector, effectively transforming the query based on its relevance to all tokens.
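A minimal NumPy sketch of this computation for a single query, with toy values of L and H and random matrices standing in for the real keys and values (real Transformers also scale the scores by 1/sqrt(H) before the softmax, which is omitted here to mirror the text):

import numpy as np

L, H = 9, 8                         # toy token count and embedding size
rng = np.random.default_rng(0)

Q = rng.standard_normal((1, H))     # query:  1 x H
K = rng.standard_normal((L, H))     # keys:   L x H
V = rng.standard_normal((L, H))     # values: L x H

scores = Q @ K.T                    # 1 x L attention scores
weights = np.exp(scores - scores.max())
weights /= weights.sum()            # softmax: scores now sum to 1
output = weights @ V                # 1 x H output of the attention layer

print(weights.shape, output.shape)  # (1, 9) (1, 8)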

Attention Layer for Autoregressive Models

In autoregressive models, the modeling approach used in Generative Pre-trained Transformers (GPT), the goal is to generate text one token at a time, where each new token is predicted based on the preceding context. When Transformers are used for this task, a technique called causal masking is applied: during the attention calculation, a token can only "attend" to (learn from) tokens that came before it in the sequence, not tokens that appear later.

For an attention layer in GPT, the components are defined slightly differently:
  • Query (Q): This is the embedding of the last token generated, serving as the context for predicting the next one.
  • Keys (K): These are the keys for all tokens that precede the token currently being predicted.
  • Values (V): These are the values for all tokens that precede the token currently being predicted.
The output of this attention layer will be an H-dimensional vector. This vector essentially represents the embedding of the next word in the sequence, ready to be processed further to determine the actual token.

Example Walkthrough

Let's solidify the concepts we've discussed by walking through an example, mirroring the one from the paper.


In the provided diagram, each node represents a Transformer layer. During each iteration of this process, a context and a query are supplied as input, and a single token is produced as output.

Consider the following progression:
  • Iteration 1:
    • Query: "this"
    • Context: "I think this"
    • Output: "is"
  • Iteration 2:
    • Query: "is"
    • Context: "I think this is"
    • Output: "great"
  • Iteration 3:
    • Query: "great"
    • Context: "I think this is great"
    • Output: <EOS> (End of Sequence token)
The context provided in each step is subsequently transformed into the keys and values utilized by the attention layer. Notably, with each successive iteration, the dimensionality (L) of both the keys and values progressively increases.
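A small sketch of this loop, with random embeddings standing in for the real model: the cached keys and values gain one row per iteration (L grows), while the query is always the single most recent token.

import numpy as np

H = 8
rng = np.random.default_rng(0)

def attend(q, K, V):
    """Single-query attention: softmax(q K^T) V."""
    s = q @ K.T
    w = np.exp(s - s.max()); w /= w.sum()
    return w @ V

# Start with the context "I think this" (three toy embeddings).
K_cache = rng.standard_normal((3, H))
V_cache = rng.standard_normal((3, H))
query = rng.standard_normal((1, H))        # embedding of the latest token

for step in range(3):                      # three increment iterations
    out = attend(query, K_cache, V_cache)  # 1 x H "next token" embedding
    # Append the new token's key/value (random stand-ins for the Wk/Wv
    # projections), so L grows by one each iteration.
    K_cache = np.vstack([K_cache, rng.standard_normal((1, H))])
    V_cache = np.vstack([V_cache, rng.standard_normal((1, H))])
    # In a real model, `out` is decoded into a token whose embedding
    # becomes the next query.
    query = out
    print(f"iteration {step + 1}: L = {len(K_cache)}")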

Note: The paper differentiates between an initiation phase and an increment phase. In the initiation phase, a multi-head attention layer processes all the initial context tokens. For simplicity, however, we'll focus solely on single-head attention in all phases.

Batching

Processing only one <input, context> pair at a time within each iteration significantly underutilizes the capacity of GPUs, which are highly efficient at parallel matrix operations. To boost throughput, processing typically involves batching multiple requests together.

Here's how the matrix dimensions change with batching:
  • Query (Q): A B × H dimensional matrix.
  • Key (K): A B × L × H dimensional matrix.
  • Value (V): A B × L × H dimensional matrix.
Here, B represents the batch size (the number of <input, context> pairs processed simultaneously). Each "kernel" operation, which constitutes a single iteration, processes an entire batch. For effective batching, a critical requirement is that the input shapes of all matrices within the same batch must be consistent. Since H (the embedding dimension) is typically constant, the crucial constraint is that L (the number of tokens or context length) must be identical across all inputs in the batch. This means all input sequences within a batch must have the same length for batching to work efficiently.
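A small NumPy sketch of the shape constraint: per-request key matrices with different context lengths cannot be stacked into a single B x L x H tensor directly, which is why existing systems pad shorter contexts (toy sizes below).

import numpy as np

B, H = 2, 8
rng = np.random.default_rng(0)

lengths = [5, 9]                                     # per-request context lengths
Ks = [rng.standard_normal((l, H)) for l in lengths]  # ragged: cannot be stacked as-is

L_max = max(lengths)
padded = np.stack([
    np.vstack([k, np.zeros((L_max - len(k), H))])    # pad shorter contexts with zeros
    for k in Ks
])
print(padded.shape)                                  # (2, 9, 8), i.e., B x L_max x H
# The zero rows are wasted memory and compute: the "forced padding" problem.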

Pipelined Parallelism

Pipelined parallelism is a widely used technique, particularly in compiler design for parallel programs, and has also found its way into ML model compilers for efficient model implementation.

The core idea is simple: a large task is broken down into sequential stages. Think of it like an assembly line, similar to how cars are manufactured. As one work item (e.g., a car chassis) finishes stage Si (e.g., welding) and moves on to stage Si+1 (e.g., painting), a succeeding work item (the next car chassis) can immediately enter stage Si. This keeps all stages busy, increasing overall throughput.
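A back-of-the-envelope sketch of why this helps, assuming S stages that each take one unit of time and N work items: without pipelining the total time is N * S, while a full pipeline finishes in S + (N - 1).

def sequential_time(n_items, n_stages, stage_time=1):
    # Each item occupies the whole pipeline before the next one starts.
    return n_items * n_stages * stage_time

def pipelined_time(n_items, n_stages, stage_time=1):
    # The first item takes n_stages; each later item finishes one
    # stage_time after the previous one.
    return (n_stages + (n_items - 1)) * stage_time

print(sequential_time(4, 3))   # 12 time units
print(pipelined_time(4, 3))    # 6 time units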



We'll explore later how this technique is implemented within different ML execution engines.

Orca

Orca is a serving system designed specifically for Transformer-based ML models. What sets Orca apart is its tight integration of the serving and execution engines into a single, unified system. This unique design delivers significant speedups for individual requests and achieves high overall throughput by maximizing the utilization of GPUs.

We'll start with a conceptual overview of Orca's goals before diving deeper into its architecture.

Design

Conceptually, Orca operates as a sophisticated scheduler for inference requests, where each request involves generating a sequence of tokens. The processing for a given request concludes once all its tokens have been generated through successive iterations of the transformer model.

Limitations of Existing Transformer Serving Systems

Existing transformer serving systems (like FasterTransformer) commonly employ request-level scheduling. This means the execution engine processes requests in fixed batches, where every request within a batch starts with the same number of context tokens. Multiple iterations are then run until all requests in the batch conclude with an <EOS> token.

This approach presents two significant disadvantages:
  • Batch Bottleneck: The performance of an entire batch is bottlenecked by the slowest or longest-running request within it. If one request requires generating significantly more tokens than others (e.g., 100 times more), all other requests in that batch will be delayed until the longest one completes.
  • Forced Padding: To create uniform batches, all requests are typically forced to have an equal number of tokens. Existing systems achieve this by padding shorter contexts to match the length of the longest context in the batch, leading to wasted computation and memory.

Solution 1: Iteration-Level Scheduling

To mitigate the latency issues, Orca introduces iteration-level scheduling. Instead of processing a full request end-to-end, Orca picks requests from a pool, processes them for only a single transformer iteration (generating one new token), and then returns them to the request pool. This allows requests that generate an <EOS> token to be immediately returned to the user, significantly improving latency for shorter outputs.


With iteration-level scheduling, it becomes crucial to manage the Key-Value (KV) cache for each request. For every attention layer and for each token in a request's context, the corresponding KV cache entries must be stored on the GPU. This state is retained in the Attention Manager until the request fully completes. This retention is essential for incrementally computing KV pairs across iterations and avoiding redundant computation.

Request Selection for Scheduling

Orca's execution engine maintains a pool of requests eligible for the next iteration run (i.e., those awaiting their next token). Requests are selected based on two primary factors:
  • Maximum Batch Size: This is primarily constrained by the GPU's computational capacity, which depends on its core and subcore count.
  • Available GPU Memory: The most variable element of memory consumption is the KV cache stored in the Attention Manager. Model parameters, code, and other static elements consume a constant amount of memory. Each request has an upper bound on the maximum number of tokens it can generate, which in turn sets an upper bound on its memory footprint within the Attention Manager.
As described in Algorithm 1, the scheduler:
  1. Picks requests from the pool until the batch reaches its maximum size (Lines 21-23).
  2. Selects requests for processing only if their maximum token slots in the Attention Manager do not exceed the total available Attention Manager slots (Lines 24-26).
  3. Once selected, these requests are processed through one iteration of the transformer. After the iteration completes, the newly generated token for each request is appended to it. Requests marked as finished (Lines 13-14) are removed from the batch, and their reserved slots in the Attention Manager are freed, allowing new requests to be added.
While the current scheduler doesn't explicitly prioritize requests or factor in computational investment, these considerations could be integrated.
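A simplified Python sketch of such an iteration-level scheduling loop; this is not the paper's Algorithm 1, and names like max_batch_size, free_kv_slots, and run_one_iteration are made up for illustration.

from dataclasses import dataclass, field

@dataclass
class Request:
    req_id: int
    max_tokens: int                      # upper bound on tokens it may generate
    tokens: list = field(default_factory=list)
    finished: bool = False

def schedule_iteration(pool, max_batch_size, free_kv_slots, run_one_iteration):
    # Select requests for this iteration, respecting batch size and KV memory.
    batch, reserved = [], 0
    for req in list(pool):
        if len(batch) == max_batch_size:
            break
        if reserved + req.max_tokens > free_kv_slots:
            continue                     # not enough Attention Manager slots
        batch.append(req)
        reserved += req.max_tokens

    # One transformer iteration: each selected request gets exactly one new token.
    for req, tok in zip(batch, run_one_iteration(batch)):
        req.tokens.append(tok)
        if tok == "<EOS>" or len(req.tokens) >= req.max_tokens:
            req.finished = True
            pool.remove(req)             # frees its reserved KV slots

# Toy demo with a fake "model" that emits one token and then <EOS>.
pool = [Request(req_id=i, max_tokens=4) for i in range(3)]
fake_model = lambda batch: ["<EOS>" if r.tokens else "tok" for r in batch]
while pool:
    schedule_iteration(pool, max_batch_size=2, free_kv_slots=8,
                       run_one_iteration=fake_model)
print("all requests finished")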

Solution 2: Selective Batching

Orca implements selective batching across different neural network layers. All layers except for the attention layer can process inputs in batches, as they typically operate on matrices of a consistent Batch Size (B) x H.

However, the attention layer is different; it works with matrices of dimensions B x L x H, where the context length L can vary between requests within a batch. To accommodate this, Orca applies a Split operation that separates the batched input into per-request inputs before the attention layer. After the attention layer processes these per-request inputs, a Merge operation recombines their outputs before they are passed to the subsequent neural network layers.
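A rough NumPy sketch of the Split/Merge idea, with toy shapes: a single shared weight matrix stands in for all the non-attention layers (which run batched), while attention runs per request because each request's L differs.

import numpy as np

H = 8
rng = np.random.default_rng(0)
W = rng.standard_normal((H, H))                  # stand-in for non-attention weights

def attend(q, K, V):
    s = q @ K.T
    w = np.exp(s - s.max()); w /= w.sum()
    return w @ V

# Two requests in the same batch with different context lengths (L = 5 and L = 9).
queries = rng.standard_normal((2, H))            # B x H: batchable everywhere
kv_caches = [(rng.standard_normal((5, H)), rng.standard_normal((5, H))),
             (rng.standard_normal((9, H)), rng.standard_normal((9, H)))]

hidden = queries @ W                             # non-attention op runs as one batch

# Split: run attention separately per request, since L varies across the batch.
per_request = [attend(hidden[i:i + 1], K, V)
               for i, (K, V) in enumerate(kv_caches)]

# Merge: stack the outputs back into a B x H matrix for the next layers.
merged = np.vstack(per_request)
print(merged.shape)                              # (2, 8)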

Architecture

Orca, like all modern machine learning serving and training systems, employs a distributed architecture to leverage the power of multiple GPUs for efficient inference.

Traditionally, ML serving systems could be scaled by simply creating N copies of the entire model across GPUs and distributing incoming requests among these copies. However, with the increasing size and complexity of contemporary ML models, they often exceed the memory capacity of a single GPU. This necessitates a more sophisticated approach where both the model's trained weight parameters and the inference requests themselves must be split.

Partitioning

Orca utilizes two primary methods for model splitting:
  • Inter-layer split: This approach distributes different layers of the model across various GPUs. The output of one layer on one GPU is then communicated as input to the next layer on another GPU.
  • Intra-layer split: This method involves splitting within a single layer. ML models are fundamentally composed of weight matrices and bias vectors. These often share a common dimension, for instance H, representing the hidden state size or embedding vector length. Intra-layer splitting divides these weights and biases along this H dimension. For example, a key matrix Wk with dimensions (1, H) could be split into (1, H1), (1, H2), and (1, H3), where H1 + H2 + H3 = H. Similarly, the input query vector is also split. Essentially, this approach distributes the computation of embedding features.
For instance, consider a scenario where an inter-layer split is combined with intra-layer parallelism. A given request might first enter GPU 1, GPU 2, and GPU 3. Each of these GPUs would be responsible for processing a specific portion (e.g., 1/3) of the embeddings for the first and second layers of the model. The results from these three GPUs are then sent to GPU 4, GPU 5, and GPU 6, respectively, which then process the third and fourth layers of the model. This coordinated effort allows all six GPUs to collectively process the entire model for the request.
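A minimal sketch of the intra-layer idea, shown here as a column split of an H x H weight matrix across three hypothetical "GPUs" (plain NumPy arrays), whose partial outputs are concatenated to recover the full result; the (H, H) shape is an illustrative simplification.

import numpy as np

H = 12
rng = np.random.default_rng(0)
x = rng.standard_normal((1, H))          # input vector
W = rng.standard_normal((H, H))          # full weight matrix

# Intra-layer split: each "GPU" holds a slice of the output features,
# with H1 + H2 + H3 = H (here 4 columns each).
shards = np.split(W, 3, axis=1)          # three H x (H/3) shards
partials = [x @ w for w in shards]       # each GPU computes its own slice

full = np.concatenate(partials, axis=1)  # combining gives the same result as x @ W
print(np.allclose(full, x @ W))          # True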

The Request Execution Flow

The scheduler creates a batch of requests and sends the batch over to the engine master. The master then sends these requests, along with crucial control messages, to the first worker's controller. The control message for each request specifies the index of the next token to be generated, which is vital for identifying the correct input for the current iteration.

The first worker's controller schedules this work across its GPUs, taking into account any configured intra-layer parallelism. It passes the request ID and current token index to the GPUs, allowing them to correctly identify inputs and manage key-value pairs within the attention layers.

After scheduling, the first controller forwards the requests and control messages to the next controller in the processing pipeline, which then similarly schedules the work on its GPUs. GPUs in subsequent layers wait for the completion of the preceding layers, receiving their inputs. Once all GPUs in a given layer have completed their processing, they send the newly generated tokens back to their respective controllers, which then forward them to the engine master.


This distributed approach allows Orca to handle the massive computational demands of modern ML models by efficiently distributing the model across multiple high-performance GPUs.

Communication

GPUs communicate their outputs to subsequent GPUs in the processing chain using NCCL. For communication between nodes (e.g., engine master to controller messages), Orca relies on gRPC.

Pipelined Parallelism

Let's explore how pipelining works in both Orca and FasterTransformer. Both systems partition the transformer model into distinct stages, where each stage typically represents a transformer layer.

Pipelining in Orca

Orca achieves pipeline parallelism by feeding the next batch through the layers as the previous batch exits those layers. This is made possible by Orca's iteration-level scheduling. With this approach, an entire batch is designed to pass through all layers only once before being sent back to the scheduler for potential rescheduling.


As depicted in the diagram, batch CD can enter stage 1 in Orca as soon as batch AB leaves stage 1. This continuous flow helps keep the pipeline busy.

Pipelining in FasterTransformer

In contrast, FasterTransformer's execution engine processes an entire batch to completion before that batch leaves the engine. This means all requests within a batch must be fully processed. Consequently, only a single batch moves through the different stages at any given time. This can lead to underutilization of GPU capacity, as stage Si−1 remains idle while stage Si is in use.

To mitigate this, FasterTransformer employs microbatching, where larger batches are split into smaller microbatches. These microbatches then move through the stages in a pipelined fashion, which keeps all pipeline stages busy and improves overall hardware utilization. However, this approach comes with certain drawbacks:
  • It doesn't fundamentally solve the problem of latency. The execution engine's output still waits for all requests across all microbatches to complete, due to its request-level scheduling.
  • It can negatively impact latency. Each batch effectively has to pass through more stages. If there are N microbatches, this can add up to N - 1 additional stages of sequential processing for the entire batch to complete.

Evaluation

Microbenchmarks 

In this scenario, iteration-level scheduling was disabled in Orca for a direct comparison with FasterTransformer.
  • Orca performed slightly worse than FasterTransformer. This was mostly because Orca doesn't batch the attention layers, which adds overhead.
  • However, as the model size increases, requiring deployment across multiple GPUs and machines (and hence cross-machine communication), Orca outperforms FasterTransformer by 47% (175B model, 16 GPUs). This is primarily because Orca uses NCCL for cross-GPU communication of data only; control messages travel on a separate gRPC channel.

Synthetic Requests

  • The per-request latency increased as the model size increased due to increased communication across GPUs.
  • Due to miscellaneous overheads, the per-request latency also increased as the throughput increased, up to a point after which latency shot up sharply.
  • Orca offered much higher throughput support than FasterTransformer. In both the 175B and 341B models, Orca supported at least 2 req/s without latency significantly shooting up.
  • With the 101B model and very low throughput, FasterTransformer performed better than Orca in terms of latency. This was because FasterTransformer used batching in Attention layers.

Effect of Batch Size

As batch size increased, Orca offered higher throughput. The per-request latency remained constant across different batch sizes. The tipping point for each batch size was at different throughput levels. With a batch size of 32, 6 req/s throughput was achievable.

As batch size increased, FasterTransformer also provided better throughput. Microbatching was not very useful; in fact, maximum throughput was achieved by setting microbatch sizes equal to the batch size. Increasing batch size can help, but this also increases the likelihood of including a request that has significantly more tokens, thereby slowing down the entire pipeline.

Paper Review

This paper is an excellent starting point for anyone looking to build ML systems. It offers a detailed look at how a system for a cutting-edge ML model, the Transformer, was constructed. In doing so, it provides valuable insights into how systems can be optimized. While simplification often involves breaking down monolithic systems, achieving scale through optimization can sometimes benefit from tight integration.

I highly recommend this paper to any ML and systems enthusiasts. Anyone with a solid background in neural networks and computer systems will find it easy to follow.

