Understanding KV Cache Sharing Across Heads and Layers in Transformers

Efficient memory and computation usage during inference is critical for transformer-based language models. A significant optimization technique involves the handling of the Key-Value (KV) cache. This guide explains the principles of KV cache sharing across attention heads and layers, offering insights into various attention mechanisms like Multi-Head Attention (MHA), Multi-Query Attention (MQA), Grouped Query Attention (GQA), and cross-layer sharing.

KV Cache and Its Space Complexity

During autoregressive inference, transformers store the keys (K) and values (V) computed by each attention layer in a KV cache. Caching the K and V matrices for every token processed so far allows them to be reused in subsequent attention computations, instead of being recomputed from scratch for every new token.

For a transformer model with:

  • L layers
  • τ (tau) attention heads per layer
  • d_h dimension per attention head (head dimension)
  • m tokens in the sequence

The space complexity of storing the full KV cache is:

O(L * τ * d_h * m)

This becomes increasingly demanding for large models and long input sequences, significantly impacting memory consumption.
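
As a rough, back-of-the-envelope illustration (the model configuration and fp16 assumption below are hypothetical, not taken from any specific model), the following sketch plugs concrete numbers into this formula:

```python
# Hypothetical configuration; all numbers are illustrative assumptions.
L = 32               # layers
tau = 32             # attention heads per layer
d_h = 128            # dimension per attention head
m = 4096             # tokens in the sequence
bytes_per_elem = 2   # fp16/bf16 storage

# Factor of 2 accounts for storing both K and V.
cache_bytes = 2 * L * tau * d_h * m * bytes_per_elem
print(f"KV cache: {cache_bytes / 1024**3:.2f} GiB")  # 2.00 GiB for this configuration
```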

Optimizing the KV Cache

There are two primary strategies to reduce the memory footprint of the KV cache:

  1. Reducing the Token Count (m): Techniques like sliding window attention restrict attention to a local window of fixed size m_w. This reduces the number of tokens for which K and V need to be stored (see the sketch after this list), leading to a reduced space complexity of:
    O(L * τ * d_h * m_w)
  2. Sharing Keys and Values: This involves sharing the K and V matrices across different attention heads within a layer, or even across different layers of the transformer.
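
As a minimal sketch of the first strategy (the class name and interface below are illustrative assumptions, not a reference implementation), a sliding-window cache simply evicts entries older than the window:

```python
from collections import deque

class SlidingWindowKVCache:
    """Keeps K/V entries only for the most recent m_w tokens (illustrative sketch)."""

    def __init__(self, window_size: int):
        self.window_size = window_size           # m_w
        self.keys = deque(maxlen=window_size)    # oldest entries are evicted automatically
        self.values = deque(maxlen=window_size)

    def append(self, k, v):
        # k, v: the key/value vectors computed for the newest token
        self.keys.append(k)
        self.values.append(v)

    def get(self):
        # A real implementation would keep contiguous tensors and track
        # absolute positions; lists are enough to show the eviction behavior.
        return list(self.keys), list(self.values)
```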

Attention Mechanisms and KV Cache Sharing

Different attention mechanisms leverage KV cache sharing to varying degrees:

Multi-Head Attention (MHA)

In standard MHA, each attention head independently computes its own set of queries (Q), keys (K), and values (V). The output is a concatenation and linear transformation of the outputs from each head.

The computation for each head j is:

head_j = Attention(q_j, K_j, V_j)

The final output is then:

Output = Merge(head_1, ..., head_τ) * W_head

This requires storing a separate (K, V) pair for each of the τ heads in every layer. Consequently, the KV cache size remains:

O(L * τ * d_h * m)

MHA offers high model expressiveness but is memory-intensive.
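
To make the per-head storage concrete, here is a shape-level sketch of an MHA cache (the dimensions are illustrative assumptions):

```python
import numpy as np

L, tau, d_h, m = 4, 8, 64, 128   # illustrative: layers, heads, head dim, tokens

# MHA: every layer stores a separate K tensor and V tensor for each head,
# i.e. shape (tau, m, d_h) for keys and the same for values, in all L layers.
mha_k_cache = [np.zeros((tau, m, d_h), dtype=np.float16) for _ in range(L)]
mha_v_cache = [np.zeros((tau, m, d_h), dtype=np.float16) for _ in range(L)]

total_elems = 2 * L * tau * m * d_h   # matches O(L * τ * d_h * m); the 2 covers K and V
print(total_elems)                    # 524288 elements for this toy configuration
```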

Multi-Query Attention (MQA)

MQA optimizes KV cache usage by sharing a single set of keys (K) and values (V) across all attention heads within a layer. Each head still computes its own distinct queries (Q).

The computation for each head j becomes:

head_j = Attention(q_j, K, V)

Here, all τ heads utilize the same (K, V) cache. This significantly reduces the space complexity to:

O(L * d_h * m)

MQA dramatically reduces memory usage while retaining multi-head diversity through separate queries.
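
A minimal single-layer sketch of MQA-style attention for one decoding step (shapes and variable names are assumptions for illustration):

```python
import numpy as np

tau, d_h, m = 8, 64, 128           # illustrative: heads, head dim, cached tokens

q = np.random.randn(tau, 1, d_h)   # one query per head for the current token
K = np.random.randn(m, d_h)        # single K cache shared by all heads
V = np.random.randn(m, d_h)        # single V cache shared by all heads

scores = q @ K.T / np.sqrt(d_h)                   # (tau, 1, m); K broadcasts across heads
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the m cached tokens
heads = weights @ V                               # (tau, 1, d_h), one output per head
print(heads.shape)
```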

Grouped Query Attention (GQA)

GQA is a hybrid approach that balances the expressiveness of MHA with the efficiency of MQA. It partitions the τ attention heads into n_g groups, and each group shares one set of keys and values.

If g(j) denotes the group ID for head j, the computation is:

head_j = Attention(q_j, K_{g(j)}, V_{g(j)})

The KV cache size becomes:

O(L * n_g * d_h * m)

This approach allows tuning the number of groups (n_g) to strike a balance between computational efficiency and model performance:

  • When n_g = τ, GQA behaves like standard MHA.
  • When n_g = 1, GQA behaves like MQA.

GQA offers a flexible trade-off, often achieving performance close to MHA with significantly reduced memory requirements compared to MHA.
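
As a sketch (the group assignment and shapes are illustrative assumptions), GQA can be expressed by mapping each head j to a group g(j) and indexing a smaller K/V cache:

```python
import numpy as np

tau, n_g, d_h, m = 8, 2, 64, 128     # illustrative: heads, groups, head dim, tokens
assert tau % n_g == 0
g = np.arange(tau) // (tau // n_g)   # g(j): group id of head j, e.g. [0,0,0,0,1,1,1,1]

q = np.random.randn(tau, 1, d_h)     # per-head queries for the current token
K = np.random.randn(n_g, m, d_h)     # one K cache per group
V = np.random.randn(n_g, m, d_h)     # one V cache per group

K_per_head = K[g]                    # (tau, m, d_h): head j attends over K_{g(j)}
V_per_head = V[g]
scores = q @ K_per_head.transpose(0, 2, 1) / np.sqrt(d_h)   # (tau, 1, m)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
heads = weights @ V_per_head         # (tau, 1, d_h)

# Setting n_g = tau recovers MHA (one group per head); n_g = 1 recovers MQA.
```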

Cross-Layer KV Cache Sharing

Beyond sharing across heads within a layer, keys and values can also be shared across different layers. This strategy is related to concepts like shared weights and shared activations. It enables:

  • Reduced Memory Usage: Higher layers reuse the KV cache computed by lower layers, thus avoiding redundant storage.
  • Faster Inference: Fewer unique KV caches to manage and compute.
  • Consistent Representations: Promotes more consistent feature representations across layers.

In a cross-layer sharing model, a query in a given layer can attend directly to the keys and values from a lower layer's cache, or to a single cache shared across multiple layers. This is particularly effective in models where representations evolve predictably across layers.
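
A minimal sketch of one possible cross-layer scheme (the pairwise grouping of layers below is an assumption for illustration, not a specific published design): consecutive layers point at the same cache object, so K and V are stored only once per group.

```python
# Illustrative cross-layer sharing: every `share` consecutive layers reuse one cache.
L, share = 8, 2   # 8 layers; pairs of adjacent layers share a KV cache

class LayerKVCache:
    def __init__(self):
        self.keys, self.values = [], []

caches = [LayerKVCache() for _ in range(L // share)]
layer_to_cache = {layer: caches[layer // share] for layer in range(L)}

assert layer_to_cache[0] is layer_to_cache[1]       # layers 0 and 1 share one cache
assert layer_to_cache[0] is not layer_to_cache[2]   # layer 2 belongs to the next group
# Only L // share unique caches exist, so KV memory drops by the sharing factor.
```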

Visualization of Attention Mechanisms

The following descriptions summarize how K and V are handled in different attention mechanisms:

  • (a) Single-Head Attention: A single set of Q, K, and V.
  • (b) Multi-Head Attention (MHA): Independent Q, K, and V for each of the τ heads.
  • (c) Multi-Query Attention (MQA): Unique queries (Q) per head, but a single shared K and V for all heads.
  • (d) Grouped Query Attention (GQA): Unique queries (Q) per head, with n_g shared K and V sets, where each set is used by a group of heads.
  • (e) Cross-Layer Attention: Higher layers utilize KV caches computed by earlier (lower) layers.

Summary Table

| Attention Type | Query Storage | Key/Value Storage | Cache Size | Notes |
| --- | --- | --- | --- | --- |
| Single-Head | 1 set | 1 set | O(L * d_h * m) | Simple, but limited expressiveness. |
| Multi-Head (MHA) | τ sets | τ sets | O(L * τ * d_h * m) | Full expressiveness, highest memory cost. |
| Multi-Query (MQA) | τ sets | 1 shared set | O(L * d_h * m) | Highly efficient, potential minor quality trade-off. |
| Grouped Query (GQA) | τ sets | n_g shared sets (1 <= n_g <= τ) | O(L * n_g * d_h * m) | Balances speed, memory, and quality; n_g = 1 is MQA, n_g = τ is MHA. |
| Cross-Layer Sharing | Varies | Shared across layers | Varies (often reduced) | Reduces memory by reusing caches from lower layers. |

Conclusion

KV cache optimization in transformers plays a crucial role in enabling efficient inference, especially for large language models handling long sequences. Strategies like MQA, GQA, and cross-layer sharing significantly reduce memory complexity without severely impacting model expressiveness. These methods are widely adopted in production systems and form the backbone of scalable, high-performance transformer architectures.

Technical Keywords

  • KV cache optimization in transformers
  • Multi-query attention (MQA) in LLMs
  • Grouped query attention (GQA) explained
  • Transformer KV cache space complexity
  • Cross-layer KV cache sharing in transformers
  • Efficient memory usage in transformers
  • Multi-head attention vs. multi-query attention
  • Inference optimization in large language models
  • KV cache sharing across attention heads
  • Memory-efficient transformer architecture
  • Sliding window attention

Potential Interview Questions

  • What is the KV cache in transformers, and why is it necessary during inference?
  • Compare Multi-Head Attention (MHA) and Multi-Query Attention (MQA) in terms of cache usage and performance.
  • How does Grouped Query Attention (GQA) strike a balance between MHA and MQA?
  • Explain the space complexity of the KV cache in a standard transformer model with L layers, τ heads, d_h hidden dimension, and m tokens.
  • How does sliding window attention help reduce KV cache memory usage, and what is its complexity?
  • What are the trade-offs between MHA, MQA, and GQA in transformer models concerning memory, computation, and output quality?
  • Describe how cross-layer KV cache sharing can improve transformer efficiency.
  • What does the cache size become in GQA when the number of groups (n_g) equals 1 or τ?
  • Why is it beneficial to share keys and values across attention heads, and which mechanisms achieve this?
  • In what scenarios would you prioritize using cross-layer KV cache sharing over other KV cache optimization techniques?