
Add and Norm Component

The Add and Norm component is a crucial element of the Transformer encoder architecture. In each encoder layer, it is applied after each of the two primary sublayers:

  1. Multi-Head Attention Sublayer: Processes input sequences to capture dependencies between different positions.
  2. Feedforward Network Sublayer: A position-wise fully connected network that further processes the output of the attention sublayer.

This component plays a vital role in maintaining model stability, improving training efficiency, and enabling the successful training of deep Transformer networks.

Function of Add and Norm

The Add and Norm operation performs two essential tasks:

1. Residual Connection (Add)

  • Mechanism: The output of a sublayer is added directly to its original input.

  • Purpose: This is known as a residual connection. It allows information from earlier layers to be preserved and passed through to subsequent layers. This significantly helps to prevent the vanishing gradient problem during backpropagation, making it easier to train deeper networks.

    Let $x$ be the input to a sublayer, and $\text{Sublayer}(x)$ be the output of that sublayer. The residual connection computes: $$x + \text{Sublayer}(x)$$

2. Layer Normalization (Norm)

  • Mechanism: The result of the addition (the sublayer's output plus its original input) is then passed through layer normalization.

  • Purpose: Layer normalization stabilizes the learning process by reducing internal covariate shift. It normalizes the activations across the feature dimension for each position (token) independently, ensuring that the input distributions to subsequent layers remain consistent. This leads to faster convergence and more stable training.

    For a given position's activations, layer normalization computes the mean and variance across the features, normalizes the activations using these statistics, and then applies a learned scale and shift (see the formulas after this list).
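
Concretely, for the post-sublayer vector $h = x + \text{Sublayer}(x)$ with feature (model) dimension $d$, the standard layer normalization formulation is the following, where $\gamma$ and $\beta$ are learned scale and shift vectors and $\epsilon$ is a small constant added for numerical stability:

$$\mu = \frac{1}{d}\sum_{i=1}^{d} h_i, \qquad \sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (h_i - \mu)^2, \qquad \text{LayerNorm}(h)_i = \gamma_i \, \frac{h_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta_i$$

The complete Add and Norm operation is therefore $\text{LayerNorm}\bigl(x + \text{Sublayer}(x)\bigr)$, applied independently at every position in the sequence.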

Where Add and Norm is Applied

In the Transformer encoder, the Add and Norm component is strategically applied in two key locations:

  1. After the Multi-Head Attention Sublayer:

    • The input to the multi-head attention mechanism is added to its output.
    • The combined result is then normalized.
  2. After the Feedforward Network Sublayer:

    • The input to the feedforward network is added to its output.
    • The combined result is then normalized.
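
Written out for a single encoder layer with input $x$, these two applications are (here $\text{MultiHead}$ denotes self-attention, in which $x$ serves as queries, keys, and values, and $\text{FFN}$ denotes the position-wise feedforward network):

$$\begin{aligned} y &= \text{LayerNorm}\bigl(x + \text{MultiHead}(x)\bigr) \\ z &= \text{LayerNorm}\bigl(y + \text{FFN}(y)\bigr) \end{aligned}$$

where $z$ is the output of the encoder layer.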

These operations are often visually represented by dotted lines in diagrams of the encoder block, illustrating how the original input to a sublayer is "connected back" to its output before normalization.
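
To make this concrete, here is a minimal PyTorch sketch of a post-norm encoder layer in which Add and Norm follows each sublayer. The class name PostNormEncoderLayer and the hyperparameters (d_model=512, num_heads=8, d_ff=2048) are illustrative assumptions, not taken from any particular library implementation.

```python
import torch
import torch.nn as nn


class PostNormEncoderLayer(nn.Module):
    """Minimal Transformer encoder layer: each sublayer is followed by Add & Norm."""

    def __init__(self, d_model: int = 512, num_heads: int = 8,
                 d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads,
                                          dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)  # Add & Norm after attention
        self.norm2 = nn.LayerNorm(d_model)  # Add & Norm after feedforward
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 1) Multi-head self-attention sublayer, then Add & Norm
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + self.dropout(attn_out))  # residual add, then layer norm

        # 2) Position-wise feedforward sublayer, then Add & Norm
        ffn_out = self.ffn(x)
        x = self.norm2(x + self.dropout(ffn_out))   # residual add, then layer norm
        return x


# Usage: a batch of 2 sequences, 10 tokens each, model dimension 512
layer = PostNormEncoderLayer()
tokens = torch.randn(2, 10, 512)
out = layer(tokens)
print(out.shape)  # torch.Size([2, 10, 512])
```

The dropout applied to each sublayer's output before the addition follows the original Transformer's training setup; the essential pattern is the residual add followed by layer normalization.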

Benefits of Add and Norm

The integration of Add and Norm offers several significant advantages:

  • Improved Gradient Flow: Residual connections facilitate the flow of gradients through the network, mitigating the vanishing gradient problem and allowing for effective learning in deep architectures (see the short experiment after this list).
  • Preservation of Contextual Information: By adding the original input back to the sublayer's output, crucial contextual information from earlier stages of processing is maintained.
  • Training Stability and Faster Convergence: Layer normalization helps to stabilize the internal distributions of activations, leading to more consistent learning signals and accelerating the model's convergence.
  • Support for Deeper Architectures: The combination of residual connections and layer normalization enables the construction and effective training of much deeper Transformer models without significant performance degradation.
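
To make the gradient-flow benefit concrete, the toy experiment below (a hypothetical setup, not from the original text) stacks many small tanh layers with and without residual connections and compares the gradient norm reaching the first layer after one backward pass. The residual stack typically retains a usable gradient, while the plain stack's gradient shrinks toward zero as depth grows.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


def first_layer_grad_norm(depth: int, width: int, residual: bool) -> float:
    """Stack `depth` Linear+Tanh blocks and return the gradient norm
    at the first layer's weights after one backward pass."""
    layers = nn.ModuleList([nn.Linear(width, width) for _ in range(depth)])
    h = torch.randn(8, width)  # a small random batch
    for layer in layers:
        out = torch.tanh(layer(h))
        h = h + out if residual else out  # residual add vs. plain stacking
    h.sum().backward()
    return layers[0].weight.grad.norm().item()


depth, width = 50, 64
print("plain stack   :", first_layer_grad_norm(depth, width, residual=False))
print("residual stack:", first_layer_grad_norm(depth, width, residual=True))
# Typically, the plain stack's first-layer gradient is many orders of
# magnitude smaller than the residual stack's.
```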

Conclusion

The Add and Norm component is a synergistic combination of residual connections and layer normalization. Its primary function is to enhance training stability and efficiency within Transformer models. By preserving the input signal through residual connections and standardizing activation distributions with layer normalization, it ensures that each encoder sublayer can learn effectively. Understanding this component completes the picture of the building blocks within an encoder block, paving the way to comprehending the encoder's functionality as a whole.