LLM Internals: Transformers, Attention & Positional Encoding

This document provides a comprehensive overview of the core components that power Large Language Models (LLMs), focusing on the Transformer architecture, the self-attention mechanism, and positional encoding.

The Transformer Architecture

The Transformer is a groundbreaking neural network architecture introduced by Vaswani et al. in their 2017 paper, "Attention Is All You Need." It revolutionized sequence processing by eschewing recurrent layers (like RNNs and LSTMs) in favor of attention mechanisms, enabling parallel processing of input sequences.

Key Components of the Transformer

  • Input Embedding: Converts input tokens (words or sub-words) into dense vector representations.
  • Positional Encoding: Injects information about the order of tokens in the sequence, as the self-attention mechanism itself is permutation-invariant.
  • Multi-Head Self-Attention: The core mechanism that allows the model to weigh the importance of different tokens in the input sequence when processing each token.
  • Feedforward Neural Networks: Position-wise fully connected networks that process the output of the attention layers.
  • Layer Normalization and Residual Connections: Techniques used to stabilize training and allow for deeper networks by preventing vanishing gradients and aiding information flow.
  • Encoder and Decoder Stacks (in full Transformer models): The complete Transformer architecture consists of an encoder stack (for processing input sequences) and a decoder stack (for generating output sequences). Many modern LLMs utilize variations of these, such as decoder-only (e.g., GPT) or encoder-only (e.g., BERT) architectures.

Simplified Transformer Encoder Layer (PyTorch Example)

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleTransformerEncoderLayer(nn.Module):
    """
    A simplified implementation of a Transformer Encoder Layer.
    """
    def __init__(self, d_model: int, nhead: int, dim_feedforward: int = 2048, dropout: float = 0.1):
        """
        Args:
            d_model: The dimensionality of the input and output features.
            nhead: The number of attention heads.
            dim_feedforward: The dimension of the feedforward network model.
            dropout: The dropout probability.
        """
        super().__init__()
        # Multi-head self-attention layer; by default nn.MultiheadAttention
        # expects inputs of shape (seq_len, batch_size, d_model)
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        # Feedforward network layers
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        # Layer normalization and residual connection components
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, src: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Forward pass of the Transformer Encoder Layer.

        Args:
            src: The input tensor. Shape: (seq_len, batch_size, d_model)

        Returns:
            A tuple containing:
                - The processed output tensor. Shape: (seq_len, batch_size, d_model)
                - The attention weights. Shape: (batch_size, seq_len, seq_len)
        """
        # Self-attention block
        # src2 is the output of multi-head attention, attn_weights are the attention scores
        src2, attn_weights = self.self_attn(src, src, src)
        # Add residual connection and apply dropout
        src = src + self.dropout1(src2)
        # Apply layer normalization
        src = self.norm1(src)

        # Feedforward block
        # Apply the first linear layer, ReLU activation, dropout, and the second linear layer
        src2 = self.linear2(self.dropout(F.relu(self.linear1(src))))
        # Add residual connection and apply dropout
        src = src + self.dropout2(src2)
        # Apply layer normalization
        src = self.norm2(src)

        return src, attn_weights

# Example usage:
seq_len, batch_size, d_model = 5, 1, 16
src = torch.rand(seq_len, batch_size, d_model) # Input sequence

# Initialize the encoder layer
encoder_layer = SimpleTransformerEncoderLayer(d_model=d_model, nhead=4)

# Perform the forward pass
output, attn_weights = encoder_layer(src)

print(f"Encoder output shape: {output.shape}")
print(f"Attention weights shape: {attn_weights.shape}")

The Self-Attention Mechanism

Self-attention is the cornerstone of the Transformer, allowing it to dynamically weigh the importance of each token in a sequence with respect to every other token. This enables the model to capture long-range dependencies and contextual relationships effectively.

How Self-Attention Works

For each input token, three vectors are computed:

  • Query (Q): Represents what information the current token is looking for.
  • Key (K): Represents what information each token contains.
  • Value (V): Represents the actual content of each token.

These vectors are derived by multiplying the input embeddings with learned weight matrices ($W_Q$, $W_K$, $W_V$).
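
As a concrete sketch (the variable names and dimensions below are illustrative assumptions, not taken from any particular model), these projections can be implemented as plain linear layers:

import torch
import torch.nn as nn

d_model, d_k = 16, 16       # illustrative dimensions
x = torch.rand(4, d_model)  # embeddings for a 4-token sequence

# Learned projection matrices W_Q, W_K, W_V, realized as linear layers without bias
W_Q = nn.Linear(d_model, d_k, bias=False)
W_K = nn.Linear(d_model, d_k, bias=False)
W_V = nn.Linear(d_model, d_k, bias=False)

Q, K, V = W_Q(x), W_K(x), W_V(x)  # each of shape (4, d_k)
print(Q.shape, K.shape, V.shape)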

Attention Formula (Scaled Dot-Product Attention)

The core of the self-attention mechanism is the scaled dot-product attention, calculated as:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V $$

Where:

  • $Q, K, V \in \mathbb{R}^{n \times d_k}$ ($n$ = sequence length, $d_k$ = dimension of key/query vectors).
  • $Q K^T$ computes the dot product between all query and key vectors, resulting in a score matrix indicating the relevance of each token to every other token.
  • The division by $\sqrt{d_k}$ is a scaling factor that keeps the dot products from becoming too large; without it, the softmax saturates and yields extremely small gradients.
  • The softmax function converts these scores into attention weights, summing to 1 for each query.
  • These weights are then used to compute a weighted sum of the Value vectors, producing the output of the attention layer.

Benefits of Self-Attention

  • Captures Long-Range Dependencies: Can directly relate tokens regardless of their distance in the sequence.
  • Enables Parallel Processing: Computations for each token are independent, allowing for significant parallelization.
  • Scales to Moderate Sequence Lengths: The computational complexity is $O(n^2 d)$, where $n$ is the sequence length and $d$ the model dimension, which is manageable for moderate sequence lengths but grows quadratically for very long contexts.

Multi-Head Attention

Instead of performing a single attention operation, Multi-Head Attention runs multiple attention mechanisms ("heads") in parallel. Each head learns different linear projections of the queries, keys, and values, allowing the model to jointly attend to information from different representation subspaces at different positions.

The outputs from these heads are concatenated and then linearly projected to produce the final output.

Multi-Head Attention Formula:

$$ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O $$ $$ \text{where } \text{head}_i = \text{Attention}(Q W_Q^i, K W_K^i, V W_V^i) $$

Here, $W_Q^i$, $W_K^i$, and $W_V^i$ are the weight matrices for the $i$-th attention head, and $W^O$ is the final output projection matrix.
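
A minimal sketch of multi-head attention is shown below. Fusing the per-head projections $W_Q^i$, $W_K^i$, $W_V^i$ into single linear layers, and the class name itself, are implementation conveniences assumed here for brevity rather than details prescribed by the original paper:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMultiHeadAttention(nn.Module):
    """
    A simplified multi-head self-attention module (d_model must be divisible by nhead).
    """
    def __init__(self, d_model: int, nhead: int):
        super().__init__()
        assert d_model % nhead == 0
        self.nhead = nhead
        self.d_head = d_model // nhead
        # Per-head projections W_Q^i, W_K^i, W_V^i fused into single linear layers
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)  # the output projection W^O

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch_size, seq_len, d_model)
        batch_size, seq_len, _ = x.shape

        def split_heads(t: torch.Tensor) -> torch.Tensor:
            # (batch, seq_len, d_model) -> (batch, nhead, seq_len, d_head)
            return t.view(batch_size, seq_len, self.nhead, self.d_head).transpose(1, 2)

        Q = split_heads(self.q_proj(x))
        K = split_heads(self.k_proj(x))
        V = split_heads(self.v_proj(x))

        # Scaled dot-product attention within each head
        scores = Q @ K.transpose(-2, -1) / (self.d_head ** 0.5)
        weights = F.softmax(scores, dim=-1)  # (batch, nhead, seq_len, seq_len)
        context = weights @ V                # (batch, nhead, seq_len, d_head)

        # Concatenate the heads and apply the output projection
        context = context.transpose(1, 2).contiguous().view(batch_size, seq_len, -1)
        return self.out_proj(context)

# Example usage:
x = torch.rand(1, 5, 16)
mha = SimpleMultiHeadAttention(d_model=16, nhead=4)
print(mha(x).shape)  # torch.Size([1, 5, 16])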

Example for Scaled Dot-Product Attention (PyTorch)

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor, mask: torch.Tensor = None) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Computes scaled dot-product attention.

    Args:
        Q: Query tensor. Shape: (batch_size, seq_len, d_k)
        K: Key tensor. Shape: (batch_size, seq_len, d_k)
        V: Value tensor. Shape: (batch_size, seq_len, d_k)
        mask: Optional mask tensor to prevent attention to certain positions.
              Shape: (batch_size, seq_len, seq_len)

    Returns:
        A tuple containing:
            - The attention output. Shape: (batch_size, seq_len, d_k)
            - The attention weights. Shape: (batch_size, seq_len, seq_len)
    """
    d_k = Q.size(-1)  # Dimension of key vectors
    # Calculate attention scores: (Q * K^T) / sqrt(d_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / torch.sqrt(torch.tensor(d_k, dtype=torch.float32))

    # Apply mask if provided
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))

    # Apply softmax to get attention weights
    attention_weights = F.softmax(scores, dim=-1)

    # Compute the output by multiplying attention weights with Value vectors
    output = torch.matmul(attention_weights, V)

    return output, attention_weights

# Example usage:
batch_size, seq_len, d_k = 1, 4, 8
Q = torch.rand(batch_size, seq_len, d_k)
K = torch.rand(batch_size, seq_len, d_k)
V = torch.rand(batch_size, seq_len, d_k)

output, attn_weights = scaled_dot_product_attention(Q, K, V)

print(f"Attention output shape: {output.shape}")
print(f"Attention weights shape: {attn_weights.shape}")

Positional Encoding

Since the self-attention mechanism treats input tokens as a set (i.e., it's permutation-invariant), it doesn't inherently capture the order of words in a sequence. Positional Encoding is crucial for injecting this positional information.

Problem Addressed

Unlike Recurrent Neural Networks (RNNs) or Long Short-Term Memory networks (LSTMs) that process sequences token by token, Transformers process all tokens in parallel. Without explicit positional information, the model would not know the order of words, treating "the cat sat on the mat" the same as "the mat sat on the cat."
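
The toy check below (an illustrative sketch, not code from the original paper) makes this concrete: feeding a shuffled sequence into plain attention simply shuffles the output, so the computation by itself carries no notion of order.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.rand(1, 4, 8)            # (batch_size, seq_len, d): embeddings for 4 tokens
perm = torch.tensor([2, 0, 3, 1])  # an arbitrary reordering of the tokens

def attend(x: torch.Tensor) -> torch.Tensor:
    # Plain scaled dot-product self-attention with Q = K = V = x (no positional signal)
    scores = x @ x.transpose(-2, -1) / (x.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ x

out = attend(x)
out_perm = attend(x[:, perm, :])
# The output for the shuffled input is just the shuffled original output
print(torch.allclose(out[:, perm, :], out_perm, atol=1e-6))  # True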

Solution: Positional Encoding

Positional Encoding adds a vector to the input embeddings that encodes the position of each token. This allows the model to differentiate between tokens at different positions.

Sinusoidal Positional Encoding Formula

The original Transformer paper proposed using fixed sinusoidal functions of different frequencies to generate these positional vectors.

$$ \text{PE}(\text{pos}, 2i) = \sin\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right) $$ $$ \text{PE}(\text{pos}, 2i+1) = \cos\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right) $$

Where:

  • pos: The position of the token in the sequence (0, 1, 2, ...).
  • i: The index of the dimension pair within the embedding vector (0, 1, ..., $d_{\text{model}}/2 - 1$); dimensions $2i$ and $2i+1$ share the same frequency.
  • $d_{\text{model}}$: The total dimensionality of the embedding vectors.

This formulation ensures that for any fixed offset $k$, $\text{PE}(\text{pos}+k)$ can be represented as a linear function of $\text{PE}(\text{pos})$, which helps the model attend to relative positions.
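
Concretely, applying the standard angle-addition identities to the definitions above, each sine/cosine pair at position $\text{pos}+k$ is a fixed rotation of the pair at position $\text{pos}$:

$$ \begin{pmatrix} \text{PE}(\text{pos}+k, 2i) \\ \text{PE}(\text{pos}+k, 2i+1) \end{pmatrix} = \begin{pmatrix} \cos(k\omega_i) & \sin(k\omega_i) \\ -\sin(k\omega_i) & \cos(k\omega_i) \end{pmatrix} \begin{pmatrix} \text{PE}(\text{pos}, 2i) \\ \text{PE}(\text{pos}, 2i+1) \end{pmatrix}, \quad \text{where } \omega_i = \frac{1}{10000^{2i/d_{\text{model}}}} $$

Because the rotation matrix depends only on the offset $k$, relative positions are encoded in a form the model can learn to exploit.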

Alternative Methods

  • Learnable Positional Embeddings: Similar to word embeddings, these are learned during training. They can be more flexible but might not generalize as well to sequences longer than seen during training.
  • Rotary Positional Encoding (RoPE): Used in models like LLaMA and GPT-NeoX, RoPE encodes positional information by rotating query and key vectors based on their position. It is known for its effectiveness in capturing relative positional information; a minimal sketch of the rotation idea follows this list.
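
The sketch below illustrates the core rotation behind RoPE in the interleaved-pair formulation; the function name apply_rope and the (seq_len, d) layout are assumptions made for illustration, and production implementations (e.g. in LLaMA) differ in details such as tensor layout and caching:

import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """
    Applies a rotary positional encoding to query or key vectors.

    Args:
        x: Tensor of shape (seq_len, d), with d even.
        base: Base of the frequency schedule (same role as 10000 in sinusoidal PE).

    Returns:
        A tensor of the same shape, with each 2-D pair rotated by a position-dependent angle.
    """
    seq_len, d = x.shape
    # One rotation frequency per pair of dimensions, analogous to sinusoidal PE
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2).float() / d))
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)  # (seq_len, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]  # split each vector into consecutive 2-D pairs
    rotated = torch.empty_like(x)
    # Rotate every pair by an angle proportional to its position
    rotated[:, 0::2] = x1 * cos - x2 * sin
    rotated[:, 1::2] = x1 * sin + x2 * cos
    return rotated

# Example usage:
q = torch.rand(5, 16)
print(apply_rope(q).shape)  # torch.Size([5, 16])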

Role of Positional Encoding

It provides the model with a sense of order, allowing it to understand grammatical structures and dependencies that are dependent on word order.

Example for Sinusoidal Positional Encoding (PyTorch)

import math
import torch

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """
    Generates sinusoidal positional encoding.

    Args:
        seq_len: The maximum sequence length.
        d_model: The dimensionality of the model's embeddings.

    Returns:
        A tensor of positional encodings. Shape: (seq_len, d_model)
    """
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)
    # Calculate the division term based on the formula
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

    # Apply sine to even indices and cosine to odd indices
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)

    return pe

# Example usage:
seq_len, d_model = 10, 16
pe = positional_encoding(seq_len, d_model)

print(f"Positional Encoding shape: {pe.shape}")
print("Sample Positional Encoding values:")
print(pe)

Transformer Layers

A standard Transformer block (or layer) typically consists of the following sequential components:

  1. Multi-Head Self-Attention: Processes the input sequence to capture contextual relationships.
  2. Layer Normalization + Residual Connection: Adds the output of the attention layer to the original input (src) and then normalizes the result. This helps in gradient flow and training stability.
  3. Feedforward Neural Network (FFN): A simple two-layer fully connected network applied independently to each position.
  4. Layer Normalization + Residual Connection: Again, adds the output of the FFN to its input and then normalizes.

Feedforward Network Formula

The FFN typically follows a structure like:

$$ \text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2 $$

This involves a linear transformation, a ReLU activation (or another non-linearity), and another linear transformation.
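
Mapped to code, this corresponds to two linear layers with a ReLU in between, applied independently at every position; the small sketch below (with arbitrary, illustrative dimensions) mirrors the linear1/linear2 pair in the encoder-layer example above:

import torch
import torch.nn as nn

class PositionWiseFFN(nn.Module):
    """
    FFN(x) = max(0, x W1 + b1) W2 + b2, applied independently at each position.
    """
    def __init__(self, d_model: int, dim_feedforward: int):
        super().__init__()
        self.linear1 = nn.Linear(d_model, dim_feedforward)  # W1, b1
        self.linear2 = nn.Linear(dim_feedforward, d_model)  # W2, b2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear2(torch.relu(self.linear1(x)))

# Example usage: the same weights are applied to every position in the sequence
ffn = PositionWiseFFN(d_model=16, dim_feedforward=64)
print(ffn(torch.rand(5, 1, 16)).shape)  # torch.Size([5, 1, 16])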

Applications in LLMs

The Transformer architecture, with its self-attention and positional encoding mechanisms, is the backbone of nearly all modern Large Language Models:

  • GPT Series (e.g., GPT-3, GPT-4): Utilize decoder-only Transformer architectures, excelling at generative tasks.
  • BERT: Employs encoder-only Transformer architectures, designed for understanding and encoding text, making it suitable for tasks like classification and question answering.
  • T5 (Text-to-Text Transfer Transformer): Uses a full encoder-decoder architecture, framing all NLP tasks as text-to-text problems.
  • Modern LLMs (e.g., LLaMA, Falcon, Mistral): These are often optimized variants that incorporate improvements like Rotary Positional Encoding (RoPE), Grouped-Query Attention, and other efficiency enhancements.

Summary of Transformer Internals in LLMs

Component            | Purpose
---------------------|----------------------------------------------------------------
Self-Attention       | Understands context by dynamically weighting token importance.
Multi-Head Attention | Captures diverse contextual features in parallel.
Positional Encoding  | Injects sequence order information.
Feedforward Layers   | Adds non-linearity and transforms features position-wise.
LayerNorm + Residual | Stabilizes training and improves information flow.

Conclusion

The Transformer architecture, driven by self-attention and complemented by positional encoding, has fundamentally reshaped natural language processing. These core concepts are essential for anyone looking to understand, fine-tune, interpret, or deploy state-of-the-art Large Language Models.


SEO Keywords

  • Transformer architecture explained
  • What is self-attention in transformers?
  • Multi-head attention in NLP
  • Positional encoding in transformers
  • Transformer vs RNN vs LSTM
  • Attention is all you need summary
  • Transformer architecture for large language models (LLMs)
  • Applications of transformers in GPT and BERT

Interview Questions

  • What is the Transformer architecture, and why was it introduced as an alternative to recurrent models?
  • How does the self-attention mechanism work, and what is its mathematical formulation?
  • Explain the role of Queries (Q), Keys (K), and Values (V) in the self-attention mechanism.
  • What problem does positional encoding solve in Transformer networks?
  • Describe the sinusoidal positional encoding formula and explain its intuition.
  • What is multi-head attention, and why is it beneficial compared to single-head attention?
  • How do Layer Normalization and Residual Connections contribute to the performance and stability of Transformer models?
  • Compare and contrast encoder-only, decoder-only, and encoder-decoder Transformer variants, providing examples of models that use each.
  • What are the differences between learnable positional embeddings and sinusoidal positional encodings?
  • How do the internal components of Transformers, such as attention and feedforward layers, contribute to the emergent capabilities of LLMs?