
Decoder-Only Transformers in Large Language Models

Decoder-only Transformers form the architectural core of many widely used Large Language Models (LLMs), including GPT-style models. This documentation explains their structure, how they process input, and how they generate language in a left-to-right autoregressive manner.

1. Language Modeling Setup: Input and Output

In a standard language modeling task, the model receives a sequence of tokens:

$$ \{x_0, x_1, \dots, x_{m-1}\} $$

At each position $i$, the model aims to predict the next token by outputting a probability distribution over the entire vocabulary ($V$), conditioned on all preceding tokens:

$$ P(\cdot \mid x_0, x_1, \dots, x_{i-1}) $$

This is known as autoregressive generation, where each predicted token is appended to the input context for the subsequent prediction.

During training, the objective is to maximize the log-likelihood of the entire token sequence:

$$ \sum_{i=1}^{m-1} \log P(x_i \mid x_0, x_1, \dots, x_{i-1}) $$

Note: The term $\log P(x_0)$ is typically ignored or set to zero, as $x_0$ is often a special "start" token.
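To make the objective concrete, here is a minimal PyTorch sketch that evaluates the summed log-likelihood for a toy sequence, assuming the model has already produced one row of logits per position; the tensor shapes and vocabulary size are illustrative only.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: m tokens, vocabulary of size |V| = 50
m, vocab_size = 6, 50
tokens = torch.randint(0, vocab_size, (m,))      # x_0, ..., x_{m-1}
logits = torch.randn(m, vocab_size)              # one row of scores per position

# Position i predicts token x_{i+1}: shift logits and targets by one.
log_probs = F.log_softmax(logits[:-1], dim=-1)   # P(. | x_0..x_i) for i = 0..m-2
targets = tokens[1:]                             # x_1, ..., x_{m-1}

# Training objective: sum_i log P(x_i | x_0, ..., x_{i-1})
log_likelihood = log_probs[torch.arange(m - 1), targets].sum()
print(log_likelihood)
```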

2. Embeddings: Token and Positional Information

Each input token $x_i$ is first converted into a $d$-dimensional vector $e_i$ by combining its token embedding with its positional embedding:

$$ e_i = \text{TokenEmbedding}(x_i) + \text{PositionalEmbedding}(i) $$

This process ensures that the model captures both the semantic meaning of the token and its position within the sequence. The sequence of these embedding vectors forms the initial input matrix for the Transformer model:

$$ \{e_0, e_1, \dots, e_{m-1}\} $$
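A minimal sketch of this step in PyTorch, assuming learned positional embeddings (as in GPT-style models) and illustrative sizes for the vocabulary and embedding dimension:

```python
import torch
import torch.nn as nn

vocab_size, d_model, max_len = 50, 16, 128    # illustrative sizes

tok_emb = nn.Embedding(vocab_size, d_model)   # TokenEmbedding
pos_emb = nn.Embedding(max_len, d_model)      # learned PositionalEmbedding

tokens = torch.tensor([3, 17, 8, 42])         # x_0 .. x_3
positions = torch.arange(tokens.size(0))      # 0, 1, 2, 3

# e_i = TokenEmbedding(x_i) + PositionalEmbedding(i)
e = tok_emb(tokens) + pos_emb(positions)      # shape (m, d_model)
print(e.shape)                                # torch.Size([4, 16])
```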

3. Transformer Block Architecture

A decoder-only Transformer is composed of $L$ stacked Transformer blocks. Each block contains two primary sub-layers:

  1. Masked Self-Attention Layer: Allows the model to weigh the importance of different tokens in the input sequence, but with a constraint to only attend to preceding tokens.
  2. Feed-Forward Neural Network (FFN) Layer: A position-wise fully connected network that further processes the representations.

These sub-layers are typically applied with normalization, either pre-norm or post-norm:

  • Post-norm: $$ \text{output} = \text{LayerNorm}(F(\text{input}) + \text{input}) $$
  • Pre-norm: $$ \text{output} = F(\text{LayerNorm}(\text{input})) + \text{input} $$

Here, $F(\cdot)$ represents the core function of the sub-layer (self-attention or FFN), and $\text{LayerNorm}$ is a normalization function used to stabilize training. Each block takes an input matrix of shape $m \times d$ (sequence length $\times$ embedding dimension) and outputs a matrix of the same shape, containing contextualized token representations.
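The following sketch shows one pre-norm block built on PyTorch's `nn.MultiheadAttention`; the GELU activation and FFN hidden size are common but illustrative choices, not requirements of the architecture described above.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """One decoder block: masked self-attention + FFN, each wrapped
    in a pre-norm residual connection (x + F(LayerNorm(x)))."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        m = x.size(1)
        # Causal mask: True marks positions that must NOT be attended to.
        causal = torch.triu(torch.ones(m, m, dtype=torch.bool), diagonal=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal)
        x = x + attn_out                      # residual around attention
        x = x + self.ffn(self.norm2(x))       # residual around FFN
        return x

x = torch.randn(2, 5, 32)                     # (batch, m, d)
print(PreNormBlock(32, 4, 64)(x).shape)       # torch.Size([2, 5, 32])
```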

4. Self-Attention Mechanism with Causal Masking

The self-attention sub-layer, typically multi-head attention, enables the model to attend to different aspects of the input representation. The core attention mechanism is calculated as:

$$ \text{Att}(Q, K, V) = \text{Softmax}\left(\frac{QK^T}{\sqrt{d_k}} + \text{Mask}\right) V $$

Where:

  • $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices derived from the input representation $H$.
  • $d_k$ is the dimension of the keys.
  • Mask is crucial for causality: it prevents the model from attending to future tokens. For each position $i$, the mask ensures that the model can only attend to positions $k$ where $k \le i$ (a minimal implementation follows this list):
    • $\text{Mask}[i, k] = 0$ if $k \le i$ (attend to the current and all previous tokens)
    • $\text{Mask}[i, k] = -\infty$ if $k > i$ (mask out future tokens, driving their attention weights to zero after the Softmax).
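A from-scratch sketch of the masked attention computation for a single head; reusing one matrix `H` for $Q$, $K$, and $V$ is a simplification for illustration (the learned projections are covered in the next subsection).

```python
import math
import torch

def masked_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask."""
    m, d_k = Q.shape[-2], Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)    # (m, m)

    # Mask[i, k] = 0 if k <= i, -inf otherwise.
    mask = torch.full((m, m), float("-inf")).triu(diagonal=1)
    weights = torch.softmax(scores + mask, dim=-1)       # each row sums to 1
    return weights @ V

H = torch.randn(5, 8)          # 5 tokens, d_k = 8 (toy example)
out = masked_attention(H, H, H)
print(out.shape)               # torch.Size([5, 8])
```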

4.1 Multi-Head Self-Attention

To capture a richer set of relationships, the attention mechanism is applied in parallel across multiple "heads." Each head $j$ computes attention independently:

$$ \text{head}_j = \text{Att}(Q_j, K_j, V_j) $$

These $Q_j, K_j, V_j$ matrices are obtained through learned linear projections of the input representation $H$:

$$ Q_j = H W_{q_j} \quad K_j = H W_{k_j} \quad V_j = H W_{v_j} $$

Where $W_{q_j}, W_{k_j}, W_{v_j}$ are learnable weight matrices. Typically, the embedding dimension $d$ is split across $\tau$ heads, so these projection matrices have dimensions $d \times (d/\tau)$.

The outputs of all heads are then concatenated and linearly projected back to the original model dimension:

$$ F(H) = \text{Merge}(\text{head}_1, \dots, \text{head}_\tau) \, W_{\text{head}} $$

Where $W_{\text{head}}$ is another learned weight matrix.
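The sketch below puts the pieces together: per-head projections, scaled dot-product attention with the causal mask, and the final merge through $W_{\text{head}}$. Fusing the per-head projection matrices into single $d \times d$ linear layers is an equivalent, common implementation choice, not something the equations above mandate.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        # W_q, W_k, W_v for all heads fused into single d x d projections.
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_head = nn.Linear(d_model, d_model, bias=False)   # output merge

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        m, d = H.shape

        def split(x):
            # Split the d dimension into (n_heads, d_head) -> (heads, m, d_head)
            return x.view(m, self.n_heads, self.d_head).transpose(0, 1)

        Q, K, V = split(self.W_q(H)), split(self.W_k(H)), split(self.W_v(H))
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_head)   # (heads, m, m)
        mask = torch.full((m, m), float("-inf")).triu(diagonal=1)   # causal mask
        heads = torch.softmax(scores + mask, dim=-1) @ V            # (heads, m, d_head)
        merged = heads.transpose(0, 1).reshape(m, d)                # Merge(head_1 .. head_tau)
        return self.W_head(merged)

H = torch.randn(5, 32)
print(MultiHeadSelfAttention(32, 4)(H).shape)   # torch.Size([5, 32])
```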

5. Output Layer and Vocabulary Distribution

The output of the final ($L$-th) Transformer block is a matrix $H^L \in \mathbb{R}^{m \times d}$. This matrix is then passed through a linear layer followed by a Softmax function to produce a probability distribution over the entire vocabulary for each token position:

$$ \text{Softmax}(H^L W_o) $$

Where $W_o \in \mathbb{R}^{d \times |V|}$ is a learnable weight matrix, and $|V|$ is the vocabulary size. This results in a sequence of probability distributions:

$$ [ P(\cdot \mid x_0), P(\cdot \mid x_0, x_1), \dots, P(\cdot \mid x_0, \dots, x_{m-1}) ] $$
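A minimal sketch of this output step, with illustrative sizes:

```python
import torch
import torch.nn as nn

m, d, vocab_size = 5, 32, 50                 # illustrative sizes
H_L = torch.randn(m, d)                      # output of the final transformer block

W_o = nn.Linear(d, vocab_size, bias=False)   # the d x |V| output projection
probs = torch.softmax(W_o(H_L), dim=-1)      # (m, |V|): one distribution per position

print(probs.shape)         # torch.Size([5, 50])
print(probs.sum(dim=-1))   # each row sums to 1
```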

6. Autoregressive Generation and Inference

During inference (text generation), the model operates as follows:

  1. Start: The model begins with a special start token ($x_0$).
  2. Predict Next Token: It predicts the probability distribution for the next token ($x_1$) conditioned on $x_0$. A token is then selected from this distribution, e.g., greedily via argmax or by sampling.
  3. Append and Repeat: The selected token $x_1$ is appended to the input sequence, forming $\{x_0, x_1\}$. The model then predicts $x_2$ based on this new sequence, and so on.

This loop continues until an end-of-sequence token is generated or a predefined maximum length is reached. This autoregressive process is what allows LLMs to generate fluent and coherent text, one token at a time.
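A minimal greedy-decoding sketch of this loop; the `model` argument is a hypothetical callable standing in for the stacked decoder blocks plus output projection described above.

```python
import torch

def generate(model, start_id: int, eos_id: int, max_len: int = 20):
    """Greedy autoregressive decoding.

    `model` is assumed to map a 1-D tensor of token ids to a
    (seq_len, vocab_size) tensor of logits.
    """
    tokens = torch.tensor([start_id])
    for _ in range(max_len):
        logits = model(tokens)                        # (len, |V|)
        next_id = torch.argmax(logits[-1]).item()     # greedy pick at the last position
        tokens = torch.cat([tokens, torch.tensor([next_id])])
        if next_id == eos_id:                         # stop at end-of-sequence
            break
    return tokens

# Toy stand-in "model": random logits over a 50-token vocabulary.
toy_model = lambda ids: torch.randn(ids.size(0), 50)
print(generate(toy_model, start_id=0, eos_id=1))
```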

7. Key Characteristics of Decoder-Only Transformers

  • Causal Masking: Enforces a strict left-to-right processing order, preventing look-ahead into future tokens and ensuring autoregressive behavior.
  • Stacked Transformer Blocks: The depth provided by multiple stacked blocks allows the model to learn increasingly complex patterns and dependencies within the text.
  • Multi-Head Self-Attention: Captures diverse semantic and syntactic relationships by allowing attention to operate over different representation subspaces simultaneously.
  • "Large" Models: LLMs are termed "large" due to their extensive number of layers (depth) and wide embedding dimensions (width), which contribute to their powerful generative capabilities.

8. Conclusion

Decoder-only Transformers are a fundamental architecture for modern generative Large Language Models. Their design, centered around masked self-attention and feed-forward networks within stacked blocks, enables efficient and coherent left-to-right text generation. This architecture underpins many of the most impactful NLP applications, including chatbots, code generation tools, and creative writing assistants.


SEO Keywords

decoder-only transformer architecture, autoregressive language model generation, transformer self-attention with causal masking, multi-head attention in LLMs, transformer feed-forward network, transformer block with pre-norm, token prediction in GPT models, left-to-right text generation, transformer positional embeddings, vocabulary distribution in language models

Interview Questions

  • What is a decoder-only transformer, and how does it differ from encoder-decoder models?
  • How does causal masking work in a transformer’s self-attention mechanism?
  • What is the role of positional embeddings in decoder-only transformers?
  • Explain the autoregressive token generation process used in GPT models.
  • What is the structure of a transformer block in decoder-only LLMs?
  • How does multi-head attention improve the learning capability of transformers?
  • What is the difference between post-norm and pre-norm in transformer architecture?
  • How are Q, K, and V matrices computed in the attention mechanism?
  • What happens during inference in a decoder-only transformer model?
  • How is the final output vocabulary distribution computed in a language model?