Multi-Head Attention in the Transformer Decoder
The multi-head attention sublayer within the Transformer's decoder is a crucial component responsible for connecting the target sequence (being generated) with the source sequence (encoded by the encoder). This mechanism is commonly referred to as encoder-decoder attention or cross-attention because it facilitates direct interaction between the encoder's output and the decoder's input.
Purpose of Encoder-Decoder Attention
This attention mechanism enables the decoder to look back at the entire source sentence at each step of generating the target sentence. It allows the model to determine which parts of the source sequence are most relevant to the current target token being predicted, thereby improving the quality and accuracy of sequence-to-sequence tasks like machine translation.
Inputs to Encoder-Decoder Attention
The encoder-decoder attention sublayer in the decoder receives two primary inputs:
- R (Encoder Output): This represents the encoded output of the entire source sentence from the encoder. It contains rich contextual information about the source sequence.
- A (Decoder Previous Output): This is the output of the preceding masked multi-head attention sublayer within the decoder. It represents the partially generated target sequence so far.
Using these two inputs, each decoder block computes the attention scores, effectively allowing the model to attend to the relevant parts of the source sequence based on the current state of the target sequence generation.
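As a purely illustrative picture of these two inputs, the short NumPy sketch below assumes a 3-token source sentence and a 512-dimensional model; the names R, A, src_len, tgt_len, and d_model follow the notation above and are not tied to any particular library:

```python
import numpy as np

# Assumed sizes, for illustration only
src_len, tgt_len, d_model = 3, 2, 512   # e.g. source "I am good", 2 target tokens generated so far

# R: encoder output for the entire source sentence
R = np.random.randn(src_len, d_model)

# A: output of the decoder's preceding masked multi-head attention sublayer
A = np.random.randn(tgt_len, d_model)

# Every decoder block receives both R and A as inputs to its encoder-decoder attention sublayer.
```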
Derivation of Query, Key, and Value Matrices
Unlike self-attention, where all Query (Q), Key (K), and Value (V) matrices originate from the same input sequence, encoder-decoder attention uses distinct sources:
- Query (Q): Derived from the decoder's previous output (A). This represents what the decoder is currently looking for in the source sequence.
- Key (K) and Value (V): Derived from the encoder's output (R). The Keys represent identifiers for the information present in the source sequence, and the Values contain the actual contextual representations of the source sequence elements.
This specific design ensures that:
- The Query holds the representation of the target sentence (or the part of it being generated).
- The Key and Value matrices hold the representation of the source sentence.
This setup allows the model to effectively align each token in the target sentence with the most relevant tokens in the source sentence during the generation process.
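A minimal sketch of how Q, K, and V could be derived, assuming learned projection matrices W_Q, W_K, and W_V (random stand-ins for trained parameters here) and R and A tensors shaped as in the earlier sketch:

```python
import numpy as np

src_len, tgt_len, d_model, d_k = 3, 2, 512, 64   # assumed sizes
R = np.random.randn(src_len, d_model)            # encoder output (source sentence)
A = np.random.randn(tgt_len, d_model)            # decoder's masked self-attention output

# Learned projection matrices (randomly initialized stand-ins for trained weights)
W_Q = np.random.randn(d_model, d_k)
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)

Q = A @ W_Q   # Query  <- decoder side, shape (tgt_len, d_k)
K = R @ W_K   # Key    <- encoder side, shape (src_len, d_k)
V = R @ W_V   # Value  <- encoder side, shape (src_len, d_k)
```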
Step-by-Step: The Encoder-Decoder Attention Mechanism
The process of encoder-decoder attention can be broken down into the following steps:
- Compute Dot Product Between Query and Key: The dot product between the Query matrix (Q) from the decoder and the transpose of the Key matrix (Kᵀ) from the encoder is calculated: Q · Kᵀ. This operation computes a similarity score between each element in the target sequence (represented by Q) and every element in the source sequence (represented by K).
  - Example: If the decoder is generating the word "Je" (French), its query vector is compared against the key vectors of all words in the English source sentence (e.g., "I", "am", "good"). This results in a row of similarity scores for "Je" against each English word.
- Scale the Scores: The dot product results are divided by the square root of the dimension of the key vectors, √dₖ. This scaling factor (1/√dₖ) helps stabilize the gradients during training and prevents the dot products from becoming too large, especially with high-dimensional vectors.
  $$ \text{Scaled Scores} = \frac{Q \cdot K^\top}{\sqrt{d_k}} $$
- Apply Softmax: The scaled similarity scores are then passed through a softmax function, which converts them into a probability distribution; each value represents the attention weight assigned to a specific source token for a given target token. The result is the score matrix (S), indicating how much attention each target token should pay to each source token.
  $$ S = \text{softmax}\left(\frac{Q \cdot K^\top}{\sqrt{d_k}}\right) $$
- Multiply Score Matrix by Value Matrix: Finally, the score matrix (S) is multiplied by the Value matrix (V) from the encoder output. This weighted sum of the Value vectors produces the attention output matrix, in which each output vector is a contextual representation that emphasizes the source information deemed most relevant by the attention weights.
  $$ \text{Attention Output} = S \cdot V $$
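Putting the four steps together, here is a minimal single-head sketch in NumPy; the function name cross_attention and the sizes used are assumptions for illustration, not a reference implementation:

```python
import numpy as np

def cross_attention(Q, K, V):
    """Single-head encoder-decoder attention, following the four steps above."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # steps 1-2: similarity, then scaling
    shifted = np.exp(scores - scores.max(axis=-1, keepdims=True))
    S = shifted / shifted.sum(axis=-1, keepdims=True)      # step 3: softmax over source tokens
    return S @ V, S                                        # step 4: weighted sum of Value vectors

# Illustrative run with assumed sizes: 2 target tokens attending over 3 source tokens
tgt_len, src_len, d_k = 2, 3, 64
Q = np.random.randn(tgt_len, d_k)   # from the decoder (A @ W_Q)
K = np.random.randn(src_len, d_k)   # from the encoder (R @ W_K)
V = np.random.randn(src_len, d_k)   # from the encoder (R @ W_V)

output, S = cross_attention(Q, K, V)
print(output.shape)    # (2, 64): one context vector per target token
print(S.sum(axis=-1))  # each row of S sums to 1.0
```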
Example: Attending to Source for a Target Word
Suppose the decoder is generating the French word "Je" and the source sentence is "I am good." The attention mechanism might determine the following weighting for the "Je" output:
- 98% contribution from the value vector of the source word "I".
- 2% contribution from the value vector of the source word "am".
- 0% contribution from the value vector of the source word "good".
This distribution indicates a strong alignment between "Je" in the target sequence and "I" in the source sequence, which is a critical feature for accurate machine translation.
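To make the weighting concrete, here is a tiny numeric sketch using the hypothetical weights above; the value vectors are made up, and only the arithmetic of the weighted sum matters:

```python
import numpy as np

# Hypothetical attention weights for the target token "Je" over the source "I am good"
S_je = np.array([0.98, 0.02, 0.00])

# Made-up 4-dimensional value vectors for "I", "am", and "good"
V = np.array([[1.0, 0.0, 0.0, 0.0],    # "I"
              [0.0, 1.0, 0.0, 0.0],    # "am"
              [0.0, 0.0, 1.0, 0.0]])   # "good"

# The output vector for "Je" is the weighted sum of the value vectors
z_je = S_je @ V
print(z_je)   # [0.98 0.02 0.   0.  ] -> dominated by the representation of "I"
```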
Final Multi-Head Attention Output
The core attention mechanism described above is performed multiple times in parallel with different learned linear projections for Q, K, and V. These parallel computations are known as attention heads.
- Multiple Heads: The input embeddings are linearly projected into different subspaces for each head, and the attention mechanism is applied independently to each head.
- Concatenation: The output vectors from all attention heads are concatenated.
- Final Linear Projection: The concatenated output is then passed through a final linear projection layer with a learned weight matrix (W_O) to produce the final output of the multi-head attention sublayer.
This final output, which encapsulates a richer, multi-faceted representation of the attended source information, is then passed to the next sublayer in the decoder block: the feedforward network.
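A compact sketch of the full multi-head version, with per-head projections, concatenation, and the final W_O projection; all weights are random stand-ins for learned parameters, and the sizes are assumed:

```python
import numpy as np

def multi_head_cross_attention(A, R, num_heads=8, d_model=512):
    """Multi-head encoder-decoder attention sketch: per-head projections,
    scaled dot-product attention, concatenation, and a final W_O projection."""
    d_k = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        W_Q = np.random.randn(d_model, d_k)      # random stand-ins for learned per-head weights
        W_K = np.random.randn(d_model, d_k)
        W_V = np.random.randn(d_model, d_k)
        Q, K, V = A @ W_Q, R @ W_K, R @ W_V
        scores = Q @ K.T / np.sqrt(d_k)
        shifted = np.exp(scores - scores.max(axis=-1, keepdims=True))
        S = shifted / shifted.sum(axis=-1, keepdims=True)
        heads.append(S @ V)                      # one attention output per head
    concat = np.concatenate(heads, axis=-1)      # (tgt_len, d_model)
    W_O = np.random.randn(d_model, d_model)      # final learned projection (random stand-in)
    return concat @ W_O

# Assumed sizes: 3 target tokens, 6 source tokens
A = np.random.randn(3, 512)   # decoder's masked self-attention output
R = np.random.randn(6, 512)   # encoder output
out = multi_head_cross_attention(A, R)
print(out.shape)              # (3, 512): passed on to the feedforward sublayer
```

In practice the per-head projections are typically implemented as a few batched matrix multiplications rather than a Python loop, but the loop above makes the head-by-head structure explicit.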
Importance of Encoder-Decoder Attention
- Source-Target Alignment: It explicitly models the relationships and alignments between tokens in the source and target sequences.
- Contextual Relevance: It allows the decoder to dynamically focus on the most relevant parts of the input sentence for each step of output generation.
- Improved Accuracy: By leveraging relevant source context, it significantly enhances the accuracy and coherence of generated sequences, particularly in tasks like machine translation.
- Handling Long Dependencies: It helps the model capture dependencies between distant words in the source and target sequences more effectively than traditional recurrent models.
Conclusion
The multi-head encoder-decoder attention mechanism is fundamental to the Transformer's success in sequence-to-sequence tasks. By enabling the decoder to intelligently reference and weigh information from the entire source sentence at each generation step, it produces more accurate, contextually aware, and fluent outputs.
The next step in understanding the Transformer decoder involves exploring the feedforward network sublayer, which further processes the output of the attention mechanism to refine the representations.