Self-Supervised Pre-training Tasks for Transformer Architectures

This document explores self-supervised pre-training techniques commonly employed in modern Natural Language Processing (NLP) models, with a specific focus on Transformer-based architectures. These architectures form the bedrock of most state-of-the-art language models.

While self-supervised learning is a vast and dynamic research field, this discussion is intentionally limited to the most prevalent and impactful methods utilized within the Transformer paradigm. We will examine approaches applied to three primary types of Transformer architectures:

  • Decoder-Only Models: These models, exemplified by the GPT series, are primarily utilized for generative tasks. Their architecture is inherently suited for predicting the next token in a sequence, making them ideal for text generation, creative writing, and conversational AI.

  • Encoder-Only Models: Architectures like BERT are a prime example of encoder-only models. They are typically employed for tasks requiring deep understanding and analysis of input text, such as text classification, named entity recognition, and sentiment analysis.

  • Encoder-Decoder Models: Models such as T5 and BART fall into this category. They are designed for sequence-to-sequence tasks, where the input sequence needs to be transformed into an output sequence. Common applications include machine translation, text summarization, and question answering.

Key Pre-training Tasks

The effectiveness of self-supervised learning hinges on carefully designed pre-training tasks that allow models to learn rich representations of language without explicit human annotation. Here, we detail prominent tasks categorized by the Transformer architecture they are most commonly associated with.

Decoder-Only Pre-training

The primary pre-training objective for decoder-only models is Causal Language Modeling (CLM), also known as Next Token Prediction.

Causal Language Modeling (Next Token Prediction)

In this task, the model is trained to predict the next token in a sequence given the preceding tokens. The training objective is to maximize the probability of the correct next token.

Mechanism:

The model processes the input sequence from left to right. At each position, it predicts the probability distribution over the entire vocabulary for the next token, conditioned on all previously seen tokens.
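
As a concrete, framework-agnostic sketch of this constraint, the snippet below builds the lower-triangular causal mask used in decoder self-attention: position $i$ may attend only to positions $j \le i$.

```python
def causal_mask(seq_len):
    """Lower-triangular causal mask: entry [i][j] is True where position i may attend to position j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

# Visualize the mask for a 4-token sequence ("x" = attention allowed, "." = blocked).
for row in causal_mask(4):
    print(" ".join("x" if allowed else "." for allowed in row))
# x . . .
# x x . .
# x x x .
# x x x x
```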

Objective Function:

The model is trained to minimize the negative log-likelihood of the correct next token:

$$ \mathcal{L}(\theta) = -\sum_{i=1}^{N} \log P(x_i | x_1, x_2, \dots, x_{i-1}; \theta) $$

where:

  • $N$ is the length of the sequence.

  • $x_i$ is the $i$-th token in the sequence.

  • $P(\cdot)$ is the probability predicted by the model with parameters $\theta$.

  • $x_1, \dots, x_{i-1}$ are the preceding tokens.

Example:

Given the input sequence "The cat sat on the", the model attempts to predict the next token. The correct next token might be "mat". The model learns to assign a high probability to "mat" based on the preceding words.
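
As a minimal sketch of the objective above, the snippet below sums the negative log-likelihood over a toy sequence. `next_token_probs` is a hypothetical stand-in for a trained decoder; here it simply returns a uniform distribution over a tiny vocabulary.

```python
import math

def next_token_probs(prefix, vocab):
    """Hypothetical stand-in for a decoder: P(next token | prefix) over the vocabulary.
    A toy uniform distribution is returned; a real model would condition on the prefix."""
    return {token: 1.0 / len(vocab) for token in vocab}

def clm_loss(tokens, vocab):
    """Summed negative log-likelihood under next-token prediction (the CLM objective)."""
    loss = 0.0
    for i in range(len(tokens)):
        probs = next_token_probs(tokens[:i], vocab)  # condition on x_1 .. x_{i-1}
        loss -= math.log(probs[tokens[i]])           # -log P(x_i | x_1 .. x_{i-1})
    return loss

vocab = ["The", "cat", "sat", "on", "the", "mat"]
print(clm_loss("The cat sat on the mat".split(), vocab))  # 6 * log(6) ≈ 10.75
```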

Encoder-Only Pre-training

Encoder-only models typically employ tasks that involve reconstructing masked portions of the input or predicting relationships between tokens.

Masked Language Modeling (MLM)

MLM is a cornerstone of encoder-only pre-training. In this task, a certain percentage of input tokens are randomly replaced with a special [MASK] token, and the model is trained to predict the original identity of these masked tokens.

Mechanism:

The Transformer encoder processes the entire input sequence, allowing it to attend to both left and right context for each token. This bidirectional context is crucial for understanding word meanings and relationships.

Objective Function:

The model aims to maximize the probability of correctly predicting the masked tokens:

$$ \mathcal{L}(\theta) = -\sum_{i \in \mathcal{M}} \log P(x_i | x_{\text{corrupted}}; \theta) $$

where:

  • $x_i$ is the original $i$-th token.

  • $x_{\text{corrupted}}$ is the input sequence with some tokens masked.

  • $\mathcal{M}$ is the set of indices of the masked tokens.

Example:

Input: "The [MASK] sat on the mat." Model predicts: "cat" for the [MASK] token.

Next Sentence Prediction (NSP) (Less common now, but historically important)

NSP was an auxiliary task used in models like BERT. The model was given two sentences, A and B, and tasked with predicting whether sentence B was the actual next sentence following sentence A in the original text.

Mechanism:

Two sentences are concatenated with special tokens ([CLS] and [SEP]) in between. The [CLS] token's final representation is used to classify the relationship between the two sentences.

Objective:

Binary classification: IsNext or NotNext.

Example:

  • Input 1: Sentence A: "The cat sat on the mat." Sentence B: "It was a sunny day." -> Label: IsNext

  • Input 2: Sentence A: "The cat sat on the mat." Sentence B: "Dogs are great pets." -> Label: NotNext

Note: While NSP was influential, later research (e.g., RoBERTa) found that simply scaling up MLM with more data and longer sequences often yielded better results, and NSP was sometimes omitted or modified.
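
The sketch below shows one simplified way an NSP training pair could be assembled; `doc_sentences` and `all_sentences` are assumed to be lists of sentence strings, and real BERT pre-processing additionally tokenizes into WordPieces and adds segment embeddings.

```python
import random

def make_nsp_example(doc_sentences, all_sentences, seed=0):
    """Build one NSP example: the packed "[CLS] A [SEP] B [SEP]" string and its label."""
    rng = random.Random(seed)
    i = rng.randrange(len(doc_sentences) - 1)
    sentence_a = doc_sentences[i]
    if rng.random() < 0.5:
        sentence_b, label = doc_sentences[i + 1], "IsNext"        # the true next sentence
    else:
        sentence_b, label = rng.choice(all_sentences), "NotNext"  # a random sentence
        # (a fuller pipeline would also check the random pick is not the true next sentence)
    return f"[CLS] {sentence_a} [SEP] {sentence_b} [SEP]", label

doc = ["The cat sat on the mat.", "It was a sunny day."]
print(make_nsp_example(doc, doc + ["Dogs are great pets."], seed=2))
```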

Encoder-Decoder Pre-training

Encoder-decoder models leverage pre-training tasks that are suitable for sequence-to-sequence transformations.

Denoising Autoencoding

This is a common pre-training objective for encoder-decoder architectures. The model is trained to reconstruct the original input text from a corrupted version. Various corruption strategies can be employed.

Mechanism:

The encoder processes a corrupted input sequence, and the decoder uses the encoder's output to reconstruct the original, uncorrupted sequence.

Common Corruption Strategies (two of them are sketched in code after this list):

  • Token Masking: As in MLM, randomly selected tokens (or, in some variants, contiguous spans) are replaced with [MASK], and the decoder reconstructs the original sequence.

    • Example (BART): Input: "The [MASK] sat on the [MASK]." -> Target: "The cat sat on the mat."

  • Token Deletion: Randomly delete tokens from the input.

    • Example (BART): Input: "The sat on the mat." -> Target: "The cat sat on the mat."

  • Text Infilling: Replace each selected span of text with a single sentinel token; the model must generate the missing spans.

    • Example (T5): Input: "The <extra_id_0> sat on the <extra_id_1>." -> Target: "<extra_id_0> cat <extra_id_1> mat <extra_id_2>"

  • Sentence Permutation: Shuffle the order of sentences in a document.

    • Example (BART): Input: "It was a sunny day. The cat sat on the mat." -> Target: "The cat sat on the mat. It was a sunny day."

  • Document Rotation: Choose a token uniformly at random and rotate the document so that it begins with that token; the model must identify the true start of the document and reconstruct the original order.

    • Example (BART): If the document is "A B C D E", the rotated input could be "C D E A B", and the target is the original "A B C D E".
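
To make two of these strategies concrete, here is a small sketch of token deletion and T5-style text infilling over whitespace-separated tokens; the span indices and sentinel names mirror the text-infilling example above and are illustrative only.

```python
import random

def token_deletion(tokens, delete_prob=0.15, seed=0):
    """BART-style token deletion: drop tokens at random; the target is the original sequence."""
    rng = random.Random(seed)
    return [t for t in tokens if rng.random() >= delete_prob]

def text_infilling(tokens, spans):
    """T5-style infilling: replace each (start, end) span with a sentinel token.

    `spans` must be sorted, non-overlapping (start, end) index pairs with `end` exclusive.
    Returns the corrupted input and the target the decoder must generate.
    """
    inp, tgt, cursor = [], [], 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        inp.extend(tokens[cursor:start])  # keep the text before the span
        inp.append(sentinel)              # the span is replaced by one sentinel
        tgt.append(sentinel)
        tgt.extend(tokens[start:end])     # the target spells out the dropped span
        cursor = end
    inp.extend(tokens[cursor:])
    tgt.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return " ".join(inp), " ".join(tgt)

tokens = "The cat sat on the mat".split()
print(text_infilling(tokens, [(1, 2), (5, 6)]))
# ('The <extra_id_0> sat on the <extra_id_1>', '<extra_id_0> cat <extra_id_1> mat <extra_id_2>')
print(" ".join(token_deletion(tokens, delete_prob=0.3, seed=2)))
```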

Translation Language Modeling (TLM) (For multilingual models)

While not strictly an encoder-decoder specific task, TLM is relevant in pre-training models designed for cross-lingual tasks. It involves concatenating text from two languages and masking tokens in both, requiring the model to predict masked tokens across languages.

Mechanism:

Two parallel sentences (one in language A, one in language B) are concatenated. Tokens are masked in both sentences. The model learns to fill in the masked tokens, leveraging context from both languages.

Example:

Input: [CLS] English sentence [SEP] French sentence [SEP], with tokens masked in both sentences. The model can predict a masked English word using the French context and vice versa.
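
A minimal sketch of how such an example could be assembled is shown below; it assumes pre-tokenized parallel sentences and ignores the language and position embeddings that XLM-style models also add.

```python
import random

MASK = "[MASK]"

def make_tlm_example(src_tokens, tgt_tokens, mask_prob=0.15, seed=0):
    """Concatenate a parallel sentence pair and mask tokens on both sides."""
    rng = random.Random(seed)
    def mask(tokens):
        return [MASK if rng.random() < mask_prob else t for t in tokens]
    return ["[CLS]"] + mask(src_tokens) + ["[SEP]"] + mask(tgt_tokens) + ["[SEP]"]

print(make_tlm_example("The cat sat on the mat".split(),
                       "Le chat est assis sur le tapis".split(), seed=3))
```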

Comparison of Pre-training Tasks

| Task | Primary Architecture | Objective | Key Benefit |
| :--- | :--- | :--- | :--- |
| Causal Language Modeling (CLM) | Decoder-Only | Predict the next token given preceding tokens. | Excellent for text generation and sequence completion. |
| Masked Language Modeling (MLM) | Encoder-Only | Predict masked tokens using bidirectional context. | Captures deep contextual understanding of words and sentences. |
| Denoising Autoencoding (various) | Encoder-Decoder | Reconstruct original text from corrupted versions (deletion, masking, etc.). | Versatile for sequence-to-sequence tasks, robust to noise. |
| Next Sentence Prediction (NSP) | Encoder-Only | Predict if sentence B follows sentence A. | Learns sentence relationships (less emphasis now). |
| Translation Language Modeling (TLM) | Encoder-Decoder (multilingual) | Predict masked tokens across parallel sentences in different languages. | Improves cross-lingual understanding and translation. |

This overview covers the foundational self-supervised pre-training tasks that empower Transformer models to achieve remarkable performance across a wide spectrum of NLP applications. The choice of task is deeply intertwined with the intended use case of the model and its underlying architecture.