Encoder-Only Pre-training: Learning Contextual Representations in NLP
Encoder-only models are a fundamental architecture in Natural Language Understanding (NLU). Unlike decoder-only models, which are designed for text generation, encoder-only models excel at producing rich, contextualized representations of input sequences. These models are instrumental for a wide array of NLU tasks, including text classification, sentence similarity assessment, question answering, and more.
What is an Encoder?
At its core, an encoder, denoted as $ \text{Encoder}_\theta(\cdot) $, is a function that transforms an input sequence of tokens into a sequence of real-valued vectors.
Given an input sequence of tokens: $ x = [x_0, x_1, \dots, x_m] $
The encoder outputs a sequence of contextualized vectors: $ H = [h_0, h_1, \dots, h_m] $
These output vectors, $ h_i $, encapsulate the semantic and syntactic characteristics of each token $ x_i $ within its surrounding context.
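To make this concrete, here is a minimal sketch of obtaining one contextual vector per token, using the Hugging Face transformers library with the bert-base-uncased checkpoint as an illustrative encoder. The model name, sentence, and shapes are examples, not requirements.

```python
# Minimal sketch: obtaining contextual vectors H = [h_0, ..., h_m] from a
# pre-trained encoder via the Hugging Face `transformers` library.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

tokens = tokenizer("The quick brown fox jumps over the lazy dog", return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**tokens)

# One contextual vector h_i per input token (including special tokens).
H = outputs.last_hidden_state   # shape: (1, sequence_length, hidden_size)
print(H.shape)                  # e.g. torch.Size([1, 11, 768])
```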
Training an encoder effectively is non-trivial. Unlike supervised tasks with explicit labels, we lack direct "ground truth" for what the "correct" contextual vector for a given token should be. To overcome this, we employ proxy tasks, also known as pretext objectives. These objectives enable the encoder to learn valuable representations in an unsupervised or self-supervised manner.
Common Encoder-Only Pre-training Architecture
In typical pre-training setups, the encoder is augmented with a lightweight output layer, such as a softmax classifier. This layer provides the necessary supervision for the pre-training task.
Input Tokens: [x₀, x₁, x₂, ..., xₘ]
│
▼
┌────────────────────┐
│ Transformer │
│ Encoder │
└────────────────────┘
│
▼
Contextual Representations: [h₀, h₁, ..., hₘ]
│
▼
Softmax / Output Heads
│
▼
Pre-training Tasks
This architectural pattern is the foundation for influential models like BERT (Bidirectional Encoder Representations from Transformers) and its numerous successors.
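As a rough illustration of this pattern (not the exact BERT implementation), the PyTorch sketch below stacks a Transformer encoder under a lightweight linear head over the vocabulary. The hyperparameters are arbitrary placeholders, and positional embeddings are omitted for brevity.

```python
# Schematic version of the diagram above: Transformer encoder + output head.
import torch
import torch.nn as nn

class EncoderWithHead(nn.Module):
    def __init__(self, vocab_size=30522, d_model=256, nhead=4, num_layers=4):
        super().__init__()
        # Positional embeddings omitted to keep the sketch short.
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, vocab_size)  # softmax applied inside the loss

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))     # contextual representations H
        return self.head(h)                         # per-token logits for pre-training

model = EncoderWithHead()
logits = model(torch.randint(0, 30522, (2, 16)))    # (batch, seq_len, vocab_size)
```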
Encoder Pre-training Objectives
Several pretext tasks are commonly used to pre-train encoder-only models. These tasks are designed to leverage large amounts of unlabeled text data.
1. Masked Language Modeling (MLM)
Masked Language Modeling is the most prevalent pre-training objective for encoder-only models, notably pioneered by BERT.
How It Works:
- Masking: A small percentage (typically 15%) of the input tokens is randomly selected. Each selected token is then handled in one of three ways:
  - 80% of the time, it is replaced with a special [MASK] token.
  - 10% of the time, it is replaced with a random token from the vocabulary.
  - 10% of the time, it remains unchanged (to mitigate the mismatch between pre-training and fine-tuning, where [MASK] tokens are absent).
- Prediction: The encoder is trained to predict the original tokens at the masked positions, utilizing the context provided by the unmasked tokens.
Example:
- Input: "The quick
[MASK]
fox jumps over the lazy[MASK]
." - Target Output: The model aims to predict the original tokens: ["brown", "dog"].
Loss Function:
Given an input sequence $ x = [x_0, \dots, x_m] $ and a set of masked positions $ M $, the MLM loss is defined as:
$ L_{\text{MLM}} = - \sum_{i \in M} \log \Pr(x_i | x_{\text{masked}}, \theta) $
Where $ x_{\text{masked}} $ is the sequence with masked tokens, and $ \theta $ represents the model's parameters.
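In practice, this loss is usually implemented as a cross-entropy restricted to the masked positions. The sketch below reuses the labels tensor produced by the masking sketch above, where -100 marks positions excluded from the loss; averaging rather than summing over masked positions is a common implementation choice.

```python
# L_MLM as cross-entropy over masked positions only.
import torch.nn.functional as F

def mlm_loss(logits, labels):
    # logits: (batch, seq_len, vocab_size); labels: (batch, seq_len)
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,   # only masked positions contribute to the loss
    )
```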
2. Permuted Language Modeling (PLM)
Permuted Language Modeling, popularized by XLNet, addresses some of the inherent limitations of MLM.
How It Works:
- Permutation: Instead of masking tokens, PLM predicts tokens in a random permutation of the original sequence.
- Bidirectional Context: The model predicts each token based on all tokens that appear before it in the permuted order, effectively capturing bidirectional context without relying on artificial [MASK] tokens.
Example:
Consider the sequence $ [x_0, x_1, x_2, x_3] $. A possible permutation could be $ [x_2, x_0, x_3, x_1] $. The model first predicts $ x_2 $ with no preceding context, and then predicts:
- $ x_0 $ based on $ [x_2] $.
- $ x_3 $ based on $ [x_2, x_0] $.
- $ x_1 $ based on $ [x_2, x_0, x_3] $.
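The toy sketch below shows how a sampled permutation constrains what each token may condition on, by building the corresponding attention mask. It deliberately simplifies away XLNet's two-stream attention and is meant only to convey the ordering idea; the function name and layout are assumptions.

```python
# Toy attention mask derived from a sampled permutation (PLM intuition only).
import torch

def permutation_mask(perm):
    # perm[k] = original position whose token is predicted at step k,
    # e.g. [2, 0, 3, 1] for the permutation in the example above.
    n = len(perm)
    step = torch.empty(n, dtype=torch.long)
    step[torch.tensor(perm)] = torch.arange(n)   # prediction step of each position
    # mask[i, j] is True if position i may attend to position j,
    # i.e. x_j is predicted strictly before x_i in the permuted order.
    return step.unsqueeze(1) > step.unsqueeze(0)

print(permutation_mask([2, 0, 3, 1]))
# x_2 (predicted first) attends to nothing; x_1 (predicted last) attends to x_0, x_2, x_3.
```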
Benefits:
- Avoids Masking Discrepancy: It eliminates the artificial corruption introduced by [MASK] tokens during pre-training, leading to better alignment with fine-tuning.
- Improved Contextual Likelihood: It provides a more accurate estimation of the full-context likelihood of the sequence.
3. Pre-training Encoders as Classifiers
Certain pre-training strategies frame the encoder's learning objective as a classification task, often utilizing specific tokens or sentence pairs.
Examples:
- Next Sentence Prediction (NSP): In BERT's original pre-training, pairs of sentences were fed to the model. The task was to predict whether the second sentence was the actual next sentence in the original document or a random sentence. This aimed to help models understand sentence relationships.
- Sentence Order Prediction (SOP): ALBERT refined the NSP task by focusing on predicting the correct order of two consecutive sentences from the same document, which was found to be a more effective signal for understanding discourse coherence.
- Segment Classification: Models can be trained to classify the relationship between two text segments, such as predicting entailment, contradiction, or neutrality.
Classification Objective:
Typically, a special token (e.g., [CLS] in BERT) is prepended to the input sequence. The final hidden state corresponding to this [CLS] token ($ h_{\text{cls}} $) is used as a pooled representation of the entire sequence. This pooled representation is then fed into a classifier to predict the relevant label.
$ L_{\text{cls}} = - \log \Pr(y | h_{\text{cls}}, \theta) $
Where $ y $ is the ground truth label for the classification task (e.g., "Is Next Sentence" = 0 or 1).
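A minimal sketch of such a classification head is shown below. The hidden size and label count are placeholders, and the encoder producing hidden_states (with the [CLS] token at position 0) is assumed to exist elsewhere.

```python
# Sketch of a [CLS]-based classification head on top of encoder outputs.
import torch.nn as nn
import torch.nn.functional as F

class ClsHead(nn.Module):
    def __init__(self, hidden_size=768, num_labels=2):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states, labels=None):
        # hidden_states: (batch, seq_len, hidden_size); position 0 is [CLS].
        h_cls = hidden_states[:, 0]            # pooled representation h_cls
        logits = self.classifier(h_cls)
        if labels is None:
            return logits
        # L_cls = -log Pr(y | h_cls, theta), averaged over the batch.
        return F.cross_entropy(logits, labels)
```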
Summary Table
| Pre-training Task | Description | Model Examples |
|---|---|---|
| Masked Language Modeling (MLM) | Predict randomly masked tokens in the input sequence. | BERT, RoBERTa |
| Permuted Language Modeling (PLM) | Predict tokens in a random permutation of the sequence, avoiding masking. | XLNet |
| Classification Objectives | Pre-train on tasks like next sentence prediction or sentence ordering. | BERT (NSP), ALBERT (SOP) |
Conclusion
Encoder-only pre-training represents a highly effective strategy for acquiring deep, contextualized text embeddings. By employing self-supervised objectives such as Masked Language Modeling and various sentence classification tasks, models like BERT and RoBERTa have achieved state-of-the-art performance across a broad spectrum of downstream Natural Language Processing (NLP) tasks.
Interview Questions:
- What is an encoder-only model in NLP, and how does it differ from decoder-only models? Encoder-only models process input sequences to generate contextual representations, ideal for NLU tasks like classification and QA. Decoder-only models generate sequences, suitable for tasks like text generation.
- Explain the Masked Language Modeling (MLM) objective used in BERT. MLM involves masking a percentage of input tokens and training the model to predict the original tokens based on the surrounding context.
- How does Permuted Language Modeling (PLM) in XLNet improve over MLM? PLM predicts tokens based on a permuted sequence, avoiding the artificial [MASK] tokens and offering a more natural bidirectional context learning mechanism.
- What is the role of the [CLS] token in encoder-only models like BERT? The [CLS] token's final hidden state is typically used as a pooled representation of the entire input sequence for classification tasks.
- Describe the Next Sentence Prediction (NSP) task and its purpose in BERT pre-training. NSP trained BERT to determine if two sentences were consecutive in the original text, aiming to improve its understanding of sentence relationships.
- What are the benefits of using sentence classification tasks like Sentence Order Prediction (SOP)? SOP helps models learn discourse coherence and sentence relationships by predicting the correct ordering of adjacent sentences, a signal found to be more robust than NSP.
- How do encoder-only models generate contextualized token embeddings? Through their attention mechanisms and multi-layer processing, encoders capture dependencies between tokens, allowing each output vector to reflect the token's meaning within its specific context.
- What are the main components of a Transformer encoder? A Transformer encoder typically consists of multi-head self-attention mechanisms and position-wise feed-forward networks, with residual connections and layer normalization.
- Why is self-supervised learning important for encoder pre-training? Self-supervised learning allows models to learn rich representations from vast amounts of unlabeled text, which is abundant and significantly reduces the reliance on expensive labeled datasets.
- Can you compare MLM and PLM in terms of their training mechanisms and advantages? MLM masks tokens and predicts them, potentially creating a pre-train/fine-tune discrepancy. PLM permutes sequences and predicts tokens in the permuted order, avoiding masking issues and offering a more comprehensive contextual signal.