The BERT Model: A Deep Dive into Bidirectional Encoder Representations from Transformers

The BERT (Bidirectional Encoder Representations from Transformers) model, introduced by Devlin et al. in 2018 and published at NAACL 2019, has become a cornerstone architecture in Natural Language Processing (NLP). It is built entirely upon the Transformer encoder and is pre-trained with self-supervised learning.

Core Pre-training Tasks

BERT is pre-trained using two primary self-supervised objectives:

  1. Masked Language Modeling (MLM)
  2. Next Sentence Prediction (NSP)

The total loss for BERT's pre-training is the sum of the losses from these two tasks:

$$ \text{Loss}_{\text{BERT}} = \text{Loss}_{\text{MLM}} + \text{Loss}_{\text{NSP}} $$

1. Masked Language Modeling (MLM)

Objective: MLM enables the model to learn deep contextual representations of words by predicting missing words within a sentence.

Process:

  • Approximately 15% of the tokens in each input sequence are randomly selected for masking.
  • These selected tokens are then modified as follows:
    • 80% are replaced with the [MASK] token.
    • 10% are replaced with a random token from the vocabulary.
    • 10% are left unchanged.

This strategy prevents the model from solely relying on the [MASK] token and encourages it to learn robust contextual dependencies.

Example:

  • Original: [CLS] It is raining . [SEP] I need an umbrella . [SEP]
  • Masked: [CLS] It is [MASK] . [SEP] I need [MASK] umbrella . [SEP]
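
As a minimal sketch (not BERT's actual implementation), the 80%/10%/10% procedure can be written roughly as follows; the token list and the toy vocabulary are purely illustrative:

import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Apply BERT-style masking to a list of token strings (illustrative sketch)."""
    masked = list(tokens)
    labels = [None] * len(tokens)          # prediction targets at masked positions only
    for i, token in enumerate(tokens):
        if token in ("[CLS]", "[SEP]"):    # never mask special tokens
            continue
        if random.random() < mask_prob:    # select roughly 15% of tokens
            labels[i] = token              # the model must predict the original token
            r = random.random()
            if r < 0.8:                    # 80%: replace with [MASK]
                masked[i] = "[MASK]"
            elif r < 0.9:                  # 10%: replace with a random vocabulary token
                masked[i] = random.choice(vocab)
            # remaining 10%: keep the original token unchanged
    return masked, labels

tokens = "[CLS] It is raining . [SEP] I need an umbrella . [SEP]".split()
vocab = ["sunny", "apple", "run", "blue"]   # toy vocabulary for illustration
masked, labels = mask_tokens(tokens, vocab)
print(masked)
print(labels)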

MLM Loss Function: For an input token sequence $x$ with a modified version $\bar{x}$ and a set of masked positions $A(x)$, the MLM loss is calculated as:

$$ \text{Loss}_{\text{MLM}} = -\sum_{i \in A(x)} \log P(x_i | \bar{x}) $$
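
In practice this sum is an ordinary cross-entropy restricted to the masked positions. A minimal PyTorch sketch, using the common convention of marking unmasked positions with the ignore index -100 (the tensor values below are illustrative):

import torch
import torch.nn.functional as F

vocab_size, seq_len = 10, 6
logits = torch.randn(1, seq_len, vocab_size)              # stand-in for the model's token predictions
labels = torch.tensor([[-100, -100, 7, -100, 2, -100]])   # original token ids at masked positions, -100 elsewhere

# Cross-entropy over the masked positions only; positions labeled -100 are ignored,
# which corresponds to the sum over i in A(x) in the formula above.
loss_mlm = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100)
print(loss_mlm.item())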

Example Program (using transformers library):

from transformers import pipeline

# Load the fill-mask pipeline with an explicit BERT checkpoint so that the
# [MASK] token in the example sentence matches the model's mask token
mlm_pipeline = pipeline("fill-mask", model="bert-base-uncased")

# Example sentence with a masked word
sentence = "I want to buy a new [MASK]."

# Perform masked language modeling
results = mlm_pipeline(sentence)

# Output the predictions
for result in results:
    print(f"Prediction: {result['sequence']}, Score: {result['score']:.4f}")

2. Next Sentence Prediction (NSP)

Objective: NSP trains the model to understand the relationships between sentences, which is crucial for downstream tasks like Question Answering and Natural Language Inference.

Process:

  • Each input consists of two segments, SentA and SentB, separated by [SEP] tokens.
  • For pre-training:
    • 50% of the time, SentB is the actual sentence that follows SentA (label: IsNext).
    • 50% of the time, SentB is a random sentence from the corpus (label: NotNext).

Example:

  • [CLS] It is raining . [SEP] I need an umbrella . [SEP] $\rightarrow$ IsNext
  • [CLS] The cat sleeps . [SEP] Apples grow on trees . [SEP] $\rightarrow$ NotNext
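
A minimal sketch of how such 50/50 training pairs might be constructed from a toy corpus (the documents, sentences, and helper function are illustrative, not BERT's original data pipeline):

import random

corpus = [
    ["It is raining .", "I need an umbrella .", "The bus was late ."],
    ["The cat sleeps .", "It dreams of fish .", "Apples grow on trees ."],
]  # each inner list is one document, in sentence order

def make_nsp_pair(corpus):
    doc = random.choice(corpus)
    idx = random.randrange(len(doc) - 1)
    sent_a = doc[idx]
    if random.random() < 0.5:                       # 50%: the true next sentence
        sent_b, label = doc[idx + 1], "IsNext"
    else:                                           # 50%: a random sentence from another document
        other = random.choice([d for d in corpus if d is not doc])
        sent_b, label = random.choice(other), "NotNext"
    return f"[CLS] {sent_a} [SEP] {sent_b} [SEP]", label

print(make_nsp_pair(corpus))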

NSP Loss Function: Using the output vector $h_{\text{CLS}}$ corresponding to the [CLS] token, the NSP loss is:

$$ \text{Loss}_{\text{NSP}} = -\log P(c_{\text{gold}} | h_{\text{CLS}}) $$ where $c_{\text{gold}}$ is the true label (IsNext or NotNext).
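
Conceptually, the NSP head is a small two-way classifier on top of $h_{\text{CLS}}$. A minimal sketch with illustrative dimensions and randomly initialized weights:

import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_size = 768
nsp_head = nn.Linear(hidden_size, 2)        # maps h_CLS to two logits: IsNext / NotNext

h_cls = torch.randn(1, hidden_size)         # stand-in for the [CLS] output vector
logits = nsp_head(h_cls)
gold = torch.tensor([0])                    # 0 = IsNext, 1 = NotNext (illustrative convention)

loss_nsp = F.cross_entropy(logits, gold)    # equals -log P(c_gold | h_CLS)
print(loss_nsp.item())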

Example Program (using transformers library):

from transformers import BertTokenizer, BertForNextSentencePrediction
import torch

# Load pre-trained model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')

# Provide two sentences
sentence_a = "I went to the market today."
sentence_b = "I bought some fresh vegetables."

# Encode the pair of sentences
# The tokenizer adds the [CLS] and [SEP] tokens automatically
encoding = tokenizer.encode_plus(sentence_a, sentence_b, return_tensors='pt')

# Get model predictions (labels are not needed for inference)
with torch.no_grad():
    outputs = model(**encoding)
logits = outputs.logits

# Convert logits to probabilities
probs = torch.softmax(logits, dim=1)

print(f"Is the second sentence likely to follow the first? (Next Sentence probability): {probs[0][0].item():.4f}")
print(f"Is the second sentence random? (Not Next Sentence probability): {probs[0][1].item():.4f}")

BERT Model Architecture

BERT is fundamentally based on the Transformer encoder architecture. Each input token is enriched with three types of embeddings:

$$ \mathbf{e} = \mathbf{x} + \mathbf{e}_{\text{pos}} + \mathbf{e}_{\text{seg}} $$

Where:

  • $\mathbf{x}$: Token embedding (learned from the vocabulary).
  • $\mathbf{e}_{\text{pos}}$: Positional embedding (captures the order of tokens).
  • $\mathbf{e}_{\text{seg}}$: Segment embedding (distinguishes between SentA and SentB in paired inputs).
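
A minimal sketch of this embedding sum using PyTorch embedding tables; the sizes correspond to BERT-base, while the token ids are illustrative:

import torch
import torch.nn as nn

vocab_size, max_len, num_segments, d_e = 30522, 512, 2, 768

tok_emb = nn.Embedding(vocab_size, d_e)      # x: token embeddings
pos_emb = nn.Embedding(max_len, d_e)         # e_pos: learned positional embeddings
seg_emb = nn.Embedding(num_segments, d_e)    # e_seg: segment (SentA / SentB) embeddings

token_ids   = torch.tensor([[101, 2009, 2003, 103, 1012, 102]])  # illustrative input ids
positions   = torch.arange(token_ids.size(1)).unsqueeze(0)
segment_ids = torch.zeros_like(token_ids)                        # all tokens belong to SentA here

e = tok_emb(token_ids) + pos_emb(positions) + seg_emb(segment_ids)
print(e.shape)   # (batch, sequence length, d_e)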

Transformer Layers

Each Transformer encoder layer is composed of two main sub-layers:

  1. Multi-Head Self-Attention: Allows the model to weigh the importance of different words in the input sequence when processing a given word.
  2. Position-wise Feed-Forward Network (FFN): A fully connected feed-forward network applied independently to each position.

These sub-layers are augmented with:

  • Residual Connections: Help in training deep networks by facilitating gradient flow.
  • Layer Normalization: Stabilizes training by normalizing the inputs to each sub-layer.
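
A condensed sketch of one such encoder layer built from standard PyTorch modules (post-layer-norm arrangement; dropout placement and other details of BERT's actual implementation are omitted):

import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_e=768, n_head=12, d_ffn=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_e, n_head, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_e, d_ffn), nn.GELU(), nn.Linear(d_ffn, d_e))
        self.norm1 = nn.LayerNorm(d_e)
        self.norm2 = nn.LayerNorm(d_e)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)        # multi-head self-attention
        x = self.norm1(x + attn_out)            # residual connection + layer normalization
        x = self.norm2(x + self.ffn(x))         # position-wise FFN + residual + layer normalization
        return x

layer = EncoderLayer()
h = layer(torch.randn(1, 6, 768))               # (batch, sequence length, d_e)
print(h.shape)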

Output

At each layer, the model outputs contextualized vector representations for each input token. For classification tasks, the output vector of the special [CLS] token is typically used as the aggregate representation of the entire sequence.
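
For example, with the transformers library the [CLS] vector can be read from the first position of the final hidden states (the sentence is illustrative):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

inputs = tokenizer("It is raining.", return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

cls_vector = outputs.last_hidden_state[:, 0, :]   # [CLS] is always the first token
print(cls_vector.shape)                           # (1, 768) for BERT-base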

Key Hyperparameters and Model Variants

Several hyperparameters define the size and capabilities of BERT models:

| Parameter | Description |
| --- | --- |
| $V$ | Vocabulary size |
| $d_e$ | Token embedding size (also the output dimension of each Transformer layer) |
| $n_{head}$ | Number of self-attention heads |
| $d_{ffn}$ | Hidden size of the Feed-Forward Network (typically $4 \times d_e$) |
| $L$ | Number of Transformer encoder layers (depth of the model) |

Common BERT Configurations:

| Model | Hidden Size ($d_e$) | Layers ($L$) | Attention Heads ($n_{head}$) | Parameters |
| --- | --- | --- | --- | --- |
| BERT-base | 768 | 12 | 12 | 110 Million |
| BERT-large | 1024 | 24 | 16 | 340 Million |
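
These settings map directly onto the transformers BertConfig. The sketch below builds a randomly initialized BERT-base-sized encoder (no pre-trained weights are downloaded) and counts its parameters:

from transformers import BertConfig, BertModel

# BERT-base hyperparameters
config = BertConfig(
    vocab_size=30522,          # V
    hidden_size=768,           # d_e
    num_hidden_layers=12,      # L
    num_attention_heads=12,    # n_head
    intermediate_size=3072,    # d_ffn = 4 * d_e
)

model = BertModel(config)      # randomly initialized encoder with BERT-base dimensions
print(f"{model.num_parameters():,} parameters")   # roughly 110 million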

Training Strategy

The pre-training of BERT employs the following key strategies:

  • Mini-batch Training: Utilizes stochastic gradient descent or its variants for optimization.
  • Progressive Sequence Lengths: Most pre-training steps use shorter (128-token) sequences, with the remaining steps run on full-length (512-token) sequences to reduce compute cost.
  • Large-Scale Unlabeled Data: Leverages vast amounts of unlabeled text data (e.g., BooksCorpus, Wikipedia) for pre-training, enabling it to learn general language representations.

This comprehensive pre-training allows BERT to act as a versatile language representation model that can be effectively fine-tuned for a wide array of specific NLP tasks, such as sentiment analysis, text classification, question answering, and named entity recognition.
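
As a minimal fine-tuning sketch, a task-specific classification head can be attached to the pre-trained encoder; the newly initialized head below would still need training on labeled data before its outputs become meaningful:

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

inputs = tokenizer("This movie was wonderful!", return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits          # (1, 2) class logits built on the [CLS] representation

print(torch.softmax(logits, dim=1))          # untrained head, so these probabilities are not yet meaningful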

Conclusion

BERT marked a significant advancement in NLP by utilizing self-supervised learning to achieve a deep, bidirectional understanding of language context. Through its combination of Masked Language Modeling and Next Sentence Prediction, powered by the scalable Transformer architecture, BERT has established a new benchmark for pre-trained language models.

While BERT has driven substantial progress, ongoing research continues to address challenges such as efficient training with limited data and enhancing reasoning capabilities. Nevertheless, the BERT framework provides a robust foundation for developing sophisticated AI-driven NLP systems.

Interview Questions

  • What are the two main pre-training objectives used in BERT?
  • How does Masked Language Modeling (MLM) work in BERT, and why is it important?
  • Explain the process and purpose of Next Sentence Prediction (NSP) in BERT.
  • What are the components of BERT’s input embeddings?
  • Describe the architecture of a single Transformer encoder layer in BERT.
  • What role does the [CLS] token play in BERT-based models?
  • Compare BERT-base and BERT-large in terms of architecture and performance.
  • What loss functions are used in BERT’s pre-training phase?
  • How does BERT’s training strategy enable it to generalize across NLP tasks?
  • What are some limitations of BERT, and what research areas remain open?