Masked Language Modeling (MLM)
Masked Language Modeling (MLM) is a fundamental pre-training task that enables Bidirectional Encoder Representations from Transformers (BERT) to achieve a deep understanding of language context and semantics. Unlike traditional left-to-right language models, BERT conditions on context from both the left and the right of each token simultaneously, which significantly enhances its ability to predict masked words within a sentence.
What is Masked Language Modeling?
In MLM, BERT is trained by randomly masking a portion of the input words and learning to predict these hidden words using the context provided by the unmasked words surrounding them. This process allows the model to develop a nuanced understanding of language structure, word dependencies, and contextual relationships.
Key Characteristics:
- Bidirectional Context: BERT considers words appearing both before and after the masked word.
- Prediction Task: The model aims to predict the original identity of the masked words.
- Contextual Understanding: By filling in the blanks, BERT learns rich, contextualized word representations.
Example: How Masked Language Modeling Works
Let's illustrate the MLM process with an example. Consider the following sentences:
- "Paris is a beautiful city."
- "I love Paris."
Step 1: Tokenization
The sentences are first broken down into individual tokens:
tokens = ["Paris", "is", "a", "beautiful", "city", "I", "love", "Paris"]
Step 2: Add Special Tokens
Special tokens are added to signify the start and end of sentences and segments:
- [CLS] (Classification Token): Added at the beginning of the first sentence.
- [SEP] (Separator Token): Added at the end of each sentence and between sentences if multiple are used.
tokens = ["[CLS]", "Paris", "is", "a", "beautiful", "city", "[SEP]", "I", "love", "Paris", "[SEP]"]
Step 3: Mask a Percentage of Tokens
Typically, 15% of the tokens are randomly selected for masking. For instance, the word "city" is masked:
tokens = ["[CLS]", "Paris", "is", "a", "beautiful", "[MASK]", "[SEP]", "I", "love", "Paris", "[SEP]"]
Now, BERT's task is to predict the original word, "city", leveraging the surrounding unmasked tokens ("beautiful", "is", "a", "Paris", "I", "love", "Paris").
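A minimal sketch of this selection step is shown below; it randomly masks roughly 15% of the non-special tokens (the full 80-10-10 replacement rule is covered in the next section):
import random

tokens = ["[CLS]", "Paris", "is", "a", "beautiful", "city", "[SEP]", "I", "love", "Paris", "[SEP]"]

# Special tokens are never masked; only real word positions are candidates.
candidates = [i for i, t in enumerate(tokens) if t not in ("[CLS]", "[SEP]")]
num_to_mask = max(1, round(0.15 * len(candidates)))

for i in random.sample(candidates, num_to_mask):
    tokens[i] = "[MASK]"

print(tokens)  # e.g. ['[CLS]', 'Paris', 'is', 'a', 'beautiful', '[MASK]', '[SEP]', 'I', 'love', 'Paris', '[SEP]']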
Addressing the Pre-training vs. Fine-tuning Discrepancy
A crucial aspect of BERT's MLM implementation is managing the difference between pre-training and fine-tuning.
- Pre-training: During pre-training, the [MASK] token is explicitly used to signal what the model needs to predict.
- Fine-tuning: In downstream tasks (like sentiment analysis or question answering), the input sequences do not contain [MASK] tokens.
This mismatch can potentially lead to a performance drop. To mitigate this, BERT employs an 80-10-10 masking strategy for the selected 15% of tokens:
- 80% of the time: The selected token is replaced with [MASK].
tokens = ["[CLS]", "Paris", "is", "a", "beautiful", "[MASK]", "[SEP]", "I", "love", "Paris", "[SEP]"]
- 10% of the time: The selected token is replaced with a random word from the vocabulary.
tokens = ["[CLS]", "Paris", "is", "a", "beautiful", "love", "[SEP]", "I", "love", "Paris", "[SEP]"]
- 10% of the time: The selected token is left unchanged.
tokens = ["[CLS]", "Paris", "is", "a", "beautiful", "city", "[SEP]", "I", "love", "Paris", "[SEP]"]
This strategy ensures that the model learns to predict masked words (the 80% case), while also learning representations for tokens that are randomly substituted or left unchanged, making it more robust and better adapted to fine-tuning, where no [MASK] tokens appear.
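One way this rule could be implemented is sketched below. The function name, toy vocabulary, and label convention are illustrative assumptions rather than BERT's actual pre-training code:
import random

def apply_bert_style_masking(tokens, vocab, mask_prob=0.15):
    # Returns the corrupted token list plus labels holding the original token
    # at every selected position (None elsewhere).
    tokens = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if token in ("[CLS]", "[SEP]"):
            continue                      # never corrupt special tokens
        if random.random() < mask_prob:   # select roughly 15% of tokens
            labels[i] = token             # the model must recover this token
            roll = random.random()
            if roll < 0.8:
                tokens[i] = "[MASK]"                  # 80%: replace with [MASK]
            elif roll < 0.9:
                tokens[i] = random.choice(vocab)      # 10%: random vocabulary word
            # remaining 10%: leave the token unchanged
    return tokens, labels

vocab = ["Paris", "is", "a", "beautiful", "city", "I", "love"]  # toy vocabulary
masked, labels = apply_bert_style_masking(
    ["[CLS]", "Paris", "is", "a", "beautiful", "city", "[SEP]", "I", "love", "Paris", "[SEP]"], vocab)
print(masked)
print(labels)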
Embeddings and Model Architecture
After tokenization and masking, the input undergoes several embedding transformations before being fed into the BERT architecture:
- Token Embeddings: Representations of individual tokens.
- Segment Embeddings: Distinguish between different sentences or segments in the input.
- Position Embeddings: Encode the positional information of each token in the sequence.
These three embeddings are summed to create a single input embedding for each token, which is then processed by BERT's transformer encoder layers.
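A minimal PyTorch sketch of this summation is shown below; apart from the 768-dimensional hidden size and the 30,522-token vocabulary of BERT-Base (uncased), the tensor contents are illustrative assumptions:
import torch
import torch.nn as nn

vocab_size, max_len, num_segments, hidden = 30522, 512, 2, 768
token_emb = nn.Embedding(vocab_size, hidden)
segment_emb = nn.Embedding(num_segments, hidden)   # sentence A vs. sentence B
position_emb = nn.Embedding(max_len, hidden)

input_ids = torch.tensor([[101, 3000, 2003, 103, 102]])   # illustrative token ids
segment_ids = torch.zeros_like(input_ids)                  # all tokens from sentence A
positions = torch.arange(input_ids.size(1)).unsqueeze(0)   # 0, 1, 2, ...

# The three embeddings are summed into a single input vector per token.
embeddings = token_emb(input_ids) + segment_emb(segment_ids) + position_emb(positions)
print(embeddings.shape)  # torch.Size([1, 5, 768])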
BERT-Base Configuration Example:
- Encoder Layers: 12
- Attention Heads: 12
- Hidden Units: 768
Each token's representation becomes a 768-dimensional vector after passing through the BERT model.
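These hyperparameters match the default BERT-Base configuration exposed by the Hugging Face transformers library (shown here as a quick reference sketch):
from transformers import BertConfig

# Default values already correspond to BERT-Base; they are passed explicitly for clarity.
config = BertConfig(num_hidden_layers=12, num_attention_heads=12, hidden_size=768)
print(config.num_hidden_layers, config.num_attention_heads, config.hidden_size)  # 12 12 768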
Predicting the Masked Word
The process of predicting the masked word involves the following steps:
- Extract [MASK] Representation: The final hidden state (vector representation) for the token at the [MASK] position is extracted from BERT's output.
- Feedforward Network with Softmax: This extracted representation is passed through a feedforward neural network, and a softmax layer is applied to the network's output.
- Probability Distribution: The softmax layer generates a probability distribution over the entire vocabulary, indicating the likelihood of each word being the original masked word.
- Prediction: The word with the highest probability is selected as the predicted token.
Initially, these predictions might be inaccurate. However, through backpropagation during training, the model learns to adjust its weights, gradually improving the accuracy of its predictions.
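As a concrete illustration, the snippet below runs a pre-trained masked language model over the example sentence; it assumes PyTorch and the Hugging Face transformers library, which the article does not prescribe:
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

inputs = tokenizer("Paris is a beautiful [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                            # shape: [batch, seq_len, vocab_size]

# Locate the [MASK] position and turn its logits into a vocabulary distribution.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
probs = logits[0, mask_pos].softmax(dim=-1)
predicted_id = probs.argmax(dim=-1)
print(tokenizer.decode(predicted_id))                          # highest-probability word, e.g. "city"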
Additional Note: Whole Word Masking (WWM)
A variation known as Whole Word Masking (WWM) offers enhanced contextual understanding. In standard MLM, subword tokens (e.g., the WordPiece pieces "un", "##break", "##able") can be masked independently. WWM instead masks all subword tokens belonging to a single word together, so the model learns to predict based on a more holistic understanding of the entire word's meaning and context.
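For instance, assuming the word "unbreakable" is split into the WordPiece pieces "un", "##break", "##able", the two schemes differ as follows (a purely illustrative comparison):
# Standard MLM: subword pieces can be selected and masked independently.
standard_mlm = ["it", "is", "un", "[MASK]", "##able"]
# Whole Word Masking: every piece of the selected word is masked together.
whole_word_masking = ["it", "is", "[MASK]", "[MASK]", "[MASK]"]
print(standard_mlm, whole_word_masking)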
Conclusion
Masked Language Modeling is a pivotal pre-training technique that empowers BERT to achieve a profound understanding of natural language. By learning to predict masked tokens using comprehensive bidirectional context, BERT generates rich, contextualized word representations that are highly transferable to a wide array of downstream Natural Language Processing (NLP) tasks.
SEO Keywords for Masked Language Modeling (MLM)
- Masked Language Modeling BERT
- BERT MLM pretraining
- Bidirectional language model BERT
- 80-10-10 masking strategy BERT
- Predict masked tokens BERT
- Whole Word Masking BERT
- BERT token embedding MLM
- BERT natural language understanding
Interview Questions on Masked Language Modeling (MLM)
- What is Masked Language Modeling (MLM) in BERT?
- How does MLM enable bidirectional understanding in BERT?
- Explain the 80-10-10 masking strategy used in MLM pretraining.
- Why is the [MASK] token important during BERT’s pretraining but not used during fine-tuning?
- How are token, segment, and position embeddings used together in MLM?
- What is the significance of masking 15% of the tokens during MLM?
- How does Whole Word Masking differ from standard MLM?
- What are the advantages of bidirectional context in MLM compared to auto-regressive models?
- How does BERT predict the original word for a masked token?
- Can you describe the potential mismatch between pretraining with MLM and downstream fine-tuning tasks?