WordPiece Tokenization: BERT's Subword Algorithm Explained
Unlock the power of WordPiece tokenization! Understand this key subword algorithm used in BERT and LLMs, its differences from BPE, and how it drives NLP.
WordPiece Tokenization: A Comprehensive Guide
WordPiece is a widely adopted subword tokenization algorithm, originally developed by Google for neural network-based language models such as BERT. It shares many similarities with Byte Pair Encoding (BPE) but introduces a key distinction that enhances its effectiveness in specific contexts.
This guide will cover:
- What WordPiece tokenization is
- How it differs from Byte Pair Encoding (BPE)
- The step-by-step process of building a WordPiece vocabulary
- How WordPiece tokenizes unseen words
What Is WordPiece?
WordPiece is a data-driven algorithm used to tokenize text into subword units. Similar to BPE, it breaks words into smaller units to effectively handle out-of-vocabulary (OOV) words. The fundamental difference lies in the criteria used for selecting symbol pairs during vocabulary construction.
WordPiece vs. Byte Pair Encoding (BPE)
| Feature | Byte Pair Encoding (BPE) | WordPiece |
|---|---|---|
| Merging Criterion | Merges symbol pairs based on frequency (the most frequent adjacent character pairs are combined). | Merges symbol pairs based on likelihood, derived from a language model trained on the dataset. |
This probabilistic approach allows WordPiece to learn more meaningful subword units, leading to improved generalization and language understanding in NLP models.
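To make the distinction concrete, the sketch below contrasts the two selection rules on made-up pair and symbol counts. The WordPiece rule shown uses the commonly cited approximation count(ab) / (count(a) × count(b)), which rewards pairs whose parts rarely appear apart, rather than a full language-model likelihood; the numbers themselves are hypothetical.

```python
# Hypothetical example statistics for adjacent symbol pairs and single symbols.
pair_counts = {("s", "t"): 4, ("e", "s"): 3, ("m", "e"): 2}
symbol_counts = {"s": 6, "t": 4, "e": 5, "m": 2}

# BPE: pick the most frequent adjacent pair.
bpe_pick = max(pair_counts, key=pair_counts.get)

# WordPiece (approximation): pick the pair maximizing
# count(ab) / (count(a) * count(b)), i.e. the merge that most
# increases the data likelihood under a simple unigram model.
def wordpiece_score(pair):
    a, b = pair
    return pair_counts[pair] / (symbol_counts[a] * symbol_counts[b])

wp_pick = max(pair_counts, key=wordpiece_score)

print("BPE would merge:", bpe_pick)        # ('s', 't') -- highest raw count
print("WordPiece would merge:", wp_pick)   # ('m', 'e') -- highest likelihood gain
```

With these counts, plain frequency favors merging `s` and `t`, while the likelihood-based score prefers `m` and `e` because `m` almost never occurs outside that pair.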
Step-by-Step: How WordPiece Tokenization Works
Let's illustrate the process with a clear example.
Example Dataset
Consider the following word counts from a small dataset:
(cost, 2)
(best, 2)
(menu, 1)
(men, 1)
(camel, 1)
Step 1: Define the Vocabulary Size
Let's assume our target vocabulary size is 14 tokens.
Step 2: Convert Words into Character Sequences
Initially, each word is split into its constituent characters:
cost → c o s t
best → b e s t
menu → m e n u
men → m e n
camel → c a m e l
Step 3: Add Unique Characters to the Vocabulary
The initial vocabulary is populated with all unique characters present in the dataset:
vocab = {a, b, c, e, l, m, n, o, s, t, u}
This provides a starting vocabulary of 11 tokens.
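As a quick sketch, Steps 2 and 3 can be reproduced in a few lines of Python using the example word counts above:

```python
# Example dataset: word -> frequency (from the steps above).
word_counts = {"cost": 2, "best": 2, "menu": 1, "men": 1, "camel": 1}

# Step 2: split every word into a sequence of characters.
splits = {word: list(word) for word in word_counts}
# e.g. splits["cost"] == ["c", "o", "s", "t"]

# Step 3: the initial vocabulary is the set of unique characters.
vocab = sorted({ch for word in word_counts for ch in word})
print(vocab)       # ['a', 'b', 'c', 'e', 'l', 'm', 'n', 'o', 's', 't', 'u']
print(len(vocab))  # 11
```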
Step 4: Train a Language Model on the Dataset
A language model is trained on the character sequences derived from the dataset. This model learns the probability of symbol pairs occurring together, indicating how likely certain character combinations are.
Step 5: Merge Symbol Pairs Based on Maximum Likelihood
WordPiece merges the symbol pair with the highest likelihood, as determined by the trained language model, rather than solely relying on frequency as BPE does.
Example Merging Process:
- Highest Likelihood Pair: Suppose `s` and `t` form the pair with the highest likelihood.
  - Merge `s + t` → `st`
  - Add `st` to the vocabulary: `vocab = {a, b, c, e, l, m, n, o, s, t, u, st}`
- Next Highest Likelihood Pair: If `m` and `e` exhibit the next highest likelihood:
  - Merge `m + e` → `me`
  - Add `me` to the vocabulary: `vocab = {a, b, c, e, l, m, n, o, s, t, u, st, me}`
- Further Merging: To reach the target vocabulary size, subsequent merges are performed. For instance, merging `me` and `n`:
  - Merge `me + n` → `men`
  - Add `men` to the vocabulary: `vocab = {a, b, c, e, l, m, n, o, s, t, u, st, me, men}`
This merging process continues until the desired vocabulary size of 14 tokens is achieved.
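A minimal training loop in this spirit is sketched below. It reuses the `word_counts` and character `splits` from the earlier snippet, scores pairs with the count(ab) / (count(a) × count(b)) approximation rather than a full language model, and omits the `##` continuation prefix that production WordPiece trainers track, so its exact merges may differ from the hand-worked example.

```python
from collections import defaultdict

# Example dataset and initial state (same as the earlier snippet).
word_counts = {"cost": 2, "best": 2, "menu": 1, "men": 1, "camel": 1}
splits = {word: list(word) for word in word_counts}
vocab = sorted({ch for word in word_counts for ch in word})
target_vocab_size = 14

def pair_scores(splits, word_counts):
    """Score adjacent symbol pairs with count(ab) / (count(a) * count(b))."""
    pair_counts = defaultdict(int)
    symbol_counts = defaultdict(int)
    for word, symbols in splits.items():
        freq = word_counts[word]
        for symbol in symbols:
            symbol_counts[symbol] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += freq
    return {
        pair: count / (symbol_counts[pair[0]] * symbol_counts[pair[1]])
        for pair, count in pair_counts.items()
    }

def merge_pair(splits, pair):
    """Replace each occurrence of the pair with the merged symbol."""
    a, b = pair
    for word, symbols in splits.items():
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        splits[word] = merged
    return splits

# Keep merging the best-scoring pair until the target size is reached.
while len(vocab) < target_vocab_size:
    scores = pair_scores(splits, word_counts)
    if not scores:
        break
    best = max(scores, key=scores.get)
    splits = merge_pair(splits, best)
    vocab.append(best[0] + best[1])
    print("merged", best, "->", best[0] + best[1])

print(vocab)
```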
How WordPiece Tokenization Works at Inference
Once the vocabulary is built, WordPiece tokenizes new input text. Let's consider an example:
Input Word: `stem`
Assumed Vocabulary: `{a, b, c, e, l, m, n, o, s, t, u, st, me}`
The word `stem` is not directly present in the vocabulary. WordPiece tokenizes it using the following strategy:
- Longest Matching Subword: It attempts to match the longest possible subword from the beginning of the word that exists in the vocabulary. `st` is found in the vocabulary.
- Remaining Part: The remaining part of the word is `em`. Since `em` is not in the vocabulary, it is broken down further.
- Further Tokenization: The remainder `em` is split into its constituent characters, `e` and `m`, both of which are present in the vocabulary.
Final Token Sequence: `[st, ##e, ##m]`
Note: The `##` prefix signifies that the token is a continuation of a previous subword and not the beginning of a word.
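The greedy longest-match-first strategy can be sketched as follows. The vocabulary is the assumed one from this example, extended with hypothetical single-character continuation tokens (`##e`, `##m`, and so on) that a real WordPiece vocabulary would also contain; words with no valid segmentation fall back to an `[UNK]` token.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first WordPiece tokenization of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, shrinking until a vocab hit.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark continuation of a word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # no valid segmentation for this word
        tokens.append(match)
        start = end
    return tokens

# Assumed vocabulary from the example, plus ## continuation characters.
vocab = {"a", "b", "c", "e", "l", "m", "n", "o", "s", "t", "u", "st", "me",
         "##a", "##b", "##c", "##e", "##l", "##m", "##n", "##o", "##s", "##t", "##u"}

print(wordpiece_tokenize("stem", vocab))  # ['st', '##e', '##m']
```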
Summary: WordPiece Algorithm Steps
- Extract Words: Collect words from the dataset along with their frequencies.
- Define Vocabulary Size: Specify the target number of tokens for the vocabulary.
- Character Tokenization: Split all words into individual characters.
- Initialize Vocabulary: Add all unique characters to the initial vocabulary.
- Train Language Model: Train a language model on the character sequences to estimate symbol pair probabilities.
- Probabilistic Merging: Iteratively select and merge symbol pairs with the highest likelihood according to the trained model.
- Reach Vocabulary Size: Repeat the merging process until the vocabulary reaches the defined target size.
- Inference Tokenization: Use the built vocabulary to tokenize new input by applying longest subword matching and using `##` to indicate word continuations.
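To see WordPiece in action on a pretrained model, BERT's tokenizer can be loaded through the Hugging Face transformers library. This is a quick sketch: it downloads the bert-base-uncased vocabulary on first use, and the exact splits depend on that vocabulary.

```python
from transformers import BertTokenizer

# Load BERT's pretrained WordPiece vocabulary (roughly 30k tokens).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Words missing from the vocabulary are split into subwords;
# continuation pieces carry the ## prefix.
print(tokenizer.tokenize("tokenization"))
print(tokenizer.tokenize("snowboarding gecko"))
```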
Conclusion
WordPiece tokenization is a powerful subword segmentation technique that significantly improves the handling of rare or unknown words in Natural Language Processing (NLP) models. By employing a trained language model to guide vocabulary construction, WordPiece ensures that token splits are linguistically and statistically meaningful. This makes it an indispensable component of modern NLP frameworks, including BERT, contributing to enhanced generalization and deeper language understanding.
SEO Keywords for WordPiece Tokenization
- WordPiece tokenization explained
- WordPiece vs Byte Pair Encoding (BPE)
- How WordPiece builds vocabulary
- WordPiece subword tokenization algorithm
- Handling OOV words with WordPiece
- WordPiece tokenization in BERT
- Probabilistic merging in WordPiece
- Tokenizing unseen words with WordPiece
Interview Questions on WordPiece Tokenization
- What is WordPiece tokenization and why is it used?
- How does WordPiece differ from Byte Pair Encoding (BPE)?
- What role does a language model play in WordPiece vocabulary construction?
- Describe the step-by-step process of building a WordPiece vocabulary.
- How does WordPiece handle out-of-vocabulary (OOV) words during tokenization?
- Why does WordPiece use the longest subword matching strategy during tokenization?
- What is the significance of the “##” prefix in WordPiece tokens?
- How does WordPiece improve generalization in NLP models?
- Can WordPiece be used for languages other than English? Explain.
- How does WordPiece tokenization impact downstream NLP tasks like question answering?