WordPiece Tokenizer: BERT's Subword Tokenization Explained

Discover how BERT's WordPiece tokenizer uses subword tokenization to handle OOV words, breaking them into smaller units for effective natural language processing in AI.

WordPiece Tokenizer in BERT

BERT (Bidirectional Encoder Representations from Transformers) utilizes a specialized tokenization technique known as the WordPiece tokenizer. This tokenizer implements a subword tokenization approach, allowing BERT to cover a vast range of words with a fixed-size vocabulary, including rare or out-of-vocabulary (OOV) terms, by breaking them into smaller, more manageable components.

What is the WordPiece Tokenizer?

The WordPiece tokenizer's primary function is to split words into subword units that exist within BERT's vocabulary. BERT's vocabulary typically contains around 30,000 tokens.

  • Vocabulary Match: If a word is present in BERT's vocabulary, it is used as is.
  • Subword Splitting: If a word is not found in the vocabulary, it is broken down into smaller, known subwords using a greedy longest-match-first strategy. Splitting continues until the whole word is covered by vocabulary entries (in the worst case, individual characters); if no valid split exists, the word is mapped to the unknown token. A minimal sketch of this procedure follows this list.
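
The following is a minimal, self-contained sketch of the greedy longest-match-first splitting described above. The tiny vocabulary at the bottom is hypothetical and exists only to make the example runnable; a real WordPiece vocabulary is learned from a corpus and contains roughly 30,000 entries.

def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split a single word into WordPiece subwords, longest match first."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, shrinking until a match is found.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched: fall back to the unknown token
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary, for illustration only.
vocab = {"let", "us", "start", "pre", "##train", "##ing", "the", "model"}
print(wordpiece_tokenize("pretraining", vocab))  # ['pre', '##train', '##ing']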

Example: Tokenizing with WordPiece

Consider the sentence:

"Let us start pretraining the model."

When processed by the WordPiece tokenizer, this sentence is broken down as follows:

tokens = [let, us, start, pre, ##train, ##ing, the, model]

In this example, the word "pretraining" is not directly in the vocabulary. Therefore, it is split into:

  • pre
  • ##train
  • ##ing

The ## prefix signifies that the token is a continuation of the preceding token within the same word, rather than the start of a new word. This convention lets the original word be reconstructed unambiguously from its subword pieces.
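
Assuming the Hugging Face transformers library and the bert-base-uncased checkpoint, this tokenization can be reproduced as shown below; the exact splits depend on the checkpoint's learned vocabulary, so they may differ slightly from the hand-written example above.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.vocab_size)  # about 30,000 entries (30,522 for this checkpoint)

tokens = tokenizer.tokenize("Let us start pretraining the model.")
print(tokens)
# Expected output along the lines of:
# ['let', 'us', 'start', 'pre', '##train', '##ing', 'the', 'model', '.']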

Why Use WordPiece Tokenization?

The main advantage of WordPiece tokenization for BERT is its ability to handle out-of-vocabulary (OOV) words. Instead of ignoring or replacing unknown words, the tokenizer splits them into familiar subwords. This capability ensures that BERT can generalize more effectively, especially when encountering unseen data.

For instance, a rare term like "microlearning" might be tokenized as:

[micro, ##learn, ##ing]

Even if the complete word "microlearning" is not in the vocabulary, BERT can still infer meaning from its subword components (micro, ##learn, ##ing), facilitating better understanding and prediction.
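
Using the tokenizer loaded in the earlier snippet, it is easy to check that an unseen word still maps to real vocabulary IDs rather than to the unknown token; the exact subword split depends on the learned vocabulary.

# "microlearning" is unlikely to be a single vocabulary entry, but its pieces are.
subwords = tokenizer.tokenize("microlearning")
print(subwords)  # e.g. something like ['micro', '##learn', '##ing']
print(tokenizer.convert_tokens_to_ids(subwords))  # each piece gets a real ID; no [UNK] is produced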

Special Tokens

After the initial tokenization, BERT adds two special tokens to the sequence:

  • [CLS]: This token is prepended to the beginning of the input sequence. Its final hidden state serves as an aggregate representation of the entire sequence and is used for classification tasks such as sentiment analysis.
  • [SEP]: This token is appended to the end of the input sequence. It serves to mark the separation between different sentences (in tasks involving sentence pairs) or to denote the end of a single sentence.

Therefore, the final token sequence for the example sentence becomes:

tokens = [ [CLS], let, us, start, pre, ##train, ##ing, the, model, [SEP] ]
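
With the Hugging Face tokenizer from the earlier snippet, these special tokens are inserted automatically when the full encoding call is used instead of tokenize():

encoding = tokenizer("Let us start pretraining the model.")
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# Expected output (note that the period is also tokenized):
# ['[CLS]', 'let', 'us', 'start', 'pre', '##train', '##ing', 'the', 'model', '.', '[SEP]']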

From Tokens to BERT Input

These processed tokens, including the special tokens, are then fed into BERT through a series of embedding layers:

  1. Token Embedding Layer: Converts each token into a numerical vector representation.
  2. Segment Embedding Layer: Distinguishes between different segments or sentences within the input (e.g., sentence A and sentence B).
  3. Position Embedding Layer: Captures the positional information of each token within the sequence.

The outputs from these three embedding layers are summed element-wise to produce the final input embeddings, which are then passed into BERT's encoder for further processing.
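
A minimal PyTorch sketch of this combination is shown below. The sizes match bert-base (hidden size 768, maximum length 512, vocabulary of 30,522 tokens), but the layer and variable names are illustrative rather than BERT's internal ones, and the real model additionally applies layer normalization and dropout to the summed embeddings.

import torch
import torch.nn as nn

vocab_size, hidden_size, max_len, num_segments = 30522, 768, 512, 2

token_emb = nn.Embedding(vocab_size, hidden_size)      # one vector per vocabulary token
segment_emb = nn.Embedding(num_segments, hidden_size)  # sentence A vs. sentence B
position_emb = nn.Embedding(max_len, hidden_size)      # learned position vectors

# Toy token IDs for a batch of one sequence (101 = [CLS] and 102 = [SEP] in
# bert-base-uncased; the IDs in between are arbitrary placeholders).
input_ids = torch.tensor([[101, 2292, 2149, 2707, 102]])
segment_ids = torch.zeros_like(input_ids)                    # every token belongs to sentence A
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)  # positions 0, 1, 2, ...

# The three embeddings are summed element-wise to form the encoder input.
embeddings = token_emb(input_ids) + segment_emb(segment_ids) + position_emb(position_ids)
print(embeddings.shape)  # torch.Size([1, 5, 768])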
