Tokenizing with Byte Pair Encoding (BPE)
This document explains the process of tokenizing text using Byte Pair Encoding (BPE), building upon a previously established vocabulary.
The Vocabulary
From a given dataset, the following vocabulary was created:
vocabulary = {'a', 'b', 'c', 'e', 'l', 'm', 'n', 'o', 's', 't', 'u', 'st', 'me', 'men'}
This vocabulary contains a mix of individual characters and frequently occurring subword units.
How BPE Tokenization Works
BPE tokenization works by iteratively splitting input text into the longest possible matching units found in the vocabulary. If a full word is not found, it is broken down into subwords, and this process continues until all components are either present in the vocabulary or reduced to individual characters.
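To make this concrete, here is a minimal Python sketch of the greedy, longest-match-first lookup described above. It is an illustration rather than a production BPE implementation (real tokenizers apply the learned merge rules rather than substring matching), and the function name tokenize and the unk parameter are chosen here for clarity, not taken from any library.

# The sample vocabulary defined above.
vocabulary = {'a', 'b', 'c', 'e', 'l', 'm', 'n', 'o', 's', 't', 'u', 'st', 'me', 'men'}

def tokenize(word, vocab, unk='<UNK>'):
    """Greedily split `word` into the longest matching vocabulary units, left to right."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, shrinking it until a match is found.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Not even the single character is in the vocabulary: emit the unknown token.
            tokens.append(unk)
            i += 1
    return tokens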
Example 1: Tokenizing "mean"
Let's illustrate with the input word "mean":
- Initial Check: The word "mean" is not found directly in our vocabulary.
- First Split: We split "mean" into subwords: ['me', 'an'].
- Subword Check: 'me' is present in the vocabulary; 'an' is not.
- Second Split: Since 'an' is not in the vocabulary, we split it further into individual characters: ['a', 'n'].
- Character Check: Both 'a' and 'n' are present in the vocabulary.
- Final Tokens: All components are now found in the vocabulary. The final tokens for "mean" are: ['me', 'a', 'n'].
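Running the tokenize sketch from earlier (an illustrative helper, not a library function) reproduces this walk-through:

print(tokenize('mean', vocabulary))   # ['me', 'a', 'n']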
Example 2: Tokenizing "bear"
Let's consider the input word "bear":
- Initial Check: The word "bear" is not found directly in our vocabulary.
- First Split: We split "bear" into subwords: ['be', 'ar'].
- Subword Check: Neither 'be' nor 'ar' is present in the vocabulary.
- Second Split: Since neither subword is in the vocabulary, we split both further into individual characters: ['b', 'e'] and ['a', 'r'].
- Character Check: 'b', 'e', and 'a' are present in the vocabulary; 'r' is not.
- Handling Unknown Characters: We cannot split 'r' further, as it is already an individual character. In such cases, it is replaced by an unknown token, typically represented as <UNK>.
- Final Tokens: The final tokens for "bear" are: ['b', 'e', 'a', '<UNK>'].
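The tokenize sketch from earlier yields the same result, emitting the unknown token for 'r':

print(tokenize('bear', vocabulary))   # ['b', 'e', 'a', '<UNK>']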
Note on <UNK> Tokens: The appearance of an <UNK> token in this example is due to the limited size of our sample vocabulary. In practice, when BPE is trained on a very large corpus, the vocabulary typically includes all individual characters, thus minimizing the occurrence of <UNK> tokens for words that can be fully decomposed into known characters.
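To illustrate this point with the earlier sketch (still an illustrative helper, not a library function), adding the missing character to the sample vocabulary removes the <UNK>:

print(tokenize('bear', vocabulary | {'r'}))   # ['b', 'e', 'a', 'r']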
Example 3: Tokenizing "men"
Let's consider the input word "men":
- Initial Check: The word "men" is present in our vocabulary.
- Final Token: Since the entire word is found, it is returned directly as a single token: ['men'].
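With the earlier sketch, the whole-word match is found on the first pass:

print(tokenize('men', vocabulary))   # ['men']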
Summary
Through these examples, we've demonstrated how BPE tokenizes input text by prioritizing longer matches from the vocabulary. This process allows for efficient representation of text by breaking down rare or unseen words into known subword units and individual characters, or an <UNK>
token when necessary. This understanding sets the stage for exploring more advanced BPE variations, such as byte-level BPE.