BPE Tokenization: Learn Byte Pair Encoding for LLMs

Master Byte Pair Encoding (BPE) for text tokenization. Understand how BPE builds vocabularies and processes text, essential for Natural Language Processing and LLMs.

Tokenizing with Byte Pair Encoding (BPE)

This document explains the process of tokenizing text using Byte Pair Encoding (BPE), building upon a previously established vocabulary.

The Vocabulary

From a given dataset, the following vocabulary was created:

vocabulary = {'a', 'b', 'c', 'e', 'l', 'm', 'n', 'o', 's', 't', 'u', 'st', 'me', 'men'}

This vocabulary contains a mix of individual characters and frequently occurring subword units.

How BPE Tokenization Works

BPE tokenization works by iteratively splitting input text into the longest possible matching units found in the vocabulary. If a full word is not found, it is broken down into subwords, and this process continues until all components are either present in the vocabulary or reduced to individual characters.
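
To make this concrete, here is a minimal Python sketch of the matching procedure described above. The function name tokenize, the unk parameter, and the greedy longest-match scan are illustrative assumptions rather than any library's API; a real BPE encoder replays its learned merge rules in order, but for a small fixed vocabulary like ours this scan reproduces the same "longest known unit first" behaviour.

vocabulary = {'a', 'b', 'c', 'e', 'l', 'm', 'n', 'o', 's', 't', 'u', 'st', 'me', 'men'}

def tokenize(word, vocab=vocabulary, unk='<UNK>'):
    # Scan left to right, always taking the longest vocabulary unit
    # that starts at the current position.
    tokens = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No match at all: even the single character is unknown,
            # so it is replaced by the <UNK> token.
            tokens.append(unk)
            i += 1
    return tokens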

Example 1: Tokenizing "mean"

Let's illustrate with the input word "mean":

  1. Initial Check: The word "mean" is not found directly in our vocabulary.
  2. First Split: We split "mean" into subwords: ['me', 'an'].
  3. Subword Check:
    • 'me' is present in the vocabulary.
    • 'an' is not present in the vocabulary.
  4. Second Split: Since 'an' is not in the vocabulary, we split it further into individual characters: ['a', 'n'].
  5. Character Check:
    • 'a' is present in the vocabulary.
    • 'n' is present in the vocabulary.
  6. Final Tokens: All components are now found in the vocabulary. The final tokens for "mean" are: ['me', 'a', 'n'].
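
With the illustrative tokenize sketch from above, this result can be checked directly:

print(tokenize("mean"))   # ['me', 'a', 'n']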

Example 2: Tokenizing "bear"

Let's consider the input word "bear":

  1. Initial Check: The word "bear" is not found directly in our vocabulary.
  2. First Split: We split "bear" into subwords: ['be', 'ar'].
  3. Subword Check:
    • 'be' is not present in the vocabulary.
    • 'ar' is not present in the vocabulary.
  4. Second Split: Since neither 'be' nor 'ar' is in the vocabulary, we split both further into individual characters: ['b', 'e', 'a', 'r'].
  5. Character Check:
    • 'b', 'e', and 'a' are present in the vocabulary.
    • 'r' is not present in the vocabulary.
  6. Handling Unknown Characters: We cannot split 'r' further as it's already an individual character. In such cases, it is replaced by an unknown token, typically represented as <UNK>.
  7. Final Tokens: The final tokens for "bear" are: ['b', 'e', 'a', '<UNK>'].

Note on <UNK> Tokens: The <UNK> token appears here only because our sample vocabulary is deliberately small. In practice, when BPE is trained on a large corpus, the vocabulary includes all individual characters, so words can be fully decomposed into known units and <UNK> tokens rarely occur.
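
Using the same illustrative tokenize sketch, the walkthrough for "bear" comes out as:

print(tokenize("bear"))   # ['b', 'e', 'a', '<UNK>']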

Example 3: Tokenizing "men"

Let's consider the input word "men":

  1. Initial Check: The word "men" is present in our vocabulary.
  2. Final Token: Since the entire word is found, it is returned directly as a single token: ['men'].
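
And with the same sketch, a word that is itself a vocabulary entry comes back as a single token:

print(tokenize("men"))    # ['men']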

Summary

Through these examples, we've demonstrated how BPE tokenizes input text by prioritizing longer matches from the vocabulary. This process allows for efficient representation of text by breaking down rare or unseen words into known subword units and individual characters, or an <UNK> token when necessary. This understanding sets the stage for exploring more advanced BPE variations, such as byte-level BPE.