BERT Tokenization with Hugging Face: Preprocess Your Data

Learn to preprocess text for BERT using Hugging Face Transformers: tokenize sentences and create the token IDs and attention masks the model expects.

Preprocessing Input Data for BERT with Hugging Face Transformers

This guide walks you through the essential steps of tokenizing a sentence and preparing it for input into a BERT model using the Hugging Face transformers library. We will demonstrate this process using the example sentence:

sentence = "I love Paris"

The goal is to transform this raw text into numerical representations that BERT can understand, including token IDs and an attention mask.

1. Tokenize the Sentence

The first step is to break down the sentence into smaller units called tokens. BERT typically uses a WordPiece tokenizer, which can split words into subwords if they are not in its vocabulary.

First, import a tokenizer class and load a pre-trained tokenizer that matches the model you plan to use. For BERT, a common choice is BertTokenizer from transformers.

from transformers import BertTokenizer

# Load the pre-trained BERT tokenizer (e.g., for uncased BERT)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the sentence
tokens = tokenizer.tokenize(sentence)
print(tokens)

Output:

['i', 'love', 'paris']

This output shows that the sentence has been split into its constituent words, all lowercased because the bert-base-uncased tokenizer lowercases its input. None of the words needed subword splitting, since all three appear in the vocabulary as whole units.
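
To see the subword behavior, try a word that is unlikely to appear in the vocabulary as a single unit. The exact split depends on the bert-base-uncased vocabulary, but a word like "tokenization" is typically broken into pieces, with continuation pieces prefixed by ##:

# A word outside the vocabulary is split into known subword units
print(tokenizer.tokenize("I love tokenization"))

# Expected output (may vary with the tokenizer version):
# ['i', 'love', 'token', '##ization']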

2. Add Special Tokens

BERT requires specific special tokens to be prepended and appended to the input sequence:

  • [CLS]: This token is added at the beginning of every sequence. It's used for classification tasks, as the final hidden state corresponding to this token can be used as the aggregate sequence representation.
  • [SEP]: This token is added at the end of each sentence. If you're processing two sentences (e.g., for question answering or natural language inference), it's used to separate them. For a single sentence, it marks the end.

# Add special tokens to the token list
tokens = ['[CLS]'] + tokens + ['[SEP]']
print(tokens)

Output:

['[CLS]', 'i', 'love', 'paris', '[SEP]']
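
Rather than hard-coding the special token strings, you can also read them from the tokenizer itself, which exposes them as attributes:

# The special tokens are available as tokenizer attributes
print(tokenizer.cls_token, tokenizer.sep_token, tokenizer.pad_token)
# Expected: [CLS] [SEP] [PAD]

# Equivalent to the manual step above
tokens = [tokenizer.cls_token] + tokenizer.tokenize(sentence) + [tokenizer.sep_token]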

3. Pad the Token List

For efficient processing, especially when working with batches of sentences, all sequences need to have the same length. This is achieved by padding shorter sequences with a special [PAD] token up to a predefined maximum sequence length. Let's assume a maximum length of 7 tokens for this example.

# Define a maximum sequence length
max_length = 7

# Pad the token list with '[PAD]' tokens
padding_length = max_length - len(tokens)
tokens = tokens + ['[PAD]'] * padding_length
print(tokens)

Output:

['[CLS]', 'i', 'love', 'paris', '[SEP]', '[PAD]', '[PAD]']
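
The same idea extends to a batch: every sequence is padded (and, if necessary, truncated) to the chosen maximum length. A minimal sketch, using a second, made-up example sentence purely for illustration:

# Hypothetical batch of two sentences with different lengths
batch = ["I love Paris", "Paris is beautiful in spring"]

batch_tokens = []
for text in batch:
    toks = tokenizer.tokenize(text)[:max_length - 2]  # leave room for [CLS] and [SEP]
    toks = ['[CLS]'] + toks + ['[SEP]']
    toks = toks + ['[PAD]'] * (max_length - len(toks))
    batch_tokens.append(toks)

print([len(toks) for toks in batch_tokens])
# Expected: [7, 7] -- every sequence now has the same length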

4. Create the Attention Mask

The attention mask is a binary sequence, one value per token, that tells the BERT model which positions are real input and which are padding. This prevents the model from attending to the [PAD] tokens.

  • 1: Indicates a real token (including [CLS], [SEP]).
  • 0: Indicates a [PAD] token.

# Create the attention mask
attention_mask = [1 if token != '[PAD]' else 0 for token in tokens]
print(attention_mask)

Output:

[1, 1, 1, 1, 1, 0, 0]
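
The rule generalizes to any padded token list: real tokens map to 1 and padding maps to 0. A small helper (hypothetical, not part of the transformers API) makes this explicit:

# Hypothetical helper: build an attention mask from a padded token list
def build_attention_mask(padded_tokens, pad_token='[PAD]'):
    return [0 if tok == pad_token else 1 for tok in padded_tokens]

print(build_attention_mask(tokens))
# Expected: [1, 1, 1, 1, 1, 0, 0]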

5. Convert Tokens to Token IDs

BERT models work with numerical representations. Each token is mapped to a unique integer ID from the model's vocabulary.

# Convert tokens to their corresponding IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)

Output:

[101, 1045, 2293, 3000, 102, 0, 0]

Here's a breakdown of the output IDs (these values come from the bert-base-uncased vocabulary; tokenizers built on other vocabularies will assign different IDs, but the principle remains):

  • 101: The ID for the [CLS] token.
  • 1045: The ID for the token i.
  • 2293: The ID for the token love.
  • 3000: The ID for the token paris.
  • 102: The ID for the [SEP] token.
  • 0: The ID for the [PAD] token.
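
To sanity-check the mapping, the IDs can be converted back to tokens, or decoded into a readable string, with the same tokenizer:

# Round-trip: IDs back to tokens, and to a plain string
print(tokenizer.convert_ids_to_tokens(token_ids))
# Expected: ['[CLS]', 'i', 'love', 'paris', '[SEP]', '[PAD]', '[PAD]']

print(tokenizer.decode(token_ids, skip_special_tokens=True))
# Expected: i love paris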

6. Convert to PyTorch Tensors

Finally, BERT models typically expect inputs in the form of tensors, often within a batch. We convert the token_ids and attention_mask lists into PyTorch tensors and add a batch dimension using unsqueeze(0).

import torch

# Convert lists to PyTorch tensors and add a batch dimension
token_ids = torch.tensor(token_ids).unsqueeze(0)
attention_mask = torch.tensor(attention_mask).unsqueeze(0)

print("Token IDs Tensor:", token_ids)
print("Attention Mask Tensor:", attention_mask)

Output:

Token IDs Tensor: tensor([[  101,  1045,  2293,  3000,   102,     0,     0]])
Attention Mask Tensor: tensor([[1, 1, 1, 1, 1, 0, 0]])

The .unsqueeze(0) adds a dimension at the beginning, transforming a 1D tensor of shape (sequence_length,) into a 2D tensor of shape (batch_size, sequence_length), which is the standard input format for BERT.
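
In practice, the tokenizer can perform this entire pipeline (tokenization, special tokens, padding, truncation, ID conversion, attention mask, and tensor creation) in a single call. A minimal sketch, assuming the same maximum length of 7:

# One call covers all of the manual steps above
encoded = tokenizer(
    sentence,
    padding='max_length',    # pad to max_length with [PAD]
    truncation=True,         # truncate longer inputs
    max_length=7,
    return_tensors='pt',     # return PyTorch tensors with a batch dimension
)

print(encoded['input_ids'])
print(encoded['attention_mask'])

The returned dictionary also contains token_type_ids (segment IDs), which BERT uses to distinguish sentence pairs.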

What's Next?

With the pre-processed token IDs and attention mask ready, you can now feed these tensors into a pre-trained BERT model to extract contextual word and sentence embeddings (a minimal sketch follows the list below). These embeddings can then be used for various downstream Natural Language Processing (NLP) tasks, such as:

  • Sentiment Analysis
  • Text Classification
  • Question Answering
  • Named Entity Recognition
  • Semantic Search
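
As a minimal sketch of that next step (using BertModel from transformers; the attribute names below assume a recent version of the library):

from transformers import BertModel

# Load the pre-trained BERT model matching the tokenizer
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

with torch.no_grad():
    outputs = model(input_ids=token_ids, attention_mask=attention_mask)

# Contextual embeddings for every token: (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)   # torch.Size([1, 7, 768])

# The [CLS] position is often used as a sequence-level representation
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(cls_embedding.shape)               # torch.Size([1, 768])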

Frequently Asked Questions (FAQ) / Interview Questions

  • Why are special tokens like [CLS] and [SEP] required in BERT?

    • [CLS] is crucial for sequence-level classification tasks, as its final hidden state is typically used as a representation of the entire input.
    • [SEP] is used to distinguish between different segments of text (e.g., separating a question from a passage in question answering, or two sentences in a sentence-pair task).
  • What is the purpose of the [PAD] token in BERT input preparation?

    • The [PAD] token is used to equalize the length of all input sequences within a batch. This is necessary for efficient matrix operations in deep learning models, as they require inputs of consistent dimensions.
  • What does the attention mask represent in BERT?

    • The attention mask is a binary tensor that signals to the model which tokens should be attended to (value 1) and which should be ignored (value 0, typically for padding tokens). This ensures that the model doesn't learn from padding and focuses only on the meaningful parts of the input.
  • Explain how WordPiece tokenization works in BERT.

    • WordPiece starts by treating each word as a single token. If a word is not in the vocabulary, it is broken into subword units, greedily matching the longest pieces present in the vocabulary; continuation pieces are marked with a '##' prefix (e.g., 'token', '##ization'). The vocabulary itself is built by favoring subword units that frequently occur in the training data, which balances a manageable vocabulary size with the ability to handle out-of-vocabulary words by splitting them into smaller, known units.
  • Why do we convert token IDs and attention masks into tensors?

    • Deep learning frameworks like PyTorch and TensorFlow operate on tensors. Converting the lists of IDs and masks into tensors is a prerequisite for feeding them into the BERT model for computation.
  • What is the shape of BERT’s input tensor and why is unsqueeze(0) used?

    • BERT typically expects input tensors with a shape of (batch_size, sequence_length). The unsqueeze(0) operation is used to add the batch_size dimension when processing a single example, transforming a (sequence_length,) tensor into a (1, sequence_length) tensor, effectively treating it as a batch of one.
  • How do you ensure consistency in input length when using BERT?

    • Input length consistency is ensured by padding shorter sequences to a maximum length (either a fixed predefined length or the length of the longest sequence in a batch) using [PAD] tokens and creating a corresponding attention mask.
  • What happens if special tokens are not added before feeding input into BERT?

    • If special tokens ([CLS], [SEP]) are omitted, the model may not perform correctly, especially on tasks that rely on them. For classification, the lack of [CLS] means there's no dedicated token for aggregate representation. For sentence segmentation, the lack of [SEP] can lead to misinterpretation of sentence boundaries.
  • How do token IDs differ between the base vocabulary and padded inputs?

    • Token IDs from the base vocabulary represent actual words or subwords and are assigned specific integer values (e.g., 1045 for 'i'). Padded inputs are represented by a specific token ID, typically 0, which is reserved for the [PAD] token.
  • What are the next steps after preparing BERT inputs?

    • The prepared input tensors (token IDs and attention mask) are then passed to the BERT model. The model processes these inputs to generate contextual embeddings, which can subsequently be used for downstream tasks.