BERT Embeddings with Hugging Face Transformers

Learn how to extract contextualized word and sentence embeddings from BERT using Hugging Face Transformers. Get step-by-step guidance for AI and ML applications.

Getting Embeddings from BERT with Hugging Face Transformers

This document explains how to extract contextualized word and sentence embeddings using the pre-trained BERT model with Hugging Face's transformers library. We will demonstrate how to feed preprocessed inputs to the model and retrieve embeddings from the final encoder layer.
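The snippets below assume a setup along the following lines; the example sentence 'I love Paris' and the fixed length of 7 are illustrative choices that reproduce the shapes shown later.

from transformers import BertModel, BertTokenizer

# Load the pre-trained BERT-base model and its tokenizer
model = BertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize an example sentence, padding to a fixed length of 7 so the
# sequence becomes [CLS] i love paris [SEP] [PAD] [PAD]
inputs = tokenizer('I love Paris', padding='max_length', max_length=7,
                   return_tensors='pt')
token_ids = inputs['input_ids']            # shape: [1, 7]
attention_mask = inputs['attention_mask']  # 1 for real tokens, 0 for [PAD]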

1. Feeding Inputs to BERT

To obtain the output embeddings, we feed the token_ids and attention_mask to the BERT model.

# Assuming 'model' is a pre-trained BERT model from Hugging Face
# and 'token_ids' and 'attention_mask' are PyTorch tensors
outputs = model(token_ids, attention_mask=attention_mask)
hidden_rep = outputs.last_hidden_state  # all token embeddings, final layer
cls_head = outputs.pooler_output        # pooled [CLS] representation

The model's output object exposes, among other fields:

  • hidden_rep (last_hidden_state): Embeddings of all tokens from the final encoder layer (layer 12 for BERT-base).
  • cls_head (pooler_output): The [CLS] token's final hidden state passed through BERT's pooling layer (a linear transformation followed by tanh), commonly used to represent the entire sentence.

(In older releases of transformers, or when the model is called with return_dict=False, the same two values come back as a plain tuple.)

2. Understanding the Output Shape

You can inspect the shape of the hidden_rep to understand its structure.

print(hidden_rep.shape)

Example Output:

torch.Size([1, 7, 768])

This shape signifies:

  • 1: Batch size (number of sequences processed at once).
  • 7: Sequence length (total number of tokens in the input, including special tokens like [CLS], [SEP], and any padding [PAD]).
  • 768: Hidden size (the dimensionality of the embedding vectors for each token, typical for BERT-base models).
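A quick sanity check of the three dimensions, assuming the setup shown earlier:

# Each dimension can be verified against the inputs and the model config
batch_size, seq_len, hidden_size = hidden_rep.shape
assert batch_size == token_ids.shape[0]         # one sequence in the batch
assert seq_len == token_ids.shape[1]            # 7 tokens incl. [CLS]/[SEP]/[PAD]
assert hidden_size == model.config.hidden_size  # 768 for BERT-base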

3. Extracting Token-Level Embeddings

Contextual embeddings for each individual token can be accessed by indexing the hidden_rep tensor. The index [0] selects the first (and in this example, only) sequence in the batch.

# Example indexing to get embeddings for specific tokens
# (positions assume the sequence [CLS] i love paris [SEP] [PAD] [PAD]):
cls_embedding = hidden_rep[0][0]           # embedding of the [CLS] token
first_token_embedding = hidden_rep[0][1]   # first word token ('i')
second_token_embedding = hidden_rep[0][2]  # second word token ('love')
third_token_embedding = hidden_rep[0][3]   # third word token ('paris')
sep_embedding = hidden_rep[0][4]           # embedding of the [SEP] token
pad_embedding = hidden_rep[0][5]           # embedding of a [PAD] token

Each of these embeddings is a 768-dimensional vector that captures the contextual meaning of its corresponding token within the input sentence.
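To check which token sits at which position, you can line the embeddings up against the tokenizer's output. A minimal sketch, assuming the setup shown earlier:

# Pair each position with its token string and its 768-dimensional vector
tokens = tokenizer.convert_ids_to_tokens(token_ids[0].tolist())
for position, token in enumerate(tokens):
    embedding = hidden_rep[0][position]
    print(position, token, tuple(embedding.shape))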

4. Extracting Sentence-Level Embedding Using [CLS]

The [CLS] token is always the first token in the input sequence, and its embedding is commonly used as a representation of the entire sentence. The cls_head returned above is this embedding after BERT's pooling layer.

print(cls_head.shape)

Example Output:

torch.Size([1, 768])

This [1, 768] tensor provides a compact, fixed-size representation of the input sentence and is suitable for various downstream tasks, such as text classification, sentiment analysis, or question answering.
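As an illustration of such downstream use, a classifier can place a single linear layer on top of this vector. A minimal sketch; the two-label head here is hypothetical and untrained, not part of the pre-trained model:

import torch.nn as nn

# Hypothetical binary (e.g., positive/negative) classification head
classifier = nn.Linear(768, 2)
logits = classifier(cls_head)  # shape: [1, 2], one logit per class
print(logits.shape)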

Summary

  • hidden_rep provides contextual embeddings for every token in the input sequence from BERT's final encoder layer.
  • cls_head provides the pooled embedding derived from the [CLS] token, which serves as a convenient sentence-level representation.

These contextual embeddings are dynamically generated, meaning the embedding for a word can change based on the surrounding words and the overall sentence structure, unlike static embeddings like Word2Vec.
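You can observe this directly by embedding the same word in two different contexts and comparing the resulting vectors. A minimal sketch under the same setup as above; the sentences are illustrative:

import torch

def embed_word(sentence, word):
    # Return the final-layer embedding of the first occurrence of `word`
    inputs = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0].tolist())
    return outputs.last_hidden_state[0][tokens.index(word)]

river_bank = embed_word('He sat on the river bank', 'bank')
money_bank = embed_word('He deposited cash at the bank', 'bank')

# The same word gets different vectors in different contexts, so the
# cosine similarity is noticeably below 1.0
print(torch.cosine_similarity(river_bank, money_bank, dim=0))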

In the next section, you might explore how to extract embeddings from all encoder layers of BERT, which can be beneficial for tasks requiring layer-wise analysis, visualization, or fine-grained feature engineering.
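As a preview, a minimal sketch of how that works with this library: passing output_hidden_states=True makes the model return the output of the embedding layer plus every encoder layer.

outputs = model(token_ids, attention_mask=attention_mask,
                output_hidden_states=True)
all_layers = outputs.hidden_states  # tuple of 13 tensors for BERT-base:
                                    # embedding layer + 12 encoder layers
print(len(all_layers), all_layers[-1].shape)  # 13 torch.Size([1, 7, 768])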


SEO Keywords

  • Get CLS vector from BERT Transformers
  • Extract BERT embeddings Hugging Face
  • Contextual word embeddings BERT
  • Sentence embedding using BERT CLS token
  • BERT hidden states output shape
  • Token-level vs sentence-level embeddings BERT
  • BERT encoder layer embeddings extraction
  • Hugging Face BERT output interpretation

Interview Questions

  • What is the significance of the [CLS] token in BERT embeddings?
  • What does the output of a BERT model contain when using Hugging Face Transformers?
  • Explain the shape [1, 7, 768] of BERT’s hidden representations.
  • How are contextual embeddings different from static embeddings like Word2Vec?
  • How can you extract token-level embeddings from a BERT model?
  • What use cases benefit from using [CLS] embeddings as sentence representations?
  • What does each of the 768 dimensions represent in BERT embeddings?
  • How does padding affect the output embeddings from BERT?
  • What is the difference between hidden_rep and cls_head in BERT’s output?
  • Can you use embeddings from intermediate layers of BERT? Why might that be useful?