Extract BERT Embeddings: Word & Sentence Level

Learn to extract contextual word and sentence embeddings from pre-trained BERT models for NLP tasks like sentiment analysis and text classification. Get started today!

Extracting Embeddings from Pre-Trained BERT

This guide explains how to extract contextual word and sentence embeddings from a pre-trained BERT model. These embeddings are invaluable for various Natural Language Processing (NLP) tasks, including sentiment analysis, text classification, and more.

We will walk through the process with a simple example sentence: "I love Paris." Our objective is to obtain both word-level and sentence-level embeddings using a pre-trained BERT model.

Step-by-Step Guide to Extract Embeddings from BERT

The process involves several key steps:

1. Tokenization

The first step is to tokenize the input sentence. BERT models typically use a subword tokenization algorithm like WordPiece. This breaks down words into smaller units, enabling the model to handle out-of-vocabulary words and capture morphological information.

For our example sentence, "I love Paris," the tokenization process would look like this:

tokens = ["I", "love", "Paris"]

BERT requires special tokens to be added to the tokenized sequence:

  • [CLS] (Classification Token): Placed at the beginning of the sequence. Its final hidden state is often used as the aggregate representation for the entire sequence (e.g., for classification tasks).
  • [SEP] (Separator Token): Placed at the end of a sentence or between two sentences in tasks like Question Answering or Natural Language Inference.

Adding these special tokens to our example:

tokens = ["[CLS]", "I", "love", "Paris", "[SEP]"]

2. Padding

To process sequences of varying lengths in a single batch, all inputs must be padded to the same length (BERT supports a maximum of 512 tokens). Shorter sequences are padded with a special [PAD] token up to this length.

Let's assume a fixed input length of 7 for our example:

tokens = ["[CLS]", "I", "love", "Paris", "[SEP]", "[PAD]", "[PAD]"]

3. Creating the Attention Mask

An attention mask is crucial for BERT to distinguish between real tokens and the padding tokens. The mask is a sequence of the same length as the input tokens, where a value of 1 indicates a real token and 0 indicates a padding token. This ensures that BERT only pays attention to the actual content of the input.

For our padded sequence:

attention_mask = [1, 1, 1, 1, 1, 0, 0]
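Continuing the sketch, the mask can be derived directly from the padded token list:

# 1 for real tokens, 0 for [PAD] tokens
attention_mask = [1 if token != "[PAD]" else 0 for token in tokens]
# [1, 1, 1, 1, 1, 0, 0]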

4. Mapping Tokens to Token IDs

Each token in the vocabulary is assigned a unique integer ID. The tokenizer converts the tokenized sequence into a sequence of these token IDs, which the BERT model can process.

Here's a mapping for our example:

token_ids = [101, 1045, 2293, 3000, 102, 0, 0]

Mapping Breakdown:

  • 101: [CLS]
  • 1045: I
  • 2293: love
  • 3000: Paris
  • 102: [SEP]
  • 0: [PAD]
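With the tokenizer from the earlier sketch, this conversion is a single call:

token_ids = tokenizer.convert_tokens_to_ids(tokens)
# [101, 1045, 2293, 3000, 102, 0, 0]

In everyday use, calling the tokenizer directly, e.g. tokenizer("I love Paris", padding="max_length", max_length=7, truncation=True), performs steps 1 through 4 in one go; the manual version is shown here only to mirror the walkthrough.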

5. Feeding Inputs to the Pre-trained BERT Model

The token_ids and attention_mask (along with optional token_type_ids for sentence pair tasks) are fed into the pre-trained BERT model. BERT, typically composed of 12 transformer encoder layers in its "base" version, processes these inputs. Each layer refines the token representations by incorporating contextual information from other tokens in the sequence.
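A minimal sketch of this forward pass with Hugging Face Transformers, reusing the token_ids and attention_mask lists built above (and again assuming bert-base-uncased):

import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

input_ids = torch.tensor([token_ids])    # shape: (1, 7)
mask = torch.tensor([attention_mask])    # shape: (1, 7)

# Run the 12 encoder layers; no gradients are needed for feature extraction.
with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=mask)

last_hidden_state = outputs.last_hidden_state   # shape: (1, 7, 768)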

6. Extracting Word Embeddings

The output from BERT's final encoder layer provides contextualized embeddings for each token in the input sequence. These embeddings capture the meaning of a word as it appears in its specific context.

For our example sentence:

  • Embedding[1] represents the contextual embedding for the token "I".
  • Embedding[2] represents the contextual embedding for the token "love".
  • Embedding[3] represents the contextual embedding for the token "Paris".

In BERT-base, each of these token-level embeddings is a 768-dimensional vector.
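Continuing the sketch above, the per-token vectors are simply slices of last_hidden_state (position 0 holds [CLS], positions 1-3 the words, position 4 [SEP]):

i_embedding = last_hidden_state[0, 1]       # 768-dim vector for "i"
love_embedding = last_hidden_state[0, 2]    # 768-dim vector for "love"
paris_embedding = last_hidden_state[0, 3]   # 768-dim vector for "paris"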

7. Extracting Sentence Embedding

To obtain a single embedding that represents the entire sentence, there are several common strategies:

  • Using the [CLS] Token Embedding: The embedding of the [CLS] token (the first token in the sequence) is often used as the aggregate representation for the entire sentence. This embedding is designed to capture sentence-level semantics, especially when the model is fine-tuned for classification tasks.

    sentence_embedding = output[0]  # output[0]: final hidden state at position 0, i.e. the [CLS] token
  • Pooling Methods: While the [CLS] token embedding is a common choice, it may not always be the most effective. Alternative methods for sentence representation include:

    • Average Pooling: Averaging the embeddings of all tokens (excluding padding) in the sequence.
    • Max Pooling: Taking the element-wise maximum across the embeddings of all tokens.

    These pooling strategies can sometimes yield better sentence representations, depending on the specific downstream task. These methods are explored in more detail in advanced sections.
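Both strategies can be sketched in a few lines, again reusing last_hidden_state and attention_mask from the earlier steps:

# Strategy 1: take the [CLS] token's final hidden state
cls_embedding = last_hidden_state[0, 0]                       # shape: (768,)

# Strategy 2: mean-pool over real tokens only, using the attention mask
mask_f = torch.tensor([attention_mask]).unsqueeze(-1).float()  # shape: (1, 7, 1)
mean_embedding = (last_hidden_state * mask_f).sum(dim=1) / mask_f.sum(dim=1)   # shape: (1, 768)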

Use Case: Sentiment Analysis

Let's illustrate how these embeddings can be used for a practical NLP task like sentiment analysis.

Consider the following sample dataset:

Sentence          Sentiment
I love Paris      1 (positive)
I hate traffic    0 (negative)

Process:

  1. Tokenize and Embed: Each sentence is tokenized, padded, and fed into the BERT model to obtain its sentence embedding (e.g., using the [CLS] token or pooling).
  2. Feature Extraction: The resulting sentence embeddings serve as feature vectors for each sentence.
  3. Classifier Training: These feature vectors can then be used to train a downstream classifier, such as a logistic regression model, a Support Vector Machine (SVM), or a simple neural network, to predict the sentiment (positive or negative) of new, unseen sentences.
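A minimal end-to-end sketch of this workflow, reusing the tokenizer and model from the earlier steps together with a scikit-learn classifier (the two-sentence dataset above is far too small for real training; it only illustrates the shapes involved):

from sklearn.linear_model import LogisticRegression

sentences = ["I love Paris", "I hate traffic"]
labels = [1, 0]

# Tokenize, pad, and build attention masks for the whole batch in one call.
encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoded)

# Use the [CLS] position of each sentence as its feature vector.
features = outputs.last_hidden_state[:, 0, :].numpy()   # shape: (2, 768)

classifier = LogisticRegression().fit(features, labels)
# classifier.predict(new_features) can now score unseen sentences
# encoded the same way.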

Key Takeaways

  • Pre-trained BERT models provide rich, contextualized embeddings for both individual words and entire sentences.
  • The [CLS] token's final hidden state is a common proxy for sentence-level representation, although pooling methods (average, max) can also be effective.
  • BERT embeddings can be directly used as input features to train classifiers for various downstream NLP tasks.
  • The typical workflow involves tokenization, padding, creating an attention mask, feeding these into the BERT model, and extracting the relevant output embeddings.

Next Steps

In the subsequent sections, we will explore how to implement this embedding extraction process efficiently using the popular Hugging Face Transformers library, which significantly simplifies working with BERT and other transformer-based models.


SEO Keywords

  • Extract BERT embeddings Python
  • Word and sentence embeddings from BERT
  • Contextual embeddings using BERT
  • BERT sentence representation with CLS token
  • Tokenization and padding in BERT
  • Attention mask BERT tutorial
  • Use BERT embeddings for sentiment analysis
  • How to get BERT embeddings with Hugging Face Transformers

Interview Questions

  • What is the role of the [CLS] token in BERT, and how is it used for sentence embeddings?
  • How do BERT embeddings differ from traditional word embeddings like Word2Vec or GloVe?
  • Explain the purpose of an attention mask when feeding input to BERT.
  • Why is tokenization important in BERT, and what is WordPiece?
  • How would you extract word-level embeddings from BERT for a sentence?
  • What are the advantages of using contextual embeddings from BERT over static embeddings?
  • Can you explain the difference between using [CLS] and mean pooling for sentence representation?
  • How do you handle variable-length input sequences in BERT?
  • What is the significance of padding and how is it handled in BERT models?
  • How can BERT embeddings be used as input features in downstream tasks like sentiment analysis?