Hands-On BERT: Embeddings, Architecture & Fine-Tuning

Explore BERT's architecture, extract embeddings, and fine-tune for downstream tasks using Hugging Face Transformers. Master practical NLP applications.

Chapter 3: Getting Hands-On with BERT

This chapter will guide you through practical applications of BERT, focusing on extracting embeddings, understanding its architecture, and fine-tuning it for various downstream tasks. We will leverage the powerful Hugging Face transformers library to streamline these processes.

Exploring the Pre-Trained BERT Model

BERT (Bidirectional Encoder Representations from Transformers) is a groundbreaking language representation model. Its key innovation lies in its bidirectional training, allowing it to learn context from both the left and right of a word simultaneously.

Importing Dependencies

Before we begin, ensure you have the necessary libraries installed. The primary library we'll use is Hugging Face's transformers.

# Example of importing core components
from transformers import BertTokenizer, BertModel
import torch

Loading Model and Dataset

To work with BERT, we first need to load a pre-trained model and its corresponding tokenizer. For this chapter, we'll often use the bert-base-uncased model as a starting point. Datasets will be introduced as needed for specific tasks.

# Load pre-trained BERT model and tokenizer
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

Preprocessing Input Data

BERT requires specific input formatting, typically involving tokenization, adding special tokens ([CLS] for classification, [SEP] for sentence separation), and creating attention masks.
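
As a minimal sketch, the tokenizer loaded earlier performs all of these steps in one call; the example sentence and the max_length value are illustrative choices rather than values prescribed by the chapter.

# Tokenize a sample sentence; special tokens and the attention mask are added automatically
sentence = "I love Paris"
inputs = tokenizer(sentence, padding='max_length', max_length=10, return_tensors='pt')

print(inputs['input_ids'])       # starts with [CLS] (id 101) and includes [SEP] (id 102)
print(inputs['attention_mask'])  # 1 for real tokens, 0 for padding positions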

Preprocessing Input for QA

For Question Answering tasks, the input is usually formatted as a pair of sequences: the question and the context paragraph.
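
A small sketch of how this pairing looks with the tokenizer loaded earlier; the question and context are made-up examples.

# Encode a (question, context) pair; the tokenizer builds [CLS] question [SEP] context [SEP]
question = "What is the capital of France?"
context = "Paris is the capital and most populous city of France."
qa_inputs = tokenizer(question, context, return_tensors='pt')

print(tokenizer.decode(qa_inputs['input_ids'][0]))
print(qa_inputs['token_type_ids'])  # 0 marks question tokens, 1 marks context tokens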

Preprocessing for All Layers Extraction

When extracting embeddings from all layers, ensure your input is correctly tokenized and batched to match the model's expected input shape.
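
For example, padding and truncation let sentences of different lengths share one tensor shape; the sentences below are illustrative.

# Batch several sentences so they share a single input shape
sentences = ["The weather is lovely today.", "BERT produces contextualized embeddings for every token."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

print(batch['input_ids'].shape)       # (batch_size, sequence_length)
print(batch['attention_mask'].shape)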

Generating BERT Embeddings

BERT can generate contextualized word embeddings. These embeddings capture the meaning of a word based on its surrounding context.
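
One way to see this is to compare the vector for the same word in two different sentences, using the final-layer output (last_hidden_state) that the next subsections describe in more detail. This is a rough sketch; the sentences, and the expectation of a similarity well below 1.0, are illustrative rather than taken from the chapter.

# Compare the embedding of 'bank' in two different contexts
s1 = tokenizer("He sat on the river bank", return_tensors='pt')
s2 = tokenizer("She deposited money at the bank", return_tensors='pt')

with torch.no_grad():
    h1 = model(**s1).last_hidden_state
    h2 = model(**s2).last_hidden_state

# Locate the position of the token 'bank' in each encoded sentence
bank_id = tokenizer.convert_tokens_to_ids('bank')
i1 = (s1['input_ids'][0] == bank_id).nonzero()[0].item()
i2 = (s2['input_ids'][0] == bank_id).nonzero()[0].item()

# Cosine similarity is typically well below 1.0 because the contexts differ
print(torch.cosine_similarity(h1[0, i1], h2[0, i2], dim=0))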

Getting Embeddings from BERT

After passing input through the BERT model, the returned output object exposes the final hidden state for every input token; the hidden states of every encoder layer can also be requested, as shown later in this section.

Extracting Embeddings from Pre-Trained BERT

The BertModel object's output includes last_hidden_state, which provides embeddings from the final encoder layer for each token.
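
A short sketch using the model and tokenizer loaded earlier; the sentence is illustrative.

# Run a sentence through BERT and inspect the final-layer token embeddings
inputs = tokenizer("I love Paris", return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)  # (1, sequence_length, 768) for bert-base-uncased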

Extracting Embeddings from All Encoder Layers

To access embeddings from all layers, you can utilize the output_hidden_states=True argument when calling the model.

# Example of extracting embeddings from all layers
inputs = tokenizer("Hello, world!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Tuple of 13 tensors for bert-base: the embedding output plus one per encoder layer,
# each of shape (batch_size, sequence_length, hidden_size)
all_layer_hidden_states = outputs.hidden_states

Fine-Tuning BERT for Downstream Tasks

BERT's power truly shines when fine-tuned on specific tasks. This involves adding a task-specific layer on top of the pre-trained BERT model and training the entire network on a labeled dataset.
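
As a rough sketch, the transformers library ships ready-made task heads; BertForSequenceClassification, for instance, places a classification layer on top of BERT's pooled output. The two-label setup below is an illustrative assumption.

# Load BERT with a freshly initialized classification head (the head's weights are untrained)
from transformers import BertForSequenceClassification

clf_model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)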

Fine-Tuning BERT for Sentiment Analysis

Sentiment analysis is a common text classification task. We'll adapt BERT to predict the sentiment (e.g., positive, negative) of a given text.
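
Continuing the sketch above, inference with such a classifier would look roughly like this; the review text and the label mapping are assumptions for illustration, and the predictions only become meaningful after fine-tuning.

# Predict a sentiment class for a single review
review = "This movie was absolutely wonderful!"
enc = tokenizer(review, return_tensors='pt')

with torch.no_grad():
    logits = clf_model(**enc).logits

predicted_class = logits.argmax(dim=-1).item()  # e.g. 0 = negative, 1 = positive (assumed mapping)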

Text Classification with BERT

General text classification involves using the [CLS] token's embedding as the input to a classification layer.
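
A minimal illustration of that mechanism, built directly on the base BertModel; the linear layer and label count are illustrative choices, not the chapter's prescribed head.

# Feed the [CLS] embedding (position 0) into a simple classification layer
import torch.nn as nn

enc = tokenizer("The food was great", return_tensors='pt')
with torch.no_grad():
    cls_embedding = model(**enc).last_hidden_state[:, 0, :]  # embedding of the [CLS] token

classifier = nn.Linear(768, 2)   # 768 = hidden size of bert-base; 2 labels is an illustrative choice
logits = classifier(cls_embedding)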

Named Entity Recognition

NER involves identifying and classifying named entities (like persons, organizations, locations) in text. This typically requires token-level classification.
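
A sketch of a token-level head using BertForTokenClassification; the nine-label scheme (a common CoNLL-style tag set) and the example sentence are assumptions.

# Token classification head: one label prediction per token
from transformers import BertForTokenClassification

ner_model = BertForTokenClassification.from_pretrained('bert-base-uncased', num_labels=9)

enc = tokenizer("Hugging Face is based in New York City", return_tensors='pt')
with torch.no_grad():
    token_logits = ner_model(**enc).logits   # shape: (1, sequence_length, num_labels)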

Natural Language Inference

NLI tasks determine the relationship (entailment, contradiction, neutral) between two sentences. This is often framed as a sequence pair classification problem.
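
As a sketch, the premise and hypothesis are encoded as one sequence pair and fed to a three-way head, reusing the BertForSequenceClassification class imported above. The sentences are made up, and the head here is freshly initialized, so its scores only become meaningful after fine-tuning on an NLI dataset.

# Sequence-pair classification with a 3-way head (entailment / contradiction / neutral)
nli_model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)

premise = "A man is playing a guitar on stage."
hypothesis = "A man is performing music."
pair = tokenizer(premise, hypothesis, return_tensors='pt')

with torch.no_grad():
    nli_logits = nli_model(**pair).logits   # one score per relationship class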

Question-Answering with BERT

BERT excels at question-answering tasks, where it can predict the start and end tokens of an answer within a given passage.

Performing Question-Answering Tasks

This involves feeding the question and context into BERT and then using the model's outputs to predict answer spans.
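
A sketch with BertForQuestionAnswering; note that the QA head on plain bert-base-uncased is randomly initialized, so a checkpoint fine-tuned on SQuAD would be needed for sensible answers. The question and context are illustrative.

# The QA head produces one start score and one end score per token
from transformers import BertForQuestionAnswering

qa_model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')

question = "What is the capital of France?"
context = "Paris is the capital and most populous city of France."
qa_inputs = tokenizer(question, context, return_tensors='pt')

with torch.no_grad():
    qa_outputs = qa_model(**qa_inputs)

start_logits = qa_outputs.start_logits   # shape: (1, sequence_length)
end_logits = qa_outputs.end_logits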

Getting the Answer

Once the start and end logits are predicted, post-processing is required to extract the actual answer text from the context.
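
Continuing the sketch above, a simple greedy way to recover the answer span looks like this; real post-processing would also check that the end position does not precede the start.

# Pick the most likely start and end positions and decode the tokens in between
start_index = start_logits.argmax(dim=-1).item()
end_index = end_logits.argmax(dim=-1).item()

answer_ids = qa_inputs['input_ids'][0][start_index:end_index + 1]
answer = tokenizer.decode(answer_ids, skip_special_tokens=True)
print(answer)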

Training the Model

Fine-tuning involves setting up an optimizer, loss function, and training loop to update BERT's weights based on the task-specific data.
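
A skeletal version of such a loop, reusing the clf_model and tokenizer from earlier; the two-example dataset, learning rate, and epoch count are purely illustrative.

# Minimal fine-tuning loop on a tiny made-up sentiment dataset
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset

texts = ["I loved this film", "Terrible, a complete waste of time"]
labels = torch.tensor([1, 0])
encodings = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

dataset = TensorDataset(encodings['input_ids'], encodings['attention_mask'], labels)
loader = DataLoader(dataset, batch_size=2, shuffle=True)

optimizer = AdamW(clf_model.parameters(), lr=2e-5)
clf_model.train()

for epoch in range(3):
    for input_ids, attention_mask, batch_labels in loader:
        optimizer.zero_grad()
        outputs = clf_model(input_ids=input_ids, attention_mask=attention_mask, labels=batch_labels)
        outputs.loss.backward()   # the model computes cross-entropy loss internally when labels are passed
        optimizer.step()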

Using Hugging Face Transformers

The Hugging Face transformers library provides pre-built classes and functions for tokenizers, models, and training, greatly simplifying the process of working with BERT.
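
For instance, the pipeline helper wraps tokenization, the model, and post-processing into a single call; the library downloads a default checkpoint for the chosen task, and the example sentence is illustrative.

# High-level pipeline API for quick experimentation
from transformers import pipeline

sentiment = pipeline('sentiment-analysis')
print(sentiment("BERT makes transfer learning in NLP remarkably easy."))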

Summary, Questions, and Further Reading

This chapter provided a hands-on introduction to BERT. You learned how to load a pre-trained model, preprocess input data, and extract embeddings, as well as the fundamental concepts behind fine-tuning BERT for various downstream NLP tasks.

Questions:

  • What is the significance of the [CLS] token in BERT?
  • How does bidirectional training differ from traditional language models?
  • What are the benefits of fine-tuning a pre-trained BERT model?

Further Reading: