6. Text Embedding Techniques

This document explores various techniques for text embedding, a crucial step in Natural Language Processing (NLP) that transforms text into numerical representations suitable for machine learning models.

6.1 Document Embedding

Document embedding aims to represent an entire document as a single dense vector. This allows us to capture the overall meaning and context of a document, enabling tasks like document similarity, classification, and clustering.
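
To make this concrete, one simple (if crude) way to build a document vector is to average the vectors of the words it contains and then compare documents with cosine similarity. The sketch below is a toy illustration only: the 3-dimensional word vectors are made up, and plain NumPy stands in for a real embedding model.

```python
# A toy sketch of document embedding: average per-word vectors into one
# document vector, then compare documents by cosine similarity.
# The 3-dimensional word vectors below are made up purely for illustration.
import numpy as np

word_vectors = {
    "cats":   np.array([0.9, 0.1, 0.0]),
    "dogs":   np.array([0.8, 0.2, 0.0]),
    "purr":   np.array([0.7, 0.0, 0.1]),
    "bark":   np.array([0.6, 0.1, 0.2]),
    "stocks": np.array([0.0, 0.9, 0.3]),
    "rose":   np.array([0.1, 0.8, 0.4]),
}

def embed_document(tokens):
    """Represent a document as the mean of its word vectors."""
    return np.mean([word_vectors[t] for t in tokens if t in word_vectors], axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc_a = embed_document(["cats", "purr"])
doc_b = embed_document(["dogs", "bark"])
doc_c = embed_document(["stocks", "rose"])

# The two pet-related documents end up closer to each other than to the finance one.
print(cosine(doc_a, doc_b))   # high similarity
print(cosine(doc_a, doc_c))   # lower similarity
```

In practice the word vectors would come from one of the pre-trained sources described in the next subsection, and the simple average is often replaced by a dedicated document model such as Doc2Vec.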

6.1.1 Pre-Trained Embeddings

Pre-trained embeddings are vector representations of words or documents that have been learned from massive text corpora. These embeddings capture general semantic relationships and can be leveraged directly or fine-tuned for specific downstream tasks.

  • Advantages:

    • Capture rich semantic and syntactic information.
    • Reduce the need for large, task-specific training datasets.
    • Provide a good starting point for various NLP tasks.
  • Common Sources:

    • Word2Vec: Learns word embeddings either by predicting the surrounding words from a target word or by predicting the target word from its surrounding context.
    • GloVe (Global Vectors for Word Representation): Combines global matrix factorization with local context window methods.
    • FastText: Extends Word2Vec by considering character n-grams, making it effective for out-of-vocabulary words and morphologically rich languages.
    • BERT (Bidirectional Encoder Representations from Transformers): A transformer-based model that generates context-aware word embeddings, capturing nuances of word meaning based on its surrounding words.
    • Doc2Vec (Paragraph Vectors): An extension of Word2Vec that learns a dedicated vector for each paragraph or document by adding a unique paragraph ID that is trained jointly with the word vectors (a minimal training sketch follows this list).
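
For the Doc2Vec entry above, here is a minimal training sketch. It assumes the gensim library; the tiny corpus, tags, and hyperparameters are placeholders chosen purely for illustration.

```python
# A minimal Doc2Vec (Paragraph Vectors) sketch, assuming gensim is installed.
# The corpus is a toy stand-in; real use needs far more documents and tuning.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    "machine learning models need numerical input",
    "text embeddings turn documents into dense vectors",
    "the weather today is sunny and warm",
]

# Each document gets a unique tag so the model learns one vector per document.
tagged = [TaggedDocument(words=doc.split(), tags=[str(i)]) for i, doc in enumerate(docs)]

model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=100)

# Infer a vector for an unseen document and find the closest training document.
vector = model.infer_vector("embeddings represent text as vectors".split())
print(model.dv.most_similar([vector], topn=1))
```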
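
More generally, pre-trained word vectors can be loaded and used directly without any training of your own. The sketch below does this with gensim's downloader and the publicly available "glove-wiki-gigaword-100" vectors; both the library and the model name are illustrative choices rather than the only option.

```python
# A minimal sketch of using pre-trained embeddings directly, assuming gensim.
# "glove-wiki-gigaword-100" = 100-dimensional GloVe vectors (Wikipedia + Gigaword);
# the first call downloads the file, so an internet connection is required.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")

# Each word in the vocabulary maps to a dense 100-dimensional vector.
print(vectors["computer"].shape)            # (100,)

# Nearest neighbours in the vector space reflect semantic similarity.
print(vectors.most_similar("computer", topn=3))
```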

6.1.2 Word Embeddings

Word embeddings represent individual words as dense numerical vectors in a multi-dimensional space. Words with similar meanings are expected to be closer to each other in this vector space.

  • Word2Vec:

    • Skip-gram: Predicts surrounding words given a target word. This model is generally better for infrequent words.
    • Continuous Bag-of-Words (CBOW): Predicts the target word given its surrounding context words. This model is generally better for frequent words.

    Example (Conceptual): If you have trained Word2Vec on a large corpus, you might observe relationships like: vector("king") - vector("man") + vector("woman") ≈ vector("queen"). A gensim training sketch appears after this list.

  • GloVe:

    • Relies on word co-occurrence statistics from a corpus. It aims to capture ratios of word-word co-occurrence probabilities.
  • FastText:

    • Represents words as a bag of character n-grams. This allows it to generate embeddings for words not seen during training (out-of-vocabulary words) by summing the embeddings of their constituent n-grams.

    Example (Conceptual): With n = 3, the word "apple" is padded as "<apple>" and represented by the character n-grams {"<ap", "app", "ppl", "ple", "le>"} together with the full word "<apple>"; its vector is the sum of these n-gram vectors (see the FastText sketch after this list).

  • Contextual Embeddings (e.g., BERT, ELMo, GPT):

    • Unlike static word embeddings, contextual embeddings generate different vector representations for the same word depending on its context within a sentence. This is a significant advancement because it captures polysemy (words with multiple meanings); a Transformers-based sketch appears at the end of this section.

    Example: The word "bank" would have different embeddings in the sentences:

    1. "I went to the bank to deposit money."
    2. "The river bank was eroded."