Text Representation Techniques for ML & NLP


5. Text Representation Techniques

This document outlines common techniques used to represent text data numerically for machine learning and natural language processing tasks. These methods transform unstructured text into a structured format that algorithms can process.

5.1 Bag of Words (BoW)

The Bag of Words (BoW) model is a simple yet effective way to represent text. It disregards grammar and word order, focusing solely on the occurrence of words within a document. The "bag" analogy implies that the words are collected without regard to their original sequence.

How it works:

  1. Vocabulary Creation: A vocabulary is built from all the unique words present in the entire corpus (collection of documents).
  2. Document Representation: Each document is then represented as a vector. The dimension of this vector equals the size of the vocabulary. Each element in the vector corresponds to a word in the vocabulary and its value indicates the frequency (count) of that word in the document.

Example:

Consider two simple documents:

  • Document 1: "The cat sat on the mat."
  • Document 2: "The dog sat on the rug."

The vocabulary would be: {"the", "cat", "sat", "on", "mat", "dog", "rug"} (assuming case-insensitivity and ignoring punctuation).

  • Document 1 vector: [2, 1, 1, 1, 1, 0, 0] ("the" appears twice, "cat" once, "sat" once, etc.)
  • Document 2 vector: [2, 0, 1, 1, 0, 1, 1] ("the" appears twice, "dog" once, "rug" once, etc.)
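
The two steps above can be illustrated with a minimal Python sketch (the document strings and helper names are chosen for this example; note that the vocabulary below is sorted alphabetically, so the column order differs from the listing above):

```python
import re
from collections import Counter

documents = [
    "The cat sat on the mat.",
    "The dog sat on the rug.",
]

def tokenize(text):
    # Lowercase and keep only word characters (case-insensitive, punctuation ignored)
    return re.findall(r"\w+", text.lower())

# 1. Vocabulary creation: all unique words across the corpus
vocabulary = sorted({token for doc in documents for token in tokenize(doc)})

# 2. Document representation: one count per vocabulary word
def bow_vector(doc):
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocabulary]

print(vocabulary)           # ['cat', 'dog', 'mat', 'on', 'rug', 'sat', 'the']
print(bow_vector(documents[0]))  # [1, 0, 1, 1, 0, 1, 2]
print(bow_vector(documents[1]))  # [0, 1, 0, 1, 1, 1, 2]
```

In practice a library class such as scikit-learn's CountVectorizer performs the same two steps and returns the counts as a sparse matrix.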

Pros:

  • Simple to understand and implement.
  • Effective for tasks where word frequency is important (e.g., topic modeling, document classification).

Cons:

  • Ignores word order and context, losing semantic meaning.
  • Can result in very high-dimensional vectors if the vocabulary is large (the "curse of dimensionality").
  • Doesn't account for word importance (common words like "the" can dominate).

5.2 TF-IDF (Term Frequency–Inverse Document Frequency)

TF-IDF is a numerical statistic that reflects how important a word is to a document in a corpus. It is a weighting scheme: the score increases proportionally to the number of times a word appears in the document, but is offset by how frequently the word appears across the corpus.

Formula:

$\text{TF-IDF}(t, d, D) = TF(t, d) \times IDF(t, D)$

Where:

  • $TF(t, d)$: Term Frequency – the number of times term $t$ appears in document $d$. $TF(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}$
  • $IDF(t, D)$: Inverse Document Frequency – measures how important a term is. It is the logarithm of the ratio of the total number of documents to the number of documents containing the term. $IDF(t, D) = \log\left(\frac{\text{Total number of documents in corpus } D}{\text{Number of documents containing term } t}\right)$

How it works:

  1. Term Frequency (TF): Calculates how frequently a term appears in a document.
  2. Inverse Document Frequency (IDF): Calculates how rare a term is across the entire corpus. Terms that appear in many documents have a lower IDF, while terms that appear in few documents have a higher IDF.
  3. TF-IDF Score: The TF score is multiplied by the IDF score. Words that are frequent in a specific document but rare across the corpus will have a high TF-IDF score, indicating their importance to that document.

Example:

Consider the same documents from the BoW example:

  • Document 1: "The cat sat on the mat."
  • Document 2: "The dog sat on the rug."
  • Corpus: $D$ = {Document 1, Document 2} (total of 2 documents)

Let's calculate TF-IDF for the words "cat" and "the":

  • For "cat" in Document 1:

    • $TF(\text{"cat"}, \text{Doc1}) = 1/7$ (appears once out of 7 words)
    • $IDF(\text{"cat"}, \text{Corpus}) = \log(2/1) = \log(2)$ (appears in 1 out of 2 documents)
    • $TF-IDF(\text{"cat"}, \text{Doc1}, \text{Corpus}) = (1/7) \times \log(2)$
  • For "the" in Document 1:

    • $TF(\text{"the"}, \text{Doc1}) = 2/7$ (appears twice out of 7 words)
    • $IDF(\text{"the"}, \text{Corpus}) = \log(2/2) = \log(1) = 0$ (appears in both documents)
    • $TF-IDF(\text{"the"}, \text{Doc1}, \text{Corpus}) = (2/7) \times 0 = 0$

In this simplified example, "cat" would have a higher TF-IDF score than "the", correctly highlighting its distinctiveness to Document 1.
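
The worked example can be reproduced with a short Python sketch that follows the TF and IDF formulas above directly (natural logarithm, no smoothing; library implementations such as scikit-learn's TfidfVectorizer use slightly different conventions):

```python
import math
import re

documents = [
    "The cat sat on the mat.",
    "The dog sat on the rug.",
]

def tokenize(text):
    return re.findall(r"\w+", text.lower())

tokenized = [tokenize(doc) for doc in documents]

def tf(term, doc_tokens):
    # Term frequency: count of the term divided by total terms in the document
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    # Inverse document frequency: log(total docs / docs containing the term)
    containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / containing)

def tf_idf(term, doc_tokens, corpus_tokens):
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

print(tf_idf("cat", tokenized[0], tokenized))  # (1/6) * log(2) ≈ 0.1155
print(tf_idf("the", tokenized[0], tokenized))  # (2/6) * log(1) = 0.0
```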

Pros:

  • Accounts for word importance, reducing the impact of common words.
  • Effective for information retrieval and document similarity tasks.
  • Still relatively simple to implement.

Cons:

  • Still ignores word order and semantic relationships between words.
  • Can lead to sparse matrices.

5.3 One-Hot Encoding

One-Hot Encoding converts categorical values into binary vectors that machine learning algorithms can process. In the context of text, it is often used to represent individual words or discrete features.

How it works:

  1. Vocabulary Creation: A vocabulary of all unique tokens (words or sub-word units) is created.
  2. Vector Creation: For each token, a vector is created with a length equal to the size of the vocabulary.
  3. Encoding: The vector is filled with zeros, except for a single one at the index corresponding to the token's position in the vocabulary.

Example:

Vocabulary: {"apple", "banana", "cherry"}

  • "apple" would be represented as [1, 0, 0]
  • "banana" would be represented as [0, 1, 0]
  • "cherry" would be represented as [0, 0, 1]

Usage in Text:

While direct one-hot encoding of entire documents is impractical due to vocabulary size, it's fundamental to building other representations like embeddings, or for encoding discrete features related to text (e.g., part-of-speech tags).

Pros:

  • Simple and unambiguous representation.
  • Forms the basis for more complex neural network embeddings.

Cons:

  • Extremely inefficient for representing words in a corpus due to vocabulary size (very sparse and high-dimensional).
  • Doesn't capture any semantic similarity between words (e.g., "king" and "queen" are as dissimilar as "king" and "apple").

5.4 N-Grams

N-Grams are contiguous sequences of n items from a given sample of text or speech. When applied to text, these items are typically words. N-grams capture local word order and context, unlike the Bag of Words model.

Types of N-Grams:

  • Unigrams: Single words (n=1). E.g., "the", "cat", "sat".
  • Bigrams: Sequences of two words (n=2). E.g., "the cat", "cat sat", "sat on".
  • Trigrams: Sequences of three words (n=3). E.g., "the cat sat", "cat sat on".
  • And so on for higher values of 'n'.

How it works:

  1. Tokenization: The text is tokenized into words.
  2. N-Gram Extraction: All possible contiguous sequences of 'n' words are extracted.
  3. Representation: Similar to BoW, these N-grams can be counted and used to create feature vectors. Each unique N-gram becomes a feature.

Example:

Document: "The quick brown fox jumps over the lazy dog."

  • Bigrams (n=2): "The quick", "quick brown", "brown fox", "fox jumps", "jumps over", "over the", "the lazy", "lazy dog".
  • Trigrams (n=3): "The quick brown", "quick brown fox", "brown fox jumps", "fox jumps over", "jumps over the", "over the lazy", "the lazy dog".
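
The extraction step can be sketched in Python as follows (whitespace tokenization and stripping the trailing period are simplifications for this example):

```python
def ngrams(text, n):
    # Tokenize on whitespace (punctuation handling omitted for brevity)
    tokens = text.rstrip(".").split()
    # Slide a window of size n over the token sequence
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

document = "The quick brown fox jumps over the lazy dog."
print(ngrams(document, 2))  # ['The quick', 'quick brown', ..., 'lazy dog']
print(ngrams(document, 3))  # ['The quick brown', ..., 'the lazy dog']
```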

Pros:

  • Captures some local word order and context.
  • Can be more informative than unigrams (BoW) for certain tasks.
  • Useful for language modeling and text generation.

Cons:

  • Increases vocabulary size significantly with larger 'n'.
  • Still struggles with long-range dependencies and semantic meaning.
  • Sparsity issues become more pronounced.

5.5 N-Gram Language Modeling

N-Gram Language Modeling is a statistical method used to predict the probability of the next word in a sequence, given the preceding words. It is based on the Markov assumption: the probability of a word depends only on a fixed number of preceding words (the n-gram context).

How it works:

The goal is to estimate the probability distribution of sequences of words, $P(w_1, w_2, \ldots, w_m)$. Using the chain rule of probability, this can be broken down as:

$P(w_1, w_2, \ldots, w_m) = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) \ldots P(w_m | w_1, \ldots, w_{m-1})$

The Markov assumption simplifies this by assuming that the probability of a word only depends on the previous $n-1$ words:

$P(w_i | w_1, \ldots, w_{i-1}) \approx P(w_i | w_{i-n+1}, \ldots, w_{i-1})$

This probability is then estimated from counts of n-grams in a large corpus:

$P(w_i | w_{i-n+1}, \ldots, w_{i-1}) = \frac{C(w_{i-n+1}, \ldots, w_{i-1}, w_i)}{C(w_{i-n+1}, \ldots, w_{i-1})}$

Where $C(\ldots)$ denotes the count of an n-gram.

Example:

Suppose we want to predict the next word after "the cat sat". Using a trigram model (n=3), we look at the preceding two words: "cat sat".

For each candidate next word $w$, the model estimates: $P(w | \text{"cat sat"}) = \frac{C(\text{"cat sat"}, w)}{C(\text{"cat sat"})}$

If the corpus contained "the cat sat on" twice and "the cat sat purred" once, the probabilities would be:

  • $P(\text{"on"} | \text{"cat sat"}) = \frac{C(\text{"cat sat on"})}{C(\text{"cat sat"})} = \frac{2}{3}$
  • $P(\text{"purred"} | \text{"cat sat"}) = \frac{C(\text{"cat sat purred"})}{C(\text{"cat sat"})} = \frac{1}{3}$

The model would predict "on" as the more likely next word.
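
A minimal Python sketch of the count-based estimate, using a hypothetical toy corpus chosen to match the counts in the example above:

```python
from collections import Counter

# Hypothetical toy corpus matching the example counts
corpus = [
    "the cat sat on",
    "the cat sat on",
    "the cat sat purred",
]

n = 3  # trigram model
context_counts = Counter()
ngram_counts = Counter()

for sentence in corpus:
    tokens = sentence.split()
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])  # previous n-1 words
        ngram = tuple(tokens[i:i + n])        # previous n-1 words + next word
        context_counts[context] += 1
        ngram_counts[ngram] += 1

def probability(next_word, *context):
    # P(next_word | context) = C(context, next_word) / C(context)
    return ngram_counts[(*context, next_word)] / context_counts[context]

print(probability("on", "cat", "sat"))      # 2/3 ≈ 0.667
print(probability("purred", "cat", "sat"))  # 1/3 ≈ 0.333
```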

Applications:

  • Speech recognition
  • Machine translation
  • Spelling correction
  • Text generation

Pros:

  • Captures local context and word dependencies.
  • Forms the basis for many traditional NLP tasks.

Cons:

  • Suffers from data sparsity: many valid n-grams may not appear in the training data. Smoothing techniques are needed.
  • Limited context window: cannot capture long-range dependencies.
  • Computational cost increases with 'n' and corpus size.