Chapter 12 – Text Representation
How Computers Read Words
Representing Textual Data for Machine Learning
1 · Introduction — Why Represent Text?
Every time you use a spam filter, search on Google, ask Siri a question, or chat with a chatbot, a computer is quietly reading and understanding text. But computers fundamentally operate on numbers — they cannot directly understand the letter “A” or the word “happy” the way humans do.
This is the central problem of text representation in Natural Language Processing (NLP): text must be converted into numbers that preserve as much of its meaning as possible. The quality of these numeric representations directly determines how well ML models perform on tasks like:
📧 Spam Detection
- Label emails as spam or not spam
- Gmail processes billions of emails daily
- Accuracy: >99.9% with modern methods
😊 Sentiment Analysis
- Is a movie review positive or negative?
- Used by businesses to monitor brand reputation
- Powers Amazon/Netflix recommendation systems
📰 News Categorisation
- Classify news into Sports, Politics, Business…
- Google News automatically clusters stories
- Helps personalise news feeds
🌍 Machine Translation
- Translate English → Tamil, Hindi, French…
- Google Translate handles 100+ languages
- Powered by the Transformer model (2017)
2 · Historical Context
Text representation has evolved dramatically over 70 years. Understanding this journey helps us appreciate why each new method was invented.
3 · From Characters to Words — The Challenge
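The simplest way to represent text is via ASCII codes: each character is stored as an 8-bit binary number (ASCII itself defines 7-bit codes, usually padded to one byte). For example, “A” has code 65, which is 01000001 in binary.
Character codes carry no meaning, however, and documents of different lengths produce sequences of different lengths, whereas most ML models expect a fixed-size input. The solution is to represent text at a higher level, using the vocabulary of all known words. This ensures every document maps to the same fixed-size vector.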
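As a rough illustration of the character-level view, the sketch below converts strings to 8-bit ASCII codes. The helper name `to_ascii_bits` is made up for this example. It shows that the length of the encoding grows with the text, which is why a fixed-size representation is needed.

```python
# Character-level view: each character becomes its ASCII code, written as 8 bits.
def to_ascii_bits(text):
    """Return one 8-bit binary string per character of `text`."""
    return [format(ord(ch), "08b") for ch in text]

short = to_ascii_bits("A")
longer = to_ascii_bits("happy")

print(short)        # ['01000001']
print(len(longer))  # 5 codes for a 5-letter word: length varies with the text
```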
4 · One-Hot Encoding
The simplest fixed-length representation:
- Create an exhaustive sorted list of all words — the vocabulary.
- Represent each word as a binary vector of length = vocabulary size.
- Put a 1 at the word’s position, 0 everywhere else.
Andrew = [0, 0, 1, 0, 0, 0, 0, 0, 0]
good = [0, 0, 0, 0, 1, 0, 0, 0, 0]
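The three steps can be sketched in a few lines of Python. This is a minimal version, assuming the vocabulary is built from the two example documents D₁ and D₂ used in the Bag-of-Words section and sorted case-insensitively. The helper names `tokenize` and `one_hot` are illustrative only.

```python
docs = ["Andrew is a tall boy", "Ram is a good boy. Ratna is also good."]

def tokenize(text):
    """Very simple tokenizer: drop full stops, split on whitespace."""
    return text.replace(".", "").split()

# Step 1: sorted vocabulary (case-insensitive so "Andrew" sorts after "also")
vocab = sorted({w for d in docs for w in tokenize(d)}, key=str.lower)

# Steps 2 and 3: binary vector of vocabulary length with a single 1
def one_hot(word):
    return [1 if w == word else 0 for w in vocab]

print(vocab)
print(one_hot("Andrew"))  # [0, 0, 1, 0, 0, 0, 0, 0, 0]
print(one_hot("good"))    # [0, 0, 0, 0, 1, 0, 0, 0, 0]
```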
✅ Advantages
- Simple to understand and implement
- Fixed-length — ML-ready
- Exact representation of which word it is
❌ Disadvantages
- Huge vectors (vocabulary size)
- Mostly zeros — very sparse
- “cat” is equally far from “kitten” as from “rocket”
- No semantic meaning captured
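The "equally far" point can be checked numerically. The sketch below uses a toy three-word vocabulary, chosen only for illustration, and shows that any two distinct one-hot vectors are the same Euclidean distance apart.

```python
import math

# Toy vocabulary, chosen only for illustration.
vocab = ["cat", "kitten", "rocket"]

def one_hot(word):
    return [1 if w == word else 0 for w in vocab]

d_kitten = math.dist(one_hot("cat"), one_hot("kitten"))
d_rocket = math.dist(one_hot("cat"), one_hot("rocket"))

# Both distances are sqrt(2): one-hot vectors cannot express that
# "kitten" is closer in meaning to "cat" than "rocket" is.
print(d_kitten, d_rocket)
```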
5 · Bag-of-Words (BoW)
To represent a document (not just a word), Bag-of-Words simply counts how many times each vocabulary word appears. Word order is completely ignored — hence the word “bag”.
Worked Example
D₁ “Andrew is a tall boy”
D₂ “Ram is a good boy. Ratna is also good.”
| Word | D₁ (count) | D₂ (count) |
|---|---|---|
| a | 1 | 1 |
| also | 0 | 1 |
| Andrew | 1 | 0 |
| boy | 1 | 1 |
| good | 0 | 2 |
| is | 1 | 2 |
| Ram | 0 | 1 |
| Ratna | 0 | 1 |
| tall | 1 | 0 |
Equivalently, a document’s BoW vector is the sum of the one-hot vectors e(w) of its words:
“Andrew is a tall boy” → e(Andrew) + e(is) + e(a) + e(tall) + e(boy)
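The counts in the table can be reproduced with a short sketch. It assumes the nine-word vocabulary from the table and a simple whitespace tokenizer. The function name `bag_of_words` is illustrative.

```python
from collections import Counter

vocab = ["a", "also", "Andrew", "boy", "good", "is", "Ram", "Ratna", "tall"]

def bag_of_words(text, vocab):
    """Count how often each vocabulary word occurs in `text`."""
    counts = Counter(text.replace(".", "").split())
    return [counts[w] for w in vocab]

d1 = bag_of_words("Andrew is a tall boy", vocab)
d2 = bag_of_words("Ram is a good boy. Ratna is also good.", vocab)

print(d1)  # [1, 0, 1, 1, 0, 1, 0, 0, 1]
print(d2)  # [1, 1, 0, 1, 2, 2, 1, 1, 0]

# Word order is ignored: a shuffled sentence gives the same vector.
print(bag_of_words("boy tall a is Andrew", vocab) == d1)  # True
```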
Application: Presidential Speech Classifier
In the slides, US presidential speeches were represented as BoW histograms. GWB’s speech featured words like “Iraq”, “Terrorists”, “Freedom”; JFK’s featured “Soviet”, “Cuba”, “Missile”; FDR’s featured “Japanese”, “Germany”. A classifier can correctly identify the president just from these word-count patterns!
6 · TF-IDF — Weighting Words Smartly
TF-IDF combines two ideas:
📊 TF — Term Frequency
- How often does word T appear in document D?
- More occurrences → higher relevance to this document
- Normalised by document length to be fair
📉 IDF — Inverse Document Frequency
- How rare is word T across all documents?
- Words like “the”, “is” appear everywhere → IDF ≈ 0
- Rare words like “tsunami” → high IDF
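Written as formulas, one common variant of the two ideas is shown here. It is the variant consistent with the worked example in this section (length-normalised TF, natural-log IDF without smoothing); other texts and libraries use slightly different definitions. Here N is the number of documents and df(t) is the number of documents containing term t.

```latex
\mathrm{tf}(t, d) = \frac{\text{number of times } t \text{ occurs in } d}{\text{total number of words in } d}
\qquad
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)}
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d)\cdot\mathrm{idf}(t)
```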
Worked Example
| Word | IDF (2 docs) | TF in D₂ | TF-IDF in D₂ |
|---|---|---|---|
| a | ln(2/2) = 0 | 0.11 | 0.00 |
| is | ln(2/2) = 0 | 0.22 | 0.00 |
| boy | ln(2/2) = 0 | 0.11 | 0.00 |
| also | ln(2/1) = 0.69 | 0.11 | 0.08 |
| Ram | ln(2/1) = 0.69 | 0.11 | 0.08 |
| Ratna | ln(2/1) = 0.69 | 0.11 | 0.08 |
| good | ln(2/1) = 0.69 | 0.22 | 0.15 ← highest! |
Words that appear in both documents (“is”, “a”, “boy”) get IDF = 0, so their TF-IDF is 0. “is” and “a” are stop words, which are often removed before analysis; “boy” scores zero only because this corpus has just two documents and both contain it. “good” appears twice in D₂ and nowhere else, giving it the highest score.
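The table can be recomputed by hand with a few lines of Python. This sketch assumes the definitions used in the table: TF is the count divided by document length and IDF is ln(N / document frequency), with no smoothing. The function names are illustrative.

```python
import math

docs = {
    "D1": "Andrew is a tall boy".split(),
    "D2": "Ram is a good boy Ratna is also good".split(),
}

def tf(term, tokens):
    """Term frequency, normalised by document length."""
    return tokens.count(term) / len(tokens)

def idf(term, docs):
    """Unsmoothed IDF; assumes the term occurs in at least one document."""
    df = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(len(docs) / df)

def tf_idf(term, doc_id, docs):
    return tf(term, docs[doc_id]) * idf(term, docs)

for term in ["a", "is", "boy", "also", "Ram", "Ratna", "good"]:
    print(f"{term:6s} {tf_idf(term, 'D2', docs):.2f}")
```

The printed scores match the last column of the table, with “good” highest at 0.15.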
Stop Words
Common English stop words: the, is, a, of, and, in, to, it, that.
Removing them reduces vocabulary size and often improves classification accuracy.
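A minimal stop-word filter might look like the sketch below. It uses only the nine stop words listed above; real stop-word lists are much longer.

```python
# Only the stop words listed in this section; real lists are much longer.
STOP_WORDS = {"the", "is", "a", "of", "and", "in", "to", "it", "that"}

def remove_stop_words(text):
    """Keep only the words that are not in STOP_WORDS (case-insensitive)."""
    return [w for w in text.split() if w.lower() not in STOP_WORDS]

print(remove_stop_words("Andrew is a tall boy"))  # ['Andrew', 'tall', 'boy']
```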
7 · Word2Vec — Adding Meaning
Words that appear in similar contexts tend to have similar meanings. Example: “The ___ sat on the mat” — both “cat” and “dog” fit, so they must be related.
How Word2Vec Works
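Word2Vec (Mikolov et al., Google, 2013) trains a small neural network on a proxy task. In the CBOW variant the task is “Given the surrounding words, predict the middle word”; the skip-gram variant reverses it and predicts the surrounding words from the middle word.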
Words in similar positions across many sentences learn similar internal representations.
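To make the proxy task concrete, the sketch below generates the (context, target) training pairs that a "predict the middle word" setup would see. It covers only the data preparation step, not the neural network, and the function name `cbow_pairs` is made up for this example.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for each position."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs

pairs = cbow_pairs("the cat sat on the mat".split(), window=2)
print(pairs[2])  # (['the', 'cat', 'on', 'the'], 'sat')
```

Replacing "cat" with "dog" in the sentence would leave most context lists unchanged, which is why the two words end up with similar vectors.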
The weights of the hidden layer become the word vectors — dense, 100–300 dimensional, and semantically meaningful. Key properties:
Vector Arithmetic — The “King − Man + Woman = Queen” Magic
Other Remarkable Properties
| Analogy | Vector Equation |
|---|---|
| Capital cities | Paris − France + Italy ≈ Rome |
| Verb tenses | walked − walk + swim ≈ swam |
| Comparatives | bigger − big + cold ≈ colder |
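The arithmetic behind these analogies can be sketched with hand-made vectors. The two-dimensional vectors below are invented for illustration (roughly "royalty" and "gender" axes) and are not trained embeddings; real Word2Vec vectors have 100–300 dimensions and the result is only approximately equal to the target word.

```python
import math

# Invented 2-D toy vectors, NOT trained embeddings.
vectors = {
    "king":  [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man":   [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c):
    """Return the word whose vector is closest to a - b + c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman"))  # queen
```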
8 · Modern Embeddings & Transformers
Word2Vec assigns each word a single vector, whatever the context. But “bank” means different things in “river bank” vs “savings bank”!
ELMo (2018) — Embeddings from Language Models
ELMo generates a different vector for each word depending on its context. It reads the entire sentence and produces context-aware representations.
BERT (2018, Google) — Bidirectional Transformers
BERT reads the whole sentence at once (not left-to-right like a human). It uses the Transformer architecture with “attention” — the model learns which words to pay attention to when encoding each word. BERT improved the state-of-the-art on 11 NLP tasks at once when released.
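The idea of attention can be sketched in a heavily simplified form. The code below is not BERT: it assumes a single attention head, uses the input vectors directly as queries, keys and values (real Transformers apply learned projections), and the two-dimensional word vectors are invented. It only shows how the encoding of "bank" changes with its neighbour.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with queries = keys = values = X."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)  # how much this word attends to each word
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

# Invented toy vectors, for illustration only.
river, savings, bank = [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]

bank_in_river = self_attention([river, bank])[1]
bank_in_savings = self_attention([savings, bank])[1]
print(bank_in_river)    # [0.75, 0.25]
print(bank_in_savings)  # [0.25, 0.75]
```

The same input vector for "bank" yields two different outputs, each pulled toward its neighbour.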
GPT / Claude / Gemini (2020–Present)
These Large Language Models have billions of parameters and are trained on enormous text corpora. They can write essays, answer questions, generate code, and translate languages. All of this is built on rich contextual text representations evolved from the methods described in this guide.
Each step adds more nuance, meaning, and power — but also more complexity and compute.
9 · Performance Metrics
How do we know if our text representations are good? We use two broad categories of evaluation:
Intrinsic Metrics — Evaluating the Vectors Themselves
📐 Cosine Similarity
Measures the angle between two vectors.
- +1 → identical direction (very similar)
- 0 → perpendicular (unrelated)
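- −1 → opposite direction (antonyms in theory; in trained embeddings antonyms often still score high because they share contexts)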
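Cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal pure-Python sketch, with example vectors chosen only to hit the three cases above:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (both must be non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1, 2], [2, 4]))    # close to +1: same direction
print(cosine_similarity([1, 0], [0, 1]))    # 0.0: perpendicular
print(cosine_similarity([1, 2], [-1, -2]))  # close to -1: opposite direction
```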
🧪 Word Analogy Tasks
- Benchmark: Google Analogy Dataset (19,544 questions)
- Task: king − man + woman = ?
- Score = % of analogies answered correctly
- Word2Vec achieves ~65%; BERT ~80%+
🔗 Word Similarity
- Benchmark: WordSim353, SimLex-999
- Human judges rate word pairs on a 0–10 scale
- Compare model cosine similarity with human ratings
- Measure: Spearman correlation coefficient
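The comparison can be sketched as follows. The ratings and similarity values are invented for illustration and do not come from WordSim353 or SimLex-999. The rank formula used here is the simple version that assumes no tied values.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation, no-ties formula."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Invented scores for four hypothetical word pairs.
human = [9.0, 7.5, 3.0, 1.0]      # human ratings
model = [0.80, 0.85, 0.30, 0.10]  # model cosine similarities

print(spearman(human, model))  # 0.8
```

Only the ordering matters: the model swaps the top two pairs relative to the human judges, which lowers the score from 1.0 to 0.8.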
🔍 Nearest Neighbours
- Find the top-10 closest words to a query word
- Qualitative check: do results make sense?
- E.g., nearest to “France”: Austria, Belgium, Germany…
Extrinsic Metrics — Evaluating on Real Tasks
| Task | Metric | What it measures |
|---|---|---|
| Spam classification | Accuracy, F1-score | Fraction of emails correctly classified |
| Sentiment analysis | Accuracy | Correct positive/negative labels |
| Machine translation | BLEU score (0–1) | Similarity to human reference translations |
| Language modelling | Perplexity | How surprised the model is by new text (lower = better) |
| Named entity recognition | F1-score | Precision & recall for identifying names, dates, places |
Classification Metrics — Quick Reference
Precision
Of all emails we called spam, how many really were?
Recall
Of all real spam emails, how many did we catch?
F1-Score
Harmonic mean of precision and recall. Balances both.
Accuracy
Overall fraction of correct predictions. Misleading for imbalanced datasets.
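These four metrics can be computed from the counts of true/false positives and negatives. The sketch below uses an invented, deliberately imbalanced example (1 spam email in 100) to show why accuracy alone can mislead. The function name is illustrative.

```python
def classification_metrics(y_true, y_pred, positive="spam"):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Invented imbalanced data: a "classifier" that never predicts spam.
y_true = ["spam"] + ["ham"] * 99
y_pred = ["ham"] * 100

print(classification_metrics(y_true, y_pred))
# accuracy is 0.99, yet recall is 0.0: the one spam email was missed
```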
10 · Summary & Big Picture
| Method | Vector Size | Meaning? | Order? | Best For |
|---|---|---|---|---|
| One-Hot | Vocabulary (huge) | ❌ | ❌ | Baseline experiments |
| Bag-of-Words | Vocabulary (huge) | ❌ | ❌ | Spam filter, topic classification |
| TF-IDF | Vocabulary (huge) | Partial ✅ | ❌ | Search engines, document retrieval |
| Word2Vec | 100–300 | ✅ | ❌ | Semantic similarity, analogies |
| BERT / GPT | 768–4096 | ✅✅ | ✅✅ | Question answering, translation, chatbots |
- Text must be converted to numbers before machines can process it.
- One-Hot gives fixed-length vectors but no meaning.
- Bag-of-Words counts word occurrences — works well for many tasks!
- TF-IDF weights words by importance: words that are frequent in one document but rare across the corpus score highest.
- Word2Vec learns dense, meaningful vectors from context.
- Transformers (BERT, GPT) produce context-sensitive representations — the state of the art.
- Every AI assistant you use today is built on these representations!
Try It Yourself (Python)
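# Bag-of-Words and TF-IDF with scikit-learn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["Andrew is a tall boy",
        "Ram is a good boy Ratna is also good"]

# The default tokenizer drops one-letter words such as "a";
# this token_pattern keeps them so the vocabulary matches the worked example.
# Both vectorizers also lowercase the text by default.
pattern = r"(?u)\b\w+\b"

# BoW: one row per document, one column per vocabulary word
bow = CountVectorizer(token_pattern=pattern)
counts = bow.fit_transform(docs)
print(bow.get_feature_names_out())
print(counts.toarray())

# TF-IDF: scikit-learn smooths the IDF and L2-normalises each row,
# so the values differ from the hand-computed table above.
tfidf = TfidfVectorizer(token_pattern=pattern)
print(tfidf.fit_transform(docs).toarray())

# Word2Vec with gensim (vector_size is the parameter name in gensim >= 4.0)
from gensim.models import Word2Vec

sentences = [d.lower().split() for d in docs]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)
# With only two sentences the neighbours are essentially random;
# meaningful vectors need a large corpus.
print(model.wv.most_similar("good"))  # nearest neighbours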