Chapter 12 – Text Representation

Representing Textual Data — An Introductory Guide

How Computers Read Words

1 · Introduction — Why Represent Text?

Every time you use a spam filter, search on Google, ask Siri a question, or chat with a chatbot, a computer is quietly reading and understanding text. But computers fundamentally operate on numbers — they cannot directly understand the letter “A” or the word “happy” the way humans do.

🎯 Core Challenge

How do we convert words and sentences into numbers so that a machine learning model can process them — while preserving meaning?

This is the central problem of text representation in Natural Language Processing (NLP). The quality of these numeric representations directly determines how well ML models perform on tasks like:

📧 Spam Detection

Label emails as spam or not spam
Gmail processes billions of emails daily
Google reports that Gmail blocks more than 99.9% of spam

😊 Sentiment Analysis

Is a movie review positive or negative?
Used by businesses to monitor brand reputation
Review sentiment can feed recommendation systems such as those of Amazon and Netflix

📰 News Categorisation

Classify news into Sports, Politics, Business…
Google News automatically clusters stories
Helps personalise news feeds

🌍 Machine Translation

Translate English → Tamil, Hindi, French…
Google Translate handles 100+ languages
Powered by the Transformer model (2017)

2 · Historical Context

Text representation has evolved dramatically over 70 years. Understanding this journey helps us appreciate why each new method was invented.

1950 — The Turing Test

Alan Turing proposes a test where a machine must hold a human-like conversation. This inspired decades of research into making computers understand language.

1960s–1980s — Rule-Based Systems

Linguists hand-crafted thousands of grammar rules. Systems like ELIZA (1966) could mimic conversation but had no real understanding. This approach was fragile and didn’t scale to real-world language.

1972 — TF-IDF Invented

British computer scientist Karen Spärck Jones introduced inverse document frequency (IDF): weighting terms by their rarity across documents. Combined with term frequency, it became TF-IDF, a breakthrough for information retrieval (search engines).

1990s — Statistical NLP

Researchers moved from hand-written rules to learning patterns from data. Bag-of-Words models powered early spam filters and document classifiers.

2003 — Neural Language Models

Yoshua Bengio (later a Turing Award winner) showed that neural networks could learn distributed word representations — the precursor to Word2Vec.

2013 — Word2Vec

Tomas Mikolov and team at Google released Word2Vec — a fast, scalable method to learn dense, meaningful word vectors from billions of words. The famous king − man + woman ≈ queen result astonished the NLP world.

2017 — Transformers

The paper “Attention Is All You Need” introduced the Transformer architecture, enabling models to understand long-range dependencies in text. This became the foundation for BERT, GPT, and modern LLMs.

2020–Present — Large Language Models

GPT-3, ChatGPT, Gemini, Claude and others have billions of parameters and can write essays, answer questions, and generate code — all built on the rich text representations we explore in this guide.

3 · From Characters to Words — The Challenge

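The simplest way to represent text is character by character with ASCII codes: each character maps to a number (a 7-bit code, usually stored in one 8-bit byte). For example, “A” is 65 (01000001), so the word “cat” becomes the three codes 99, 97, 116, while “kitten” needs six.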

⚠ The Variable-Length Problem

Most standard machine learning algorithms require every input to have the same number of features. A sequence of character codes is as long as the word or sentence it encodes, so ASCII gives variable-length vectors that cannot be fed directly into such models.

The solution is to represent text at a higher level — using the vocabulary of all known words. This ensures every document maps to the same fixed-size vector.
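To make the variable-length problem concrete, here is a minimal sketch using Python's built-in `ord`. The helper name `ascii_codes` is ours, chosen for illustration only.

```python
def ascii_codes(word: str) -> list[int]:
    """Return the ASCII code of each character in `word`."""
    return [ord(ch) for ch in word]

for word in ["cat", "kitten", "a"]:
    codes = ascii_codes(word)
    bits = [format(code, "08b") for code in codes]  # 8-bit binary strings
    print(f"{word!r}: {len(codes)} features -> {codes} {bits}")
```

Each word produces a different number of features, which is exactly what a fixed-size model input cannot accept.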

4 · One-Hot Encoding

The simplest fixed-length representation:

Create an exhaustive sorted list of all words — the vocabulary.
Represent each word as a binary vector of length = vocabulary size.
Put a 1 at the word’s position, 0 everywhere else.

📌 Example

Vocabulary (sorted): {a, also, Andrew, boy, good, is, Ram, Ratna, tall} → 9 words
Andrew = [0, 0, 1, 0, 0, 0, 0, 0, 0]
good = [0, 0, 0, 0, 1, 0, 0, 0, 0]
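The three steps above can be sketched in a few lines of Python. This is an illustrative sketch, not a library API: `VOCAB` and `one_hot` are names we made up, and the vocabulary is the nine-word example list in alphabetical order.

```python
# Example vocabulary, sorted alphabetically (ignoring case)
VOCAB = ["a", "also", "Andrew", "boy", "good", "is", "Ram", "Ratna", "tall"]

def one_hot(word, vocab=VOCAB):
    """Return a vector with a 1 at the word's vocabulary position, 0 elsewhere."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1  # raises ValueError for out-of-vocabulary words
    return vec

print("Andrew =", one_hot("Andrew"))
print("good   =", one_hot("good"))
```

Note that a word missing from the vocabulary has no representation at all, which is one practical reason real systems reserve a slot for unknown words.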

✅ Advantages

Simple to understand and implement
Fixed-length — ML-ready
Exact representation of which word it is

❌ Disadvantages

Huge vectors (vocabulary size)
Mostly zeros — very sparse
“cat” is equally far from “kitten” as from “rocket”
No semantic meaning captured
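The "equally far" disadvantage is easy to check numerically. The sketch below assumes a toy three-word vocabulary and uses Euclidean distance; any two different one-hot vectors differ in exactly two positions, so the distance is always the same.

```python
import math

words = ["cat", "kitten", "rocket"]
# One-hot vector for each word over this three-word toy vocabulary
vectors = {
    word: [1 if j == i else 0 for j in range(len(words))]
    for i, word in enumerate(words)
}

d_kitten = math.dist(vectors["cat"], vectors["kitten"])
d_rocket = math.dist(vectors["cat"], vectors["rocket"])
print(d_kitten, d_rocket)  # both are sqrt(2)
```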

5 · Bag-of-Words (BoW)

To represent a document (not just a word), Bag-of-Words simply counts how many times each vocabulary word appears. Word order is completely ignored — hence the word “bag”.

🎒 Analogy

Imagine tearing apart a sentence, throwing all the words into a bag, shaking it up, then counting each word type. The result is a BoW vector.

Worked Example

D₁ “Andrew is a tall boy”
D₂ “Ram is a good boy. Ratna is also good.”

Word	D₁ (count)	D₂ (count)
a	1	1
also	0	1
Andrew	1	0
boy	1	1
good	0	2
is	1	2
Ram	0	1
Ratna	0	1
tall	1	0

💡 Mathematical Insight

The BoW vector for a document = sum of the one-hot vectors of all its words.
“Andrew is a tall boy” → e(Andrew) + e(is) + e(a) + e(tall) + e(boy)
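The "sum of one-hot vectors" view can be written out directly. This is a minimal sketch under simple assumptions: tokens are split on whitespace, full stops are dropped, and the helper names `one_hot` and `bag_of_words` are ours.

```python
VOCAB = ["a", "also", "Andrew", "boy", "good", "is", "Ram", "Ratna", "tall"]

def one_hot(word):
    """One-hot vector for a word over VOCAB."""
    return [1 if entry == word else 0 for entry in VOCAB]

def bag_of_words(text):
    """Add up the one-hot vectors of every token in the text."""
    total = [0] * len(VOCAB)
    for token in text.replace(".", "").split():
        total = [t + bit for t, bit in zip(total, one_hot(token))]
    return total

d1 = bag_of_words("Andrew is a tall boy")
d2 = bag_of_words("Ram is a good boy. Ratna is also good.")
print(d1)
print(d2)
```

The two printed vectors match the D₁ and D₂ columns of the table, read top to bottom.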

Application: Presidential Speech Classifier

In the slides, US presidential speeches were represented as BoW histograms. George W. Bush’s speech featured words like “Iraq”, “Terrorists”, “Freedom”; John F. Kennedy’s featured “Soviet”, “Cuba”, “Missile”; Franklin D. Roosevelt’s featured “Japanese”, “Germany”. A classifier can identify the president from these word-count patterns alone!

⚠ Limitation

BoW ignores word order: “Dog bites man” and “Man bites dog” get identical BoW vectors, yet have opposite meanings!
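A quick way to see this limitation is to count words with `collections.Counter` from the standard library. The sketch assumes lowercasing and whitespace splitting; `bow_counts` is an illustrative name.

```python
from collections import Counter

def bow_counts(text):
    """Count each lowercased word; word order is discarded."""
    return Counter(text.lower().split())

same = bow_counts("Dog bites man") == bow_counts("Man bites dog")
print(same)
```

Both sentences reduce to the same three counts, so any model that sees only these counts cannot tell them apart.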

6 · TF-IDF — Weighting Words Smartly

📜 History

The IDF half of TF-IDF was introduced in 1972 by Karen Spärck Jones, a British computer scientist widely considered a founder of information retrieval. Her idea was radical: not all words should count equally. A word that appears in every document tells you almost nothing; a word that appears in only a few documents is highly informative.

TF-IDF combines two ideas:

📊 TF — Term Frequency

How often does word T appear in document D?
More occurrences → higher relevance to this document
Normalised by document length to be fair

TF(T, D) = (# of T in D) / (# words in D)

📉 IDF — Inverse Document Frequency

How rare is word T across all documents?
Words like “the”, “is” appear everywhere → IDF ≈ 0
Rare words like “tsunami” → high IDF

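IDF(T) = log( N / df(T) )

where N is the total number of documents and df(T) is the number of documents that contain T. The worked example uses the natural logarithm (ln).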

TF-IDF(T, D) = TF(T, D) × IDF(T)

Worked Example

Word	IDF (2 docs)	TF in D₂	TF-IDF in D₂
a	ln(2/2) = 0	0.11	0.00
is	ln(2/2) = 0	0.22	0.00
boy	ln(2/2) = 0	0.11	0.00
also	ln(2/1) = 0.69	0.11	0.08
Ram	ln(2/1) = 0.69	0.11	0.08
Ratna	ln(2/1) = 0.69	0.11	0.08
good	ln(2/1) = 0.69	0.22	0.15 ← highest!

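Words like “is”, “a”, “boy” appear in both documents → IDF = 0 → TF-IDF = 0. “is” and “a” are typical stop words, which are often removed before analysis; “boy” scores zero only because this tiny two-document corpus happens to contain it everywhere. “good” appears twice in D₂ and not at all in D₁, giving it the highest score.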
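The table can be reproduced by coding the three formulas exactly as written. This is a sketch under the same assumptions as the worked example: natural logarithm, no smoothing, whitespace tokens with full stops removed. The function names are ours.

```python
import math

docs = {
    "D1": "Andrew is a tall boy",
    "D2": "Ram is a good boy. Ratna is also good.",
}
tokens = {name: text.replace(".", "").split() for name, text in docs.items()}

def tf(term, doc_tokens):
    """Term frequency: occurrences of term divided by document length."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    """Inverse document frequency: ln(N / df). Fails if no document has the term."""
    df = sum(1 for doc_tokens in corpus.values() if term in doc_tokens)
    return math.log(len(corpus) / df)

def tf_idf(term, doc_name):
    return tf(term, tokens[doc_name]) * idf(term, tokens)

for term in ["a", "is", "boy", "also", "Ram", "Ratna", "good"]:
    print(f"{term:6s} IDF={idf(term, tokens):.2f} "
          f"TF={tf(term, tokens['D2']):.2f} TF-IDF={tf_idf(term, 'D2'):.2f}")
```

Library implementations usually add smoothing and normalisation on top of this, so their numbers will not match the hand calculation exactly.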

Stop Words

Common English stop words: the, is, a, of, and, in, to, it, that. Removing them reduces vocabulary size and often, though not always, helps classification accuracy.
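Stop-word removal is usually a simple filter applied before counting. The sketch below uses only the nine stop words listed above (real lists are much longer), and `remove_stop_words` is an illustrative name.

```python
STOP_WORDS = {"the", "is", "a", "of", "and", "in", "to", "it", "that"}

def remove_stop_words(text):
    """Lowercase, drop full stops, split on whitespace, and filter stop words."""
    words = text.lower().replace(".", "").split()
    return [word for word in words if word not in STOP_WORDS]

filtered = remove_stop_words("Ram is a good boy. Ratna is also good.")
print(filtered)
```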

7 · Word2Vec — Adding Meaning

🤔 The Meaning Gap

In one-hot encoding, the distance between “cat” and “kitten” equals the distance between “cat” and “rocket”. But humans know these pairs are wildly different! We need representations where similar words are close in vector space.

📜 Linguistic Insight — John Firth (1957)

“You shall know a word by the company it keeps.”
Words that appear in similar contexts tend to have similar meanings. Example: in “The ___ sat on the mat”, both “cat” and “dog” fit the blank, which suggests they are related.

How Word2Vec Works

Word2Vec (Mikolov et al., Google, 2013) trains a small neural network on a proxy task: “Given the surrounding words, predict the middle word.”

🔍 Example Proxy Task (CBOW model)

Input (context): “The”, “cat”, “on”, “the” → Predict: “sat”
Words in similar positions across many sentences learn similar internal representations.
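To see what the network is actually trained on, the sketch below builds (context, target) pairs for a CBOW-style proxy task. It assumes a window of two words on each side; `cbow_pairs` is a hypothetical helper, not part of any library.

```python
def cbow_pairs(tokens, window=2):
    """Return (context words, target word) pairs for every position."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs

sentence = "The cat sat on the mat".split()
pairs = cbow_pairs(sentence)
for context, target in pairs:
    print(context, "->", target)
```

The third pair is the example from the text: the context “The”, “cat”, “on”, “the” with the target “sat”. Words near the sentence edges simply get a shorter context.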

After training, the weights between the input and the hidden layer become the word vectors: dense, typically 100–300 dimensional, and semantically meaningful. Key properties:

Vector Arithmetic — The “King − Man + Woman = Queen” Magic

Other Remarkable Properties

Analogy	Vector Equation
Capital cities	Paris − France + Italy ≈ Rome
Verb tenses	walked − walk + swim ≈ swam
Comparatives	bigger − big + cold ≈ colder
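Analogy arithmetic can be tried on hand-made vectors before using a trained model. The two-dimensional vectors below are invented for illustration (first axis loosely "royalty", second loosely "gender"); they are not real Word2Vec output, and `analogy` is our own helper.

```python
import math

# Hypothetical toy embeddings, NOT learned from data
toy = {
    "king":  (0.9, 0.8),
    "queen": (0.9, -0.8),
    "man":   (0.1, 0.8),
    "woman": (0.1, -0.8),
    "apple": (-0.7, 0.0),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c, vectors):
    """Return the word closest to a - b + c, excluding the three input words."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = {w: v for w, v in vectors.items() if w not in {a, b, c}}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

result = analogy("king", "man", "woman", toy)
print(result)
```

Excluding the input words from the candidates mirrors how analogy benchmarks are usually scored; without that step the nearest vector is often one of the inputs.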

8 · Modern Embeddings & Transformers

⚠ Word2Vec’s Limitation

Each word gets one single fixed vector, regardless of context.
But “bank” means different things in “river bank” vs “savings bank”!

ELMo (2018) — Embeddings from Language Models

ELMo generates a different vector for each word depending on its context. It reads the entire sentence and produces context-aware representations.

BERT (2018, Google) — Bidirectional Transformers

BERT reads the whole sentence at once (not left-to-right like a human). It uses the Transformer architecture with “attention” — the model learns which words to pay attention to when encoding each word. BERT improved the state-of-the-art on 11 NLP tasks at once when released.
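The "which words to pay attention to" idea can be sketched with scaled dot-product self-attention, the core operation of the Transformer. This is a single-head toy version with random, untrained weights and made-up sizes (4 tokens, 8 dimensions); BERT itself stacks many such heads and layers with learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over token vectors X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # how strongly each token attends to each other token
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                  # 4 tokens, 8-dimensional input vectors
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
context, weights = self_attention(X, Wq, Wk, Wv)
print(weights.round(2))
```

Each output row is a weighted mix of all tokens in the sentence, which is why the same word can receive a different vector in a different sentence.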

GPT / Claude / Gemini (2020–Present)

These Large Language Models have billions of parameters and are trained on enormous text corpora. They can write essays, answer questions, generate code, and translate languages. All of this is built on rich contextual text representations evolved from the methods described in this guide.

🚀 The Evolution

ASCII → One-Hot → BoW → TF-IDF → Word2Vec → ELMo → BERT → GPT/Claude
Each step adds more nuance, meaning, and power — but also more complexity and compute.

9 · Performance Metrics

How do we know if our text representations are good? We use two broad categories of evaluation:

Intrinsic Metrics — Evaluating the Vectors Themselves

📐 Cosine Similarity

Measures the angle between two vectors.

cos(θ) = (u · v) / (|u| × |v|)

+1 → identical direction (very similar)
0 → perpendicular (unrelated)
−1 → opposite direction (in practice antonyms rarely score −1; they share contexts, so their vectors are often similar)
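The formula translates directly into code. The sketch below uses plain Python lists and made-up two-dimensional vectors to hit the three reference values.

```python
import math

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (|u| * |v|); undefined for zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same = cosine_similarity([1, 2], [2, 4])      # same direction
perp = cosine_similarity([1, 0], [0, 1])      # perpendicular
opp = cosine_similarity([1, 2], [-1, -2])     # opposite direction
print(same, perp, opp)
```

Because the vectors are divided by their lengths, only direction matters: [1, 2] and [2, 4] count as identical.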

🧪 Word Analogy Tasks

Benchmark: Google Analogy Dataset (19,544 questions)
Task: king − man + woman = ?
Score = % of analogies answered correctly
Quoted figures: Word2Vec ~65%; BERT ~80%+ (scores vary widely with training corpus and evaluation setup)

🔗 Word Similarity

Benchmark: WordSim353, SimLex-999
Human judges rate the similarity of word pairs on a 0–10 scale
Compare model cosine similarity with human ratings
Measure: Spearman correlation coefficient
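Spearman correlation compares rankings rather than raw values, so it does not matter that human ratings and cosine similarities use different scales. The sketch below assumes no tied values and uses four invented word-pair scores; neither list comes from a real benchmark.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid without ties."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

human_scores = [9.5, 8.0, 3.0, 1.0]       # hypothetical human ratings
model_cosines = [0.80, 0.85, 0.30, 0.10]  # hypothetical model similarities
rho = spearman(human_scores, model_cosines)
print(rho)
```

Here the model swaps the order of the top two pairs, so the correlation is high but not perfect.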

🔍 Nearest Neighbours

Find the top-10 closest words to a query word
Qualitative check: do results make sense?
E.g., nearest to “France”: Austria, Belgium, Germany…

Extrinsic Metrics — Evaluating on Real Tasks

Task	Metric	What it measures
Spam classification	Accuracy, F1-score	Fraction of emails correctly classified
Sentiment analysis	Accuracy	Correct positive/negative labels
Machine translation	BLEU score (0–1)	Similarity to human reference translations
Language modelling	Perplexity	How surprised the model is by new text (lower = better)
Named entity recognition	F1-score	Precision & recall for identifying names, dates, places

Classification Metrics — Quick Reference

Precision

TP / (TP + FP)

Of all emails we called spam, how many really were?

Recall

TP / (TP + FN)

Of all real spam emails, how many did we catch?

F1-Score

2 × P × R / (P + R)

Harmonic mean of precision and recall. Balances both.

Accuracy

(TP + TN) / Total

Overall fraction of correct predictions. Misleading for imbalanced datasets.
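The four formulas fit in one small function. The counts below describe a hypothetical inbox of 1,000 emails with only 10 spam, chosen to show why accuracy misleads on imbalanced data; `classification_metrics` is an illustrative name.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# A "lazy" filter that never flags anything as spam
lazy = classification_metrics(tp=0, fp=0, fn=10, tn=990)
# A filter that catches 8 of the 10 spam emails and raises 2 false alarms
better = classification_metrics(tp=8, fp=2, fn=2, tn=988)
print(lazy)
print(better)
```

The lazy filter reaches 99% accuracy while catching no spam at all, which is why F1 is the more informative number here.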

10 · Summary & Big Picture

Method	Vector Size	Meaning?	Order?	Best For
One-Hot	Vocabulary (huge)	❌	❌	Baseline experiments
Bag-of-Words	Vocabulary (huge)	❌	❌	Spam filter, topic classification
TF-IDF	Vocabulary (huge)	Partial ✅	❌	Search engines, document retrieval
Word2Vec	100–300	✅	❌	Semantic similarity, analogies
BERT / GPT	768–4096	✅✅	✅✅	Question answering, translation, chatbots

🎓 Key Takeaways for Students

Text must be converted to numbers before machines can process it.
One-Hot gives fixed-length vectors but no meaning.
Bag-of-Words counts word occurrences — works well for many tasks!
TF-IDF weights words by importance: frequent in this document but rare across the corpus wins.
Word2Vec learns dense, meaningful vectors from context.
Transformers (BERT, GPT) produce context-sensitive representations — the state of the art.
Every AI assistant you use today is built on these representations!

Try It Yourself (Python)

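# Bag-of-Words and TF-IDF with scikit-learn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["Andrew is a tall boy",
        "Ram is a good boy Ratna is also good"]

# The default tokenizer drops one-letter tokens such as "a";
# this pattern keeps them so the vocabulary matches the worked example.
keep_short = r"(?u)\b\w+\b"

# BoW: rows are documents, columns follow the lowercased vocabulary
bow = CountVectorizer(token_pattern=keep_short)
counts = bow.fit_transform(docs)
print(bow.get_feature_names_out())
print(counts.toarray())

# TF-IDF: scikit-learn smooths the IDF and L2-normalises each row,
# so the numbers differ from the hand-computed table
tfidf = TfidfVectorizer(token_pattern=keep_short)
print(tfidf.fit_transform(docs).toarray())

# Word2Vec with gensim (gensim 4+ names the size parameter vector_size)
from gensim.models import Word2Vec
sentences = [d.lower().split() for d in docs]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)
# Two sentences are far too few for meaningful neighbours; this only shows the API
print(model.wv.most_similar("good"))   # nearest neighbours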


iHub-Data