Chapter 12 – Text Representation

Representing Textual Data — An Introductory Guide

How Computers Read Words

Representing Textual Data for Machine Learning

1 · Introduction — Why Represent Text?

Every time you use a spam filter, search on Google, ask Siri a question, or chat with a chatbot, a computer is quietly reading and understanding text. But computers fundamentally operate on numbers — they cannot directly understand the letter “A” or the word “happy” the way humans do.

🎯 Core Challenge
How do we convert words and sentences into numbers so that a machine learning model can process them — while preserving meaning?

This is the central problem of text representation in Natural Language Processing (NLP). The quality of these numeric representations directly determines how well ML models perform on tasks like:

📧 Spam Detection

  • Label emails as spam or not spam
  • Gmail processes billions of emails daily
  • Accuracy: >99.9% with modern methods

😊 Sentiment Analysis

  • Is a movie review positive or negative?
  • Used by businesses to monitor brand reputation
  • Can feed into recommendation systems such as those at Amazon and Netflix

📰 News Categorisation

  • Classify news into Sports, Politics, Business…
  • Google News automatically clusters stories
  • Helps personalise news feeds

🌍 Machine Translation

  • Translate English → Tamil, Hindi, French…
  • Google Translate handles 100+ languages
  • Powered by the Transformer model (2017)

2 · Historical Context

Text representation has evolved dramatically over 70 years. Understanding this journey helps us appreciate why each new method was invented.

1950 — The Turing Test
Alan Turing proposed a test in which a machine must hold a human-like conversation. This inspired decades of research into making computers understand language.
1960s–1980s — Rule-Based Systems
Linguists hand-crafted thousands of grammar rules. Systems like ELIZA (1966) could mimic conversation but had no real understanding. This approach was fragile and didn’t scale to real-world language.
1972 — TF-IDF Invented
British computer scientist Karen Spärck Jones introduced the idea of weighting terms by their rarity across documents — a breakthrough for information retrieval (search engines).
1990s — Statistical NLP
Researchers moved from hand-written rules to learning patterns from data. Bag-of-Words models powered early spam filters and document classifiers.
2003 — Neural Language Models
Yoshua Bengio (later a Turing Award winner) showed that neural networks could learn distributed word representations — the precursor to Word2Vec.
2013 — Word2Vec
Tomas Mikolov and team at Google released Word2Vec — a fast, scalable method to learn dense, meaningful word vectors from billions of words. The famous king − man + woman ≈ queen result astonished the NLP world.
2017 — Transformers
The paper “Attention Is All You Need” introduced the Transformer architecture, enabling models to understand long-range dependencies in text. This became the foundation for BERT, GPT, and modern LLMs.
2020–Present — Large Language Models
GPT-3, ChatGPT, Gemini, Claude and others have billions of parameters and can write essays, answer questions, and generate code — all built on the rich text representations we explore in this guide.

3 · From Characters to Words — The Challenge

The simplest way to represent text is via ASCII codes: each character gets a 7-bit number, usually stored in one 8-bit byte. For example:

cat       →  c=99  a=97  t=116        ←  3 codes
carriage  →  c  a  r  r  i  a  g  e   ←  8 codes
⚠ Variable length: ML models need FIXED-LENGTH inputs!
⚠ The Variable-Length Problem
Most standard machine learning algorithms require every input to have the same number of features. ASCII codes give vectors whose length varies with the word, so we cannot feed them directly into such models.
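The variable-length problem can be seen directly in Python, where the built-in `ord` returns a character's code point (the same as its ASCII code for plain English letters). This is only a small sketch; the helper name `char_codes` is an illustrative choice and does not come from the slides.

```python
def char_codes(word):
    """Return the character codes of a word (ASCII for plain English letters)."""
    return [ord(ch) for ch in word]

for word in ["cat", "carriage"]:
    codes = char_codes(word)
    # The two lists have different lengths, so they cannot share one feature layout.
    print(word, codes, "length:", len(codes))
```

The two words produce lists of length 3 and 8, which is exactly why a vocabulary-based, fixed-size representation is needed.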

The solution is to represent text at a higher level — using the vocabulary of all known words. This ensures every document maps to the same fixed-size vector.

4 · One-Hot Encoding

The simplest fixed-length representation:

  1. Create an exhaustive sorted list of all words — the vocabulary.
  2. Represent each word as a binary vector of length = vocabulary size.
  3. Put a 1 at the word’s position, 0 everywhere else.
📌 Example
Vocabulary (sorted): {a, also, Andrew, boy, good, is, Ram, Ratna, tall}  →  9 words
Andrew = [0, 0, 1, 0, 0, 0, 0, 0, 0]   good = [0, 0, 0, 0, 1, 0, 0, 0, 0]

Word      Andrew   good   tall
a         0        0      0
also      0        0      0
Andrew    1        0      0
boy       0        0      0
good      0        1      0
is        0        0      0
Ram       0        0      0
Ratna     0        0      0
tall      0        0      1

Each column is the one-hot vector of the word in the header. Vector size = vocabulary size. For English: typically 50,000 – 500,000!

✅ Advantages

  • Simple to understand and implement
  • Fixed-length — ML-ready
  • Exact representation of which word it is

❌ Disadvantages

  • Huge vectors (vocabulary size)
  • Mostly zeros — very sparse
  • “cat” is equally far from “kitten” as from “rocket”
  • No semantic meaning captured
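The three steps of one-hot encoding can be sketched in a few lines of Python. The case-insensitive alphabetical ordering and the helper names `one_hot` and `squared_distance` are assumptions made for illustration, not part of the original slides.

```python
# Step 1: a sorted vocabulary (case-insensitive alphabetical order is assumed here).
vocab = sorted(["a", "also", "Andrew", "boy", "good", "is", "Ram", "Ratna", "tall"],
               key=str.lower)
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Steps 2 and 3: a vector of vocabulary length with a single 1."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

print(one_hot("Andrew"))  # [0, 0, 1, 0, 0, 0, 0, 0, 0]
print(one_hot("good"))    # [0, 0, 0, 0, 1, 0, 0, 0, 0]

# Every pair of distinct words is the same distance apart.
print(squared_distance(one_hot("good"), one_hot("tall")))
print(squared_distance(one_hot("good"), one_hot("Ram")))
```

Both distances come out equal, which illustrates the last two disadvantages: the encoding says which word it is, but nothing about how words relate.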

5 · Bag-of-Words (BoW)

To represent a document (not just a word), Bag-of-Words simply counts how many times each vocabulary word appears. Word order is completely ignored — hence the word “bag”.

🎒 Analogy
Imagine tearing apart a sentence, throwing all the words into a bag, shaking it up, then counting each word type. The result is a BoW vector.

Worked Example

D₁ “Andrew is a tall boy”
D₂ “Ram is a good boy. Ratna is also good.”

Word      D₁ (count)   D₂ (count)
a         1            1
also      0            1
Andrew    1            0
boy       1            1
good      0            2
is        1            2
Ram       0            1
Ratna     0            1
tall      1            0
💡 Mathematical Insight
The BoW vector for a document = sum of the one-hot vectors of all its words.
“Andrew is a tall boy” → e(Andrew) + e(is) + e(a) + e(tall) + e(boy)
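The identity "BoW vector = sum of one-hot vectors" can be checked on the two example documents. This is a minimal sketch: the regular-expression tokenizer and the case-insensitive vocabulary ordering are assumptions chosen to match the worked example, and the function names are illustrative.

```python
import re

docs = {"D1": "Andrew is a tall boy",
        "D2": "Ram is a good boy. Ratna is also good."}

def tokenize(text):
    """Split text into alphabetic tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z]+", text)

# Vocabulary: every distinct word, in case-insensitive alphabetical order.
vocab = sorted({w for text in docs.values() for w in tokenize(text)}, key=str.lower)
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def bag_of_words(text):
    """Add up the one-hot vectors of all tokens in the text."""
    total = [0] * len(vocab)
    for word in tokenize(text):
        total = [t + e for t, e in zip(total, one_hot(word))]
    return total

print(vocab)
for name, text in docs.items():
    print(name, bag_of_words(text))
```

The printed vectors should match the count columns of the worked example, read top to bottom.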

Application: Presidential Speech Classifier

In the slides, US presidential speeches were represented as BoW histograms. George W. Bush’s speech featured words like “Iraq”, “Terrorists”, “Freedom”; John F. Kennedy’s featured “Soviet”, “Cuba”, “Missile”; Franklin D. Roosevelt’s featured “Japanese”, “Germany”. A classifier can often identify the president from these word-count patterns alone!

⚠ Limitation
BoW ignores word order: “Dog bites man” and “Man bites dog” get identical BoW vectors, yet have opposite meanings!
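The word-order limitation is easy to confirm with Python's `collections.Counter`. Lowercasing and whitespace splitting are simplifying assumptions of this sketch, and `bow_counts` is an illustrative name.

```python
from collections import Counter

def bow_counts(sentence):
    """Count lowercase, whitespace-separated words, ignoring their order."""
    return Counter(sentence.lower().split())

first = bow_counts("Dog bites man")
second = bow_counts("Man bites dog")

print(first)
print(first == second)  # True: the two sentences are indistinguishable as bags of words
```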

6 · TF-IDF — Weighting Words Smartly

📜 History
The idea was introduced in 1972 by Karen Spärck Jones, a British computer scientist widely regarded as a founder of information retrieval. Her insight was radical: not all words should count equally. A word that appears in every document tells you almost nothing; a word that appears in only a few documents is highly informative.

TF-IDF combines two ideas:

📊 TF — Term Frequency

  • How often does word T appear in document D?
  • More occurrences → higher relevance to this document
  • Normalised by document length to be fair
TF(T, D) = (# of T in D) / (# words in D)

📉 IDF — Inverse Document Frequency

  • How rare is word T across all documents?
  • Words like “the”, “is” appear everywhere → IDF ≈ 0
  • Rare words like “tsunami” → high IDF
IDF(T) = ln( N / df(T) ), where N = number of documents and df(T) = number of documents containing T
TF-IDF(T, D) = TF(T, D)  ×  IDF(T)

Worked Example

Word     IDF (2 docs)      TF in D₂   TF-IDF in D₂
a        ln(2/2) = 0       0.11       0.00
is       ln(2/2) = 0       0.22       0.00
boy      ln(2/2) = 0       0.11       0.00
also     ln(2/1) = 0.69    0.11       0.08
Ram      ln(2/1) = 0.69    0.11       0.08
Ratna    ln(2/1) = 0.69    0.11       0.08
good     ln(2/1) = 0.69    0.22       0.15  ← highest!

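Words like “is”, “a”, “boy” appear in both documents → IDF = 0 → TF-IDF = 0. Of these, “is” and “a” are typical stop words, which are often removed before analysis; “boy” scores zero only because it happens to occur in both documents of this tiny corpus. “good” appears twice in D₂ and occurs only there, giving it the highest score.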
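The TF-IDF values for D₂ can be reproduced with the standard library alone. This sketch assumes the natural logarithm, a simple alphabetic tokenizer, and the two example documents as the whole corpus; the function names `tf`, `idf` and `tf_idf` are illustrative.

```python
import math
import re

docs = ["Andrew is a tall boy",
        "Ram is a good boy. Ratna is also good."]
tokens = [re.findall(r"[A-Za-z]+", d) for d in docs]

def tf(term, doc_tokens):
    """Term frequency: occurrences of term divided by document length."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, all_tokens):
    """Inverse document frequency: ln(N / number of documents containing term)."""
    df = sum(1 for doc in all_tokens if term in doc)
    return math.log(len(all_tokens) / df)

def tf_idf(term, doc_tokens, all_tokens):
    return tf(term, doc_tokens) * idf(term, all_tokens)

d2 = tokens[1]
for term in ["a", "is", "boy", "also", "Ram", "Ratna", "good"]:
    print(f"{term:6s} idf={idf(term, tokens):.2f} "
          f"tf={tf(term, d2):.2f} tf-idf={tf_idf(term, d2, tokens):.2f}")
```

Library implementations often use a smoothed IDF and normalise each document vector, so their numbers will generally differ from this hand calculation.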

Stop Words

Common English stop words: the, is, a, of, and, in, to, it, that. Removing them reduces vocabulary size and often improves classification accuracy.

7 · Word2Vec — Adding Meaning

🤔 The Meaning Gap
In one-hot encoding, the distance between “cat” and “kitten” equals the distance between “cat” and “rocket”. But humans know these pairs are wildly different! We need representations where similar words are close in vector space.
📜 Linguistic Insight — John Firth (1957)
“You shall know a word by the company it keeps.”
Words that appear in similar contexts tend to have similar meanings. Example: “The ___ sat on the mat” — both “cat” and “dog” fit, so they must be related.

How Word2Vec Works

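Word2Vec (Mikolov et al., Google, 2013) trains a small neural network on a proxy task. In the CBOW (Continuous Bag-of-Words) variant the task is: “Given the surrounding words, predict the middle word.” The Skip-gram variant reverses this and predicts the surrounding words from the middle word.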

🔍 Example Proxy Task (CBOW model)
Input (context): “The”, “cat”, “on”, “the”  →  Predict: “sat”
Words in similar positions across many sentences learn similar internal representations.
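The training examples for the CBOW proxy task can be generated with a sliding window. This sketch only builds the (context, target) pairs and does not train a network; the window size of 2 and the function name `cbow_pairs` are assumptions for illustration.

```python
def cbow_pairs(tokens, window=2):
    """Return (context words, target word) pairs for every position in the sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]       # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]      # up to `window` words after the target
        pairs.append((left + right, target))
    return pairs

sentence = "The cat sat on the mat".split()
for context, target in cbow_pairs(sentence):
    print(context, "->", target)
```

One of the printed pairs is the context “The”, “cat”, “on”, “the” with target “sat”, matching the example above. Words near the sentence edges simply get shorter contexts.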

The weights between the input layer and the hidden layer become the word vectors: dense, typically 100–300 dimensional, and semantically meaningful. Key properties:

Vector Arithmetic — The “King − Man + Woman = Queen” Magic

[Figure: king, man, queen and woman plotted against Gender and Status axes, with arrows labelled “royalty” and “female”.]
king − man + woman ≈ queen (vector arithmetic works!)

Other Remarkable Properties

Analogy          Vector Equation
Capital cities   Paris − France + Italy ≈ Rome
Verb tenses      walked − walk + swim ≈ swam
Comparatives     bigger − big + cold ≈ colder
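The analogy arithmetic can be illustrated with hand-made toy vectors. Everything in this sketch is hypothetical: the three dimensions (royalty, femaleness, person) and their values were picked by hand so that the analogy holds exactly, whereas real Word2Vec vectors are learned from data and satisfy such relations only approximately.

```python
import math

# Hypothetical 3-D vectors: [royalty, femaleness, person].
vectors = {
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "man":   [0.0, 0.0, 1.0],
    "woman": [0.0, 1.0, 1.0],
    "child": [0.0, 0.5, 1.0],
}

def cosine(u, v):
    """Cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c):
    """Return the word closest to a - b + c, excluding the three query words."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman"))  # queen
```

Excluding the query words from the candidates is a common convention in analogy evaluation; without it, the nearest vector is often one of the inputs.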

8 · Modern Embeddings & Transformers

⚠ Word2Vec’s Limitation
Each word gets one single fixed vector, regardless of context.
But “bank” means different things in “river bank” vs “savings bank”!

ELMo (2018) — Embeddings from Language Models

ELMo generates a different vector for each word depending on its context. It reads the entire sentence and produces context-aware representations.

BERT (2018, Google) — Bidirectional Transformers

BERT reads the whole sentence at once (not left-to-right like a human). It uses the Transformer architecture with “attention” — the model learns which words to pay attention to when encoding each word. BERT improved the state-of-the-art on 11 NLP tasks at once when released.

GPT / Claude / Gemini (2020–Present)

These Large Language Models have billions of parameters and are trained on enormous text corpora. They can write essays, answer questions, generate code, and translate languages. All of this is built on rich contextual text representations evolved from the methods described in this guide.

🚀 The Evolution
ASCII → One-Hot → BoW → TF-IDF → Word2Vec → ELMo → BERT → GPT/Claude
Each step adds more nuance, meaning, and power — but also more complexity and compute.

9 · Performance Metrics

How do we know if our text representations are good? We use two broad categories of evaluation:

Intrinsic Metrics — Evaluating the Vectors Themselves

📐 Cosine Similarity

Measures the angle between two vectors.

cos(θ) = (u · v) / (|u| × |v|)
  • +1 → identical direction (very similar)
  • 0 → perpendicular (unrelated)
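  • −1 → opposite direction (maximally dissimilar; note that antonyms such as “hot” and “cold” often score high in practice because they occur in similar contexts)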
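The formula translates directly into code. The sketch below uses plain Python lists and small made-up 2-D vectors to show the three reference cases; `cosine_similarity` is an illustrative name.

```python
import math

def cosine_similarity(u, v):
    """(u · v) / (|u| × |v|) for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1, 2], [2, 4]))    # same direction: close to +1
print(cosine_similarity([1, 0], [0, 1]))    # perpendicular: 0
print(cosine_similarity([1, 2], [-1, -2]))  # opposite direction: close to -1
```

Because the measure depends only on direction, the vectors [1, 2] and [2, 4] count as identical even though their lengths differ.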

🧪 Word Analogy Tasks

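  • Benchmark: Google Analogy Dataset (19,544 questions)
  • Task: king − man + woman = ?
  • Score = % of analogies answered correctly
  • Word2Vec reaches roughly 65%, depending on training corpus and settings; the ~80%+ quoted for BERT is indicative only, since contextual models are not normally scored on this static-vector benchmark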

🔗 Word Similarity

  • Benchmark: WordSim353, SimLex-999
  • Human judges rate word pairs on a 0–10 scale
  • Compare model cosine similarity with human ratings
  • Measure: Spearman correlation coefficient
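The comparison with human ratings can be sketched as follows. The four ratings and model scores are invented purely for illustration (they are not taken from WordSim353 or SimLex-999), and the simple rank formula used here assumes there are no tied values.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(x, y):
    """Spearman rank correlation: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical data for four word pairs.
human_ratings = [9.0, 7.5, 3.0, 1.0]   # average human similarity judgements
model_scores = [0.8, 0.4, 0.6, 0.1]    # cosine similarities from some model

print(spearman(human_ratings, model_scores))
```

Only the ordering matters: the model is rewarded for ranking pairs the way humans do, not for matching the rating scale.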

🔍 Nearest Neighbours

  • Find the top-10 closest words to a query word
  • Qualitative check: do results make sense?
  • E.g., nearest to “France”: Austria, Belgium, Germany…

Extrinsic Metrics — Evaluating on Real Tasks

Task                        Metric               What it measures
Spam classification         Accuracy, F1-score   Fraction of emails correctly classified
Sentiment analysis          Accuracy             Correct positive/negative labels
Machine translation         BLEU score (0–1)     Similarity to human reference translations
Language modelling          Perplexity           How surprised the model is by new text (lower = better)
Named entity recognition    F1-score             Precision & recall for identifying names, dates, places

Classification Metrics — Quick Reference

Precision

TP / (TP + FP)

Of all emails we called spam, how many really were?

Recall

TP / (TP + FN)

Of all real spam emails, how many did we catch?

F1-Score

2 × P × R / (P + R)

Harmonic mean of precision and recall. Balances both.

Accuracy

(TP + TN) / Total

Overall fraction of correct predictions. Misleading for imbalanced datasets.
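The four formulas can be computed together from a confusion matrix. The counts below are hypothetical and chosen to be imbalanced (50 spam emails out of 1,000) so that the gap between accuracy and recall is visible; `classification_metrics` is an illustrative name.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical spam-filter results: 1,000 emails, 50 of them spam.
metrics = classification_metrics(tp=30, fp=10, fn=20, tn=940)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts accuracy is 0.97 while recall is only 0.60, which is the sense in which accuracy can mislead on imbalanced data.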

10 · Summary & Big Picture

[Timeline figure]
ASCII (character codes): 1960s
One-Hot (fixed-length word): 1980s
Bag-of-Words (count words): 1990s
TF-IDF (weight words): 1972/2000s
Word2Vec (dense semantics): 2013
BERT / GPT (contextual embeddings): 2017–now

Method         Vector Size          Meaning?     Order?   Best For
One-Hot        Vocabulary (huge)    ❌           ❌       Baseline experiments
Bag-of-Words   Vocabulary (huge)    ❌           ❌       Spam filter, topic classification
TF-IDF         Vocabulary (huge)    Partial ✅   ❌       Search engines, document retrieval
Word2Vec       100–300              ✅           ❌       Semantic similarity, analogies
BERT / GPT     768–4096             ✅✅         ✅✅     Question answering, translation, chatbots

🎓 Key Takeaways for Students
  1. Text must be converted to numbers before machines can process it.
  2. One-Hot gives fixed-length vectors but no meaning.
  3. Bag-of-Words counts word occurrences — works well for many tasks!
  4. TF-IDF weights words by importance: words that are frequent in a document but rare across the collection score highest.
  5. Word2Vec learns dense, meaningful vectors from context.
  6. Transformers (BERT, GPT) produce context-sensitive representations — the state of the art.
  7. Every AI assistant you use today is built on these representations!

Try It Yourself (Python)

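# Bag-of-Words and TF-IDF with scikit-learn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["Andrew is a tall boy",
        "Ram is a good boy Ratna is also good"]

# The default tokenizer lowercases text and drops one-letter tokens such as "a";
# this token_pattern keeps them so the vocabulary has the same 9 words as Section 5.
pattern = r"(?u)\b\w+\b"

# BoW: one row per document, one column per vocabulary word
bow = CountVectorizer(token_pattern=pattern)
counts = bow.fit_transform(docs).toarray()
print(bow.get_feature_names_out())   # column order of the count matrix
print(counts)

# TF-IDF: scikit-learn smooths the IDF and L2-normalises each row by default,
# so the values differ from the hand-worked table in Section 6.
tfidf = TfidfVectorizer(token_pattern=pattern)
print(tfidf.fit_transform(docs).toarray())

# Word2Vec with gensim (version 4 or later, where the parameter is vector_size)
from gensim.models import Word2Vec
sentences = [d.lower().split() for d in docs]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)
# With only two sentences the neighbours are essentially random;
# train on a large corpus for meaningful results.
print(model.wv.most_similar("good"))   # nearest neighbours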

Report prepared by CK Raju, iHub-Data, IIIT Hyderabad  |  May 2026