Chapter 13 – Linear Classifiers
Understanding Linear Classifiers
What Is Classification?
Every day, your brain sorts the world into categories — this is a dog, that is a cat; this email looks important, that one looks like spam; this fruit is ripe, that one is not. This act of sorting things into groups is called classification. Teaching a computer to do the same thing automatically — using examples — is one of the most foundational ideas in machine learning.
The details given here, present one of the oldest, simplest, and still most widely-used classification methods: the linear classifier.
The mathematical idea of separating data into groups goes back to Francis Galton and Karl Pearson in the 1880s–1900s, who studied how to distinguish one population from another using statistical measurements. By the 1930s, statistician Ronald Fisher formalized this as Linear Discriminant Analysis (LDA) — a method still used in biology, medicine, and finance today. Fisher’s idea was simple: find the one direction in space that best separates two classes when projected onto it. Linear classifiers are a direct descendant of this insight.
Spam Detection
Is this email spam or not spam? A linear classifier reads features of the email and decides.
Medical Diagnosis
Given a patient’s age, blood pressure, and cholesterol — is there a risk of heart disease?
Flower Species
Given petal width and sepal length — is this flower an Iris Setosa or Iris Versicolor?
Sentiment Analysis
Is this movie review positive or negative? Classify based on word counts and patterns.
In all these cases, the classifier looks at measurable features (called inputs or features) and assigns one of two labels (called classes, often written as +1 and −1, or A and B). A linear classifier does this using the simplest possible mathematical tool: a straight line (or its multi-dimensional equivalent).
Linear Functions as Boundaries
The core idea is disarmingly simple: draw a straight line that separates two groups of points. Everything on one side of the line belongs to Class A; everything on the other side belongs to Class B.
The Equation of a Line
You have almost certainly seen this in school math:
ax + by + c = 0 ← general form
w₁x₁ + w₂x₂ + w₀ = 0 ← machine learning notation
The slides introduce a clever rewriting of the line equation. Instead of writing y = mx + c, they rename the two coordinate axes as x₁ and x₂ (because in machine learning, both axes represent features of the data — no axis is inherently the “output”). The equation becomes:
w₁, w₂ = weights (they control the tilt of the line)
w₀ = bias (controls where the line sits)
The function g(X) is like a “sensor.” For any point X in space:
- If g(X) = 0 — the point is exactly on the boundary line
- If g(X) > 0 — the point is on the positive side (Class A)
- If g(X) < 0 — the point is on the negative side (Class B)
The sign of g(X) is the classifier’s “decision.” This is elegant mathematics at its best — the same formula both defines the boundary and makes the classification decision.
The ability to write a line as an equation is due to René Descartes, who invented the Cartesian coordinate system in 1637. His genius was realizing that geometric shapes (lines, curves) could be described by algebraic equations — and vice versa. This unification of algebra and geometry, called analytic geometry, is the very foundation upon which all linear classifiers are built. Without Descartes, the equation g(X) = 0 would have no meaning.
In one dimension (a single feature, like “height”), the boundary is a point on a number line: taller than 170 cm = adult, shorter = child. In two dimensions it is a line. In three dimensions — a plane. The slides show the beautiful visual of a green semi-transparent plane cutting through a 3D scatter of blue and orange points.
A Worked Example
The slides walk through a concrete example step by step. Consider the equation:
Three points are tested:
| Point | Coordinates | g(X) Value | Meaning |
|---|---|---|---|
| P | (7, 5) | g(P) = 2(7) − 3(5) + 1 = 0 | Exactly on the line |
| Q | (5, 2) | g(Q) = 2(5) − 3(2) + 1 = 5 (+ve) | On the positive (Class A) side |
| R | (2, 4) | g(R) = 2(2) − 3(4) + 1 = −7 (−ve) | On the negative (Class B) side |
This example teaches two things simultaneously. First, the mechanics of substituting a point into the equation. Second — and more profoundly — that the magnitude of g(X) carries information about how far from the boundary a point is (when the weight vector is normalized). A large positive g(X) means “deep inside Class A”; a value near zero means “close to the boundary, uncertain.”
This distance interpretation becomes central in a more advanced classifier called the Support Vector Machine (SVM), which explicitly maximizes the margin between classes.
Banks use exactly this idea. A customer is described by features like annual income (x₁) and number of missed payments (x₂). A linear boundary g(x) = 0 separates “creditworthy” from “risky.” A customer far inside the positive region gets the loan; one near zero gets extra scrutiny; one deep in the negative region is rejected. The bank is literally computing g(X) for every applicant.
The Linear Classifier
Formally, a linear classifier is a function that assigns a label to a data point based on which side of a linear boundary the point falls:
Find weights w such that:
g(xₐ) = wᵀxₐ > 0 for all xₐ ∈ A
g(x_b) = wᵀx_b < 0 for all x_b ∈ B
The slides show an intuitive animation: a scatter of orange dots (Class A) and blue dots (Class B), with a line slowly being drawn to separate them. This is precisely what the algorithm learns to do.
The Decision Rule
g(X) > 0 → Class A
Classify as positive class. Example: “This email is NOT spam.”
g(X) < 0 → Class B
Classify as negative class. Example: “This email IS spam.”
g(X) = 0 → On Boundary
The point is on the decision boundary — the classifier is uncertain.
Ronald Aylmer Fisher, one of the greatest statisticians of the 20th century, published “The Use of Multiple Measurements in Taxonomic Problems” in 1936. He used iris flower measurements (sepal length, sepal width, petal length, petal width) to create a linear boundary that separated three species. His Linear Discriminant Analysis is mathematically equivalent to a linear classifier. The famous “Iris Dataset” he used is still a standard benchmark in machine learning today — nearly 90 years later.
Higher Dimensions & Hyperplanes
The slides elegantly extend the 2D line to higher dimensions. This is not just a mathematical trick — in real problems, data has dozens or even millions of features.
| Dimensions | Equation Form | Boundary Shape | Example |
|---|---|---|---|
| 1D | w₁x₁ + w₀ = 0 | Point on a number line | Is a patient’s glucose level above threshold? |
| 2D | w₁x₁ + w₂x₂ + w₀ = 0 | Line in a plane | Iris species from 2 measurements |
| 3D | w₁x₁ + w₂x₂ + w₃x₃ + w₀ = 0 | Plane in 3D space | Cancer risk from 3 biomarkers |
| dD | w₁x₁ + ⋯ + wdxd + w₀ = 0 | Hyperplane | Spam detection from 50,000 word features |
A stunning result in mathematics (related to the Perceptron Convergence Theorem) states: if two classes are linearly separable in d dimensions, a linear classifier will always find the separating hyperplane. More remarkably, even when data is not separable in its original space, it can often be made separable by projecting it into a higher-dimensional space. This is the idea behind the kernel trick used in Support Vector Machines.
A grayscale 100×100 pixel image is a point in 10,000-dimensional space (one dimension per pixel). Linear classifiers can separate “faces” from “non-faces” in this very high-dimensional space. The Eigenfaces approach by Turk & Pentland (1991) at MIT used linear methods to recognize faces — one of the first practical computer vision systems. Today’s face unlock on your phone descends from this lineage.
Vector Notation
The slides introduce a compact and powerful way to write the linear equation using vectors. This is the language of all modern machine learning.
Augmented feature vector (appends 1 for bias): x = [x₁, x₂, …, xₐ, 1]
The entire linear equation becomes simply:
g(x) = wᵀx = 0
Alternative (bias separate): g(x): wᵀx + b = 0
Why This Notation Is Brilliant
The inner product wᵀx (also called the dot product) collapses an entire weighted sum into a single compact symbol. This is not just aesthetic — it enables:
Computational Speed
Modern CPUs and GPUs are optimized to compute dot products billions of times per second using matrix operations (BLAS libraries).
Geometric Meaning
wᵀx is the projection of x onto w. The weight vector w is perpendicular to the decision boundary — this has deep geometric significance.
Generalization
The same formula works for 2 features or 2 million features. Modern neural networks are built from millions of such dot products.
The concept of a vector was formalized by Hermann Grassmann in 1844 and Josiah Willard Gibbs in the 1880s. The dot product was introduced as part of vector algebra. Arthur Cayley developed matrix notation in 1858. These 19th-century mathematical tools, invented to describe physics, became the computational backbone of machine learning 150 years later. The notation wᵀx is simultaneously a physics concept (work = force · displacement), a geometry concept (projection), and a computing operation (multiply-and-accumulate). Mathematics is the universal language.
The Augmented Vector Trick
Notice the slides append a 1 to the feature vector: x = [x₁, x₂, ..., xd, 1]. This is elegant engineering — by including the constant 1 in the vector and including w₀ (the bias) in the weight vector, the messy separate “+ bias” term disappears. The entire linear classifier, bias included, is now simply wᵀx. This is called the augmented or homogeneous representation and is used universally in practice.
Learning from Data
So far we have assumed the weights w are given. But in practice, we don’t know the weights — the whole point of machine learning is to discover them from examples. This is the “learning” part of machine learning.
You are given a dataset of labeled points: {(x₁ᵢ, x₂ᵢ, yᵢ)} where yᵢ = +1 for Class A and yᵢ = −1 for Class B. You need to find the weight vector w such that the linear function g(x) = wᵀx correctly classifies every point.
The Learning Loop
The slides present the learning process as a beautiful iterative loop — a feedback cycle that gradually improves the classifier. This is the fundamental pattern of nearly all machine learning algorithms:
Initialize Randomly
Start with random values for the weights w. The initial line is random and likely wrong — but that’s okay. We just need a starting point.
Evaluate “Goodness”
Measure how well the current line classifies the training data. Compute a loss function J(w) that quantifies the errors. The better the line, the lower J(w).
Move to a Better Position
Adjust the weights w to reduce J(w). The key question: in which direction should we adjust? This is answered by the gradient.
Repeat Until Satisfied
Iterate steps 2 and 3 until J(w) falls below a threshold τ (tau). At that point, the classifier is “good enough” and training stops.
The idea of improving a model iteratively by measuring its error was present in the work of Adrien-Marie Legendre and Carl Friedrich Gauss in the early 1800s (least squares method). However, applying it to classification rather than regression — and doing so computationally — had to wait for the computer age. Norbert Wiener’s Cybernetics (1948) formalized the concept of feedback loops in machines, directly inspiring early neural network researchers. Every time a machine learning algorithm “learns,” it is running a feedback loop Wiener would have recognized.
DeepMind’s AlphaGo (2016), which defeated the world champion in the ancient game of Go, learned by playing millions of games against itself — a vastly more complex version of the same loop: initialize, evaluate, adjust, repeat. The core feedback mechanism is identical to what is shown in these slides. What changed was the scale and the use of deep neural networks — but the loop itself remained the same.
Loss Functions
The slides ask: how do we define “goodness”? The answer is a loss function J(w) — a number that measures how bad the classifier currently is. Our goal is to minimize this number.
Candidate 1: Count of Misclassified Points
The most obvious choice: count how many points are on the wrong side of the line. If 3 out of 20 points are misclassified, J = 3.
This loss function has a critical flaw: it is not differentiable. When you move the line slightly, the count of misclassified points jumps abruptly (from 3 to 2 or 4 instantly). You cannot compute a smooth gradient — which means you cannot use calculus to guide the optimization. This is a fundamental insight: the mathematical tools we want to use (calculus, gradients) require smooth functions.
Candidate 2: The Perceptron Loss
The slides derive a much better loss function. The key observations are:
Class A (yᵢ = +1): wᵀxᵢ > 0 → product yᵢ(wᵀxᵢ) > 0
Class B (yᵢ = −1): wᵀxᵢ < 0 → product yᵢ(wᵀxᵢ) > 0
For a misclassified sample:
yᵢ(wᵀxᵢ) < 0
So the loss for one misclassified sample is:
J(x) = −yᵢ · wᵀxᵢ
Gradient of J:
∇J = ∂J/∂w = −yᵢxᵢ ← The Perceptron Update Rule
The concept of minimizing a loss (or cost) function to fit a model to data was pioneered by Legendre (1805) and Gauss (1809) in the context of least squares for astronomy. However, applying this to classification — where the “error” is not a numerical residual but a discrete mistake — required a new way of thinking about loss. The Perceptron loss, derived in the slides, is one of the earliest smooth surrogates for classification error. Modern deep learning uses more sophisticated versions (cross-entropy loss, hinge loss, focal loss) but the conceptual framework is identical.
Geometric View of the Loss Landscape
The slides show a beautiful 3D visualization: the loss J plotted as a surface over the parameter space (w₁, w₂). It looks like a curved bowl — high at the edges, low in the middle. Learning is the process of rolling a ball down this bowl to find the lowest point (minimum loss). The gradient tells us which way is “downhill.”
Gradient Descent
Gradient descent is the algorithm that guides the classifier toward better and better weights. The slides present it cleanly in 4 steps:
Step 2: Compute gradient: ∂J/∂w
Step 3: Update: wᵗ⁺¹ = wᵗ − η · (∂J/∂w)
Step 4: Repeat until η · (∂J/∂w) becomes small (convergence)
η (eta) = learning rate — controls step size
Understanding the Learning Rate η
The learning rate η (eta) is one of the most important hyperparameters in machine learning. Think of it as: how big a step do we take downhill on each iteration?
Overshooting
You take such large steps that you leap over the minimum and oscillate. Learning becomes unstable or diverges.
Slow Convergence
You creep toward the minimum with tiny steps — correct, but takes thousands of iterations and much compute time.
Goldilocks Zone
Converges efficiently to the minimum in a reasonable number of iterations — the goal of every ML practitioner.
Augustin-Louis Cauchy first described a method of steepest descent for solving systems of equations in 1847. However, applying it to machine learning had to wait for Norbert Wiener and, crucially, Paul Werbos (1974) who showed how to compute gradients through layered systems using backpropagation. The specific version in these slides — updating weights one sample at a time — is called Stochastic Gradient Descent (SGD). SGD was popularized by LeCun, Bottou, and others in the 1990s and remains the dominant optimization algorithm in deep learning today, used to train GPT, Gemini, and every major AI system you know.
OpenAI’s GPT-4 has approximately 1.8 trillion parameters (weights). Training it used gradient descent — computing ∂J/∂w for each of those 1.8 trillion weights, iteratively. The mathematics on slide 64 of these slides — wᵗ⁺¹ = wᵗ − η · ∂J/∂w — is literally the same update rule being applied to every single weight of every major language model in the world today. The scale is unimaginable; the principle is what you are studying right now.
The Perceptron
The slides culminate in a detailed walkthrough of the Perceptron — the first machine learning algorithm that could learn a linear classifier from data. This is where everything comes together.
From Biology to Mathematics
Slide 74 shows a beautiful diagram comparing a biological neuron with a mathematical model:
Frank Rosenblatt at the Cornell Aeronautical Laboratory published “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain” in Psychological Review, 1958 — the very paper shown as Slide 2 of these lecture notes. He built a physical machine called the Mark I Perceptron that could learn to classify simple visual patterns. The New York Times proclaimed it would be capable of walking, talking, and reproducing itself. This was, perhaps, the first “AI hype cycle.” The mathematical foundation was inspired by the neuron model of McCulloch & Pitts (1943), who first showed that logical operations could be computed by simple threshold units.
The Perceptron Algorithm in Detail
The slides walk through the algorithm step by step using a real 1D dataset of red and green points. The key steps after data preparation are:
The Perceptron Learning — Step by Step Animation
The following animation plays through Slides 77–112 from the original lecture, showing the complete Perceptron learning sequence: from the actual data samples, through augmentation, label modification, and each iteration of the algorithm as the weight vector is updated step by step until convergence.
Data Preparation
Raw samples x are first augmented by appending 1: [x, 1]. Then they are modified to align labels: for Class A (y=+1), keep as is; for Class B (y=−1), flip the sign: y·[x, 1]. This clever transformation means a correctly classified sample always produces a positive dot product with w.
The Update Rule
If wᵀyᵢ < 0 (misclassified), then:
w ← w + yᵢ ← add the misclassified sample to the weight vector!
Repeat until all samples are correctly classified.
The slides then show 6 iterations on a real dataset. Starting from a random weight vector a₀ = (−0.75, 0.25), the algorithm progressively converges to a₆ = (2.44, −1.75), corresponding to the decision boundary 2.44x − 1.75 = 0.
Progression of the Weight Vector
Slide 119 shows the most visually striking result: the weight vector at each iteration plotted as an arrow in weight space. The vectors trace a path from a₀ to a₆, converging to the correct solution:
Novikoff (1963) proved the Perceptron Convergence Theorem: if the training data is linearly separable, the Perceptron algorithm is guaranteed to find a solution in a finite number of steps. This was a milestone — the first mathematical proof that a machine learning algorithm would work. However, Marvin Minsky and Seymour Papert published Perceptrons (1969) showing that the single-layer Perceptron cannot solve the XOR problem (not linearly separable). This book is widely credited with triggering the first “AI Winter” — a decade-long collapse of funding for neural network research. Progress resumed only with the invention of multi-layer networks and backpropagation in the 1980s.
Important Concepts Not in the Slides
The slides provide an excellent foundation. However, a complete understanding of linear classifiers requires several additional concepts. Here they are, with their historical context and relevance to what was taught.
The slides teach you how to fit a linear classifier to data, but not how well it should fit, or what happens when it fits too well. If a model memorizes training data perfectly (zero training error) but fails on new data, it is overfitting. A linear classifier’s limited expressiveness is actually a feature — it is unlikely to overfit. More flexible models can overfit easily.
Historical note: The bias-variance decomposition was formalized by Geman, Bienenstock & Doursat (1992), one of the foundational papers in statistical learning theory. It explains why simpler models sometimes outperform complex ones.
What happens when no straight line can separate the two classes? The Perceptron never converges — it loops forever. The slides mention this will not work but do not address solutions. The key ideas are:
- Feature engineering: Add polynomial or interaction features (x₁², x₁x₂, etc.) to make the data linearly separable in a higher-dimensional space.
- Kernel methods: The kernel trick (Mercer, 1909; popularized by Boser, Guyon & Vapnik, 1992) implicitly maps data to infinite dimensions without explicit computation.
- Multi-layer networks: Stack multiple linear classifiers with nonlinear activation functions — this creates neural networks capable of learning any boundary.
The slides’ linear classifier gives a hard +1/−1 decision. But often we want probability: “there is a 73% chance this email is spam.” Logistic regression wraps the linear function g(x) = wᵀx inside a sigmoid function: σ(g) = 1 / (1 + e^−g), squashing it to [0, 1]. This gives a probability and uses cross-entropy loss instead of the Perceptron loss. Logistic regression (despite its name) is a classifier — still fundamentally linear — and remains one of the most widely used algorithms in medicine, epidemiology, and social science.
Historical note: The logistic function was introduced by Pierre François Verhulst in 1838 for population modeling. Its application to classification was developed by Joseph Berkson in the 1940s.
The slides minimize J(w) without any constraint on how large w can be. In practice, very large weights cause instability and overfitting. Regularization adds a penalty on weight size to the loss function:
- L2 regularization (Ridge): Add λ‖w‖² to the loss — penalizes large weights smoothly (Tikhonov, 1943)
- L1 regularization (Lasso): Add λ‖w‖ to the loss — forces many weights to exactly zero, creating sparse models (Tibshirani, 1996)
Regularization is what makes machine learning work in practice on noisy real-world data. Every major model uses it.
Among all possible lines that separate two classes, which is best? The Perceptron finds any separating hyperplane. The SVM finds the one that maximizes the margin — the gap between the boundary and the nearest points of each class. This “maximum margin” classifier was developed by Vapnik and Chervonenkis in 1963 and refined into its modern form by Boser, Guyon & Vapnik in 1992. SVMs were the dominant classification algorithm in the 1990s–2000s, until deep learning overtook them around 2012. They remain the gold standard for small, clean datasets.
The slides discuss only two-class (binary) classification. Real problems often have many classes: “Is this image a cat, dog, bird, or fish?” There are two standard strategies:
- One-vs-Rest: Train one binary classifier per class (e.g., “cat vs not-cat”), then pick the class with highest confidence.
- One-vs-One: Train one classifier for each pair of classes, then vote.
- Softmax regression: A direct multi-class generalization of logistic regression.
The slides measure goodness by “number of misclassified samples” — equivalent to accuracy. But accuracy is often misleading. If 99% of emails are not spam, a classifier that always says “not spam” has 99% accuracy but is useless. Better metrics include precision, recall, F1 score, ROC-AUC, and the confusion matrix. These are essential for responsible deployment of classifiers in medicine, criminal justice, and finance.
Legacy & Significance
We began with a simple line. We end at the frontier of artificial intelligence. Let us trace the full arc — from Frank Rosenblatt’s Mark I Perceptron to the AI systems reshaping civilization today.
The Family Tree of Modern AI
What Linear Classifiers Gave to the World
🔬 The Scientific Method for Machines
Linear classifiers introduced the fundamental framework of machine learning: define a model with parameters, define a loss function, optimize iteratively. This framework is universal — it applies equally to linear classifiers, deep neural networks, and transformers. You have learned the grammar; the rest of AI is an elaboration of the same sentences.
🧮 The Role of Linear Algebra
The vector notation wᵀx introduced in these slides is the computational primitive of all AI. A Transformer with 175 billion parameters spends most of its computation time on matrix multiplications — which are nothing but simultaneous dot products. Every GPU is designed, at the hardware level, to compute these operations at maximum speed. Linear algebra is not a prerequisite for machine learning — it IS machine learning.
📐 The Geometry of Intelligence
The idea that data can be separated by geometric boundaries — and that learning is the discovery of those boundaries — reframes intelligence as a geometric phenomenon. High-dimensional geometry (hyperplanes, margins, projections) is not abstract mathematics. It is the operating principle of every AI system that has ever existed. The line drawn in these slides is the first hyperplane of artificial intelligence.
🌐 Real-World Impact — Applications Today
Linear classifiers directly underpin: spam filters (since the 1990s), credit scoring systems, medical diagnostic algorithms, sentiment analysis tools, and document classification systems used in law and government. They remain the first choice for practitioners on small, interpretable datasets where you need to explain decisions to regulators.
A Message to Young Learners
What you have studied in this lecture is not a historical curiosity. It is the living foundation of one of the most powerful technologies humanity has ever built. The weight update rule w ← w + y·x — four symbols — encodes the learning mechanism used in machines that write poetry, diagnose cancer, translate languages, compose music, and hold conversations.
You are not studying the past. You are studying the deep structure of the present — and of the future you will inhabit and shape.
The line between two dots. That is where it all began.
Applications Across Domains
Linear classifiers and their descendants appear across virtually every domain of human activity: