Chapter 13 – Linear Classifiers

Linear Classifiers — A Complete Learning Report

Understanding Linear Classifiers

Section 01

What Is Classification?

Every day, your brain sorts the world into categories — this is a dog, that is a cat; this email looks important, that one looks like spam; this fruit is ripe, that one is not. This act of sorting things into groups is called classification. Teaching a computer to do the same thing automatically — using examples — is one of the most foundational ideas in machine learning.

The details given here, present one of the oldest, simplest, and still most widely-used classification methods: the linear classifier.

Historical Insight — The Roots of Classification

The mathematical idea of separating data into groups goes back to Francis Galton and Karl Pearson in the 1880s–1900s, who studied how to distinguish one population from another using statistical measurements. By the 1930s, statistician Ronald Fisher formalized this as Linear Discriminant Analysis (LDA) — a method still used in biology, medicine, and finance today. Fisher’s idea was simple: find the one direction in space that best separates two classes when projected onto it. Linear classifiers are a direct descendant of this insight.

📧

Example

Spam Detection

Is this email spam or not spam? A linear classifier reads features of the email and decides.

🫀

Example

Medical Diagnosis

Given a patient’s age, blood pressure, and cholesterol — is there a risk of heart disease?

🌸

Example

Flower Species

Given petal width and sepal length — is this flower an Iris Setosa or Iris Versicolor?

📰

Example

Sentiment Analysis

Is this movie review positive or negative? Classify based on word counts and patterns.

In all these cases, the classifier looks at measurable features (called inputs or features) and assigns one of two labels (called classes, often written as +1 and −1, or A and B). A linear classifier does this using the simplest possible mathematical tool: a straight line (or its multi-dimensional equivalent).

Section 02

Linear Functions as Boundaries

The core idea is disarmingly simple: draw a straight line that separates two groups of points. Everything on one side of the line belongs to Class A; everything on the other side belongs to Class B.

The Equation of a Line

You have almost certainly seen this in school math:

y = mx + c \leftarrow slope-intercept form (what you learn in school) ax + by + c = 0 \leftarrow general form w₁x₁ + w₂x₂ + w₀ = 0 \leftarrow machine learning notation

The slides introduce a clever rewriting of the line equation. Instead of writing y = mx + c, they rename the two coordinate axes as x₁ and x₂ (because in machine learning, both axes represent features of the data — no axis is inherently the “output”). The equation becomes:

g(X) = w₁x₁ + w₂x₂ + w₀ = 0 w₁, w₂ = weights (they control the tilt of the line) w₀ = bias (controls where the line sits)

Key Insight — What g(X) Tells You

The function g(X) is like a “sensor.” For any point X in space:

If g(X) = 0 — the point is exactly on the boundary line
If g(X) > 0 — the point is on the positive side (Class A)
If g(X) < 0 — the point is on the negative side (Class B)

The sign of g(X) is the classifier’s “decision.” This is elegant mathematics at its best — the same formula both defines the boundary and makes the classification decision.

A linear boundary g(x) = 0 divides the feature space. Points where g(x) > 0 are Class A; where g(x) < 0, Class B.

Historical Insight — René Descartes & Coordinate Geometry (1637)

The ability to write a line as an equation is due to René Descartes, who invented the Cartesian coordinate system in 1637. His genius was realizing that geometric shapes (lines, curves) could be described by algebraic equations — and vice versa. This unification of algebra and geometry, called analytic geometry, is the very foundation upon which all linear classifiers are built. Without Descartes, the equation g(X) = 0 would have no meaning.

In one dimension (a single feature, like “height”), the boundary is a point on a number line: taller than 170 cm = adult, shorter = child. In two dimensions it is a line. In three dimensions — a plane. The slides show the beautiful visual of a green semi-transparent plane cutting through a 3D scatter of blue and orange points.

Section 03

A Worked Example

The slides walk through a concrete example step by step. Consider the equation:

g(x): 2x₁ - 3x₂ + 1 = 0

Three points are tested:

Point	Coordinates	g(X) Value	Meaning
P	(7, 5)	g(P) = 2(7) − 3(5) + 1 = 0	Exactly on the line
Q	(5, 2)	g(Q) = 2(5) − 3(2) + 1 = 5 (+ve)	On the positive (Class A) side
R	(2, 4)	g(R) = 2(2) − 3(4) + 1 = −7 (−ve)	On the negative (Class B) side

Pedagogical Insight — Why This Example Works

This example teaches two things simultaneously. First, the mechanics of substituting a point into the equation. Second — and more profoundly — that the magnitude of g(X) carries information about how far from the boundary a point is (when the weight vector is normalized). A large positive g(X) means “deep inside Class A”; a value near zero means “close to the boundary, uncertain.”

This distance interpretation becomes central in a more advanced classifier called the Support Vector Machine (SVM), which explicitly maximizes the margin between classes.

Real-World Application — Credit Scoring

Banks use exactly this idea. A customer is described by features like annual income (x₁) and number of missed payments (x₂). A linear boundary g(x) = 0 separates “creditworthy” from “risky.” A customer far inside the positive region gets the loan; one near zero gets extra scrutiny; one deep in the negative region is rejected. The bank is literally computing g(X) for every applicant.

Section 04

The Linear Classifier

Formally, a linear classifier is a function that assigns a label to a data point based on which side of a linear boundary the point falls:

Given training data: points xₐ \in Class A and x_b \in Class B Find weights w such that: g(xₐ) = wᵀxₐ > 0 for all xₐ \in A g(x_b) = wᵀx_b < 0 for all x_b \in B

The slides show an intuitive animation: a scatter of orange dots (Class A) and blue dots (Class B), with a line slowly being drawn to separate them. This is precisely what the algorithm learns to do.

The Decision Rule

✅

Decision

g(X) > 0 → Class A

Classify as positive class. Example: “This email is NOT spam.”

❌

Decision

g(X) < 0 → Class B

Classify as negative class. Example: “This email IS spam.”

⚠️

Decision

g(X) = 0 → On Boundary

The point is on the decision boundary — the classifier is uncertain.

Historical Insight — Ronald Fisher’s Discriminant (1936)

Ronald Aylmer Fisher, one of the greatest statisticians of the 20th century, published “The Use of Multiple Measurements in Taxonomic Problems” in 1936. He used iris flower measurements (sepal length, sepal width, petal length, petal width) to create a linear boundary that separated three species. His Linear Discriminant Analysis is mathematically equivalent to a linear classifier. The famous “Iris Dataset” he used is still a standard benchmark in machine learning today — nearly 90 years later.

Section 05

Higher Dimensions & Hyperplanes

The slides elegantly extend the 2D line to higher dimensions. This is not just a mathematical trick — in real problems, data has dozens or even millions of features.

Dimensions	Equation Form	Boundary Shape	Example
1D	w₁x₁ + w₀ = 0	Point on a number line	Is a patient’s glucose level above threshold?
2D	w₁x₁ + w₂x₂ + w₀ = 0	Line in a plane	Iris species from 2 measurements
3D	w₁x₁ + w₂x₂ + w₃x₃ + w₀ = 0	Plane in 3D space	Cancer risk from 3 biomarkers
dD	w₁x₁ + ⋯ + wdxd + w₀ = 0	Hyperplane	Spam detection from 50,000 word features

“In 4 or more dimensions, the boundary is called a hyperplane — a concept we cannot visualize, but can compute perfectly. This is where mathematics gives us superpowers beyond human intuition.” — Paraphrased from the slides, CK Raju

Key Insight — Why Dimensionality Matters

A stunning result in mathematics (related to the Perceptron Convergence Theorem) states: if two classes are linearly separable in d dimensions, a linear classifier will always find the separating hyperplane. More remarkably, even when data is not separable in its original space, it can often be made separable by projecting it into a higher-dimensional space. This is the idea behind the kernel trick used in Support Vector Machines.

Application — Face Recognition (Eigenfaces, 1991)

A grayscale 100×100 pixel image is a point in 10,000-dimensional space (one dimension per pixel). Linear classifiers can separate “faces” from “non-faces” in this very high-dimensional space. The Eigenfaces approach by Turk & Pentland (1991) at MIT used linear methods to recognize faces — one of the first practical computer vision systems. Today’s face unlock on your phone descends from this lineage.

Section 06

Vector Notation

The slides introduce a compact and powerful way to write the linear equation using vectors. This is the language of all modern machine learning.

Weight vector: w = [w₁, w₂, \dots, wₐ, w₀] Augmented feature vector (appends 1 for bias): x = [x₁, x₂, \dots, xₐ, 1] The entire linear equation becomes simply: g(x) = wᵀx = 0 Alternative (bias separate): g(x): wᵀx + b = 0

Why This Notation Is Brilliant

The inner product wᵀx (also called the dot product) collapses an entire weighted sum into a single compact symbol. This is not just aesthetic — it enables:

⚡

Benefit

Computational Speed

Modern CPUs and GPUs are optimized to compute dot products billions of times per second using matrix operations (BLAS libraries).

📐

Benefit

Geometric Meaning

wᵀx is the projection of x onto w. The weight vector w is perpendicular to the decision boundary — this has deep geometric significance.

🔗

Benefit

Generalization

The same formula works for 2 features or 2 million features. Modern neural networks are built from millions of such dot products.

Historical Insight — Vectors & Linear Algebra

The concept of a vector was formalized by Hermann Grassmann in 1844 and Josiah Willard Gibbs in the 1880s. The dot product was introduced as part of vector algebra. Arthur Cayley developed matrix notation in 1858. These 19th-century mathematical tools, invented to describe physics, became the computational backbone of machine learning 150 years later. The notation wᵀx is simultaneously a physics concept (work = force · displacement), a geometry concept (projection), and a computing operation (multiply-and-accumulate). Mathematics is the universal language.

The Augmented Vector Trick

Notice the slides append a 1 to the feature vector: x = [x₁, x₂, ..., xd, 1]. This is elegant engineering — by including the constant 1 in the vector and including w₀ (the bias) in the weight vector, the messy separate “+ bias” term disappears. The entire linear classifier, bias included, is now simply wᵀx. This is called the augmented or homogeneous representation and is used universally in practice.

Section 07

Learning from Data

So far we have assumed the weights w are given. But in practice, we don’t know the weights — the whole point of machine learning is to discover them from examples. This is the “learning” part of machine learning.

The Core Problem

You are given a dataset of labeled points: {(x₁ᵢ, x₂ᵢ, yᵢ)} where yᵢ = +1 for Class A and yᵢ = −1 for Class B. You need to find the weight vector w such that the linear function g(x) = wᵀx correctly classifies every point.

The Learning Loop

The slides present the learning process as a beautiful iterative loop — a feedback cycle that gradually improves the classifier. This is the fundamental pattern of nearly all machine learning algorithms:

Initialize Randomly

Start with random values for the weights w. The initial line is random and likely wrong — but that’s okay. We just need a starting point.

Evaluate “Goodness”

Measure how well the current line classifies the training data. Compute a loss function J(w) that quantifies the errors. The better the line, the lower J(w).

Move to a Better Position

Adjust the weights w to reduce J(w). The key question: in which direction should we adjust? This is answered by the gradient.

Repeat Until Satisfied

Iterate steps 2 and 3 until J(w) falls below a threshold τ (tau). At that point, the classifier is “good enough” and training stops.

Historical Insight — Iterative Learning as a Concept

The idea of improving a model iteratively by measuring its error was present in the work of Adrien-Marie Legendre and Carl Friedrich Gauss in the early 1800s (least squares method). However, applying it to classification rather than regression — and doing so computationally — had to wait for the computer age. Norbert Wiener’s Cybernetics (1948) formalized the concept of feedback loops in machines, directly inspiring early neural network researchers. Every time a machine learning algorithm “learns,” it is running a feedback loop Wiener would have recognized.

Application — How AlphaGo Learned

DeepMind’s AlphaGo (2016), which defeated the world champion in the ancient game of Go, learned by playing millions of games against itself — a vastly more complex version of the same loop: initialize, evaluate, adjust, repeat. The core feedback mechanism is identical to what is shown in these slides. What changed was the scale and the use of deep neural networks — but the loop itself remained the same.

Section 08

Loss Functions

The slides ask: how do we define “goodness”? The answer is a loss function J(w) — a number that measures how bad the classifier currently is. Our goal is to minimize this number.

Candidate 1: Count of Misclassified Points

The most obvious choice: count how many points are on the wrong side of the line. If 3 out of 20 points are misclassified, J = 3.

Why This Doesn’t Work — The Non-Differentiability Problem

This loss function has a critical flaw: it is not differentiable. When you move the line slightly, the count of misclassified points jumps abruptly (from 3 to 2 or 4 instantly). You cannot compute a smooth gradient — which means you cannot use calculus to guide the optimization. This is a fundamental insight: the mathematical tools we want to use (calculus, gradients) require smooth functions.

Candidate 2: The Perceptron Loss

The slides derive a much better loss function. The key observations are:

For a correctly classified sample i: Class A (yᵢ = +1): wᵀxᵢ > 0 \to product yᵢ(wᵀxᵢ) > 0 Class B (yᵢ = -1): wᵀxᵢ < 0 \to product yᵢ(wᵀxᵢ) > 0 For a mis classified sample: yᵢ(wᵀxᵢ) < 0 So the loss for one misclassified sample is: J(x) = -yᵢ \cdot wᵀxᵢ Gradient of J: \nablaJ = \partialJ/\partialw = -yᵢxᵢ \leftarrow The Perceptron Update Rule

Historical Insight — Loss Functions & Optimization

The concept of minimizing a loss (or cost) function to fit a model to data was pioneered by Legendre (1805) and Gauss (1809) in the context of least squares for astronomy. However, applying this to classification — where the “error” is not a numerical residual but a discrete mistake — required a new way of thinking about loss. The Perceptron loss, derived in the slides, is one of the earliest smooth surrogates for classification error. Modern deep learning uses more sophisticated versions (cross-entropy loss, hinge loss, focal loss) but the conceptual framework is identical.

Geometric View of the Loss Landscape

The slides show a beautiful 3D visualization: the loss J plotted as a surface over the parameter space (w₁, w₂). It looks like a curved bowl — high at the edges, low in the middle. Learning is the process of rolling a ball down this bowl to find the lowest point (minimum loss). The gradient tells us which way is “downhill.”

The loss surface J(w) as a function of weights. Gradient descent rolls from the starting point (red) down toward the minimum (teal), guided by the gradient ∇J.

Section 09

Gradient Descent

Gradient descent is the algorithm that guides the classifier toward better and better weights. The slides present it cleanly in 4 steps:

Step 1: Initialize: w⁰ = random values Step 2: Compute gradient: \partialJ/\partialw Step 3: Update: wᵗ⁺¹ = wᵗ - η \cdot (\partialJ/\partialw) Step 4: Repeat until η \cdot (\partialJ/\partialw) becomes small (convergence) η (eta) = learning rate — controls step size

Understanding the Learning Rate η

The learning rate η (eta) is one of the most important hyperparameters in machine learning. Think of it as: how big a step do we take downhill on each iteration?

🏃

η too large

Overshooting

You take such large steps that you leap over the minimum and oscillate. Learning becomes unstable or diverges.

🐢

η too small

Slow Convergence

You creep toward the minimum with tiny steps — correct, but takes thousands of iterations and much compute time.

✨

η just right

Goldilocks Zone

Converges efficiently to the minimum in a reasonable number of iterations — the goal of every ML practitioner.

Historical Insight — Gradient Descent

Augustin-Louis Cauchy first described a method of steepest descent for solving systems of equations in 1847. However, applying it to machine learning had to wait for Norbert Wiener and, crucially, Paul Werbos (1974) who showed how to compute gradients through layered systems using backpropagation. The specific version in these slides — updating weights one sample at a time — is called Stochastic Gradient Descent (SGD). SGD was popularized by LeCun, Bottou, and others in the 1990s and remains the dominant optimization algorithm in deep learning today, used to train GPT, Gemini, and every major AI system you know.

Application — Training GPT-4

OpenAI’s GPT-4 has approximately 1.8 trillion parameters (weights). Training it used gradient descent — computing ∂J/∂w for each of those 1.8 trillion weights, iteratively. The mathematics on slide 64 of these slides — wᵗ⁺¹ = wᵗ − η · ∂J/∂w — is literally the same update rule being applied to every single weight of every major language model in the world today. The scale is unimaginable; the principle is what you are studying right now.

Section 10

The Perceptron

The slides culminate in a detailed walkthrough of the Perceptron — the first machine learning algorithm that could learn a linear classifier from data. This is where everything comes together.

From Biology to Mathematics

Slide 74 shows a beautiful diagram comparing a biological neuron with a mathematical model:

A biological neuron fires when input stimuli (weighted by synapse strength) cross a threshold — exactly analogous to the mathematical perceptron computing wᵀx.

Historical Insight — Rosenblatt’s Perceptron (1958)

Frank Rosenblatt at the Cornell Aeronautical Laboratory published “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain” in Psychological Review, 1958 — the very paper shown as Slide 2 of these lecture notes. He built a physical machine called the Mark I Perceptron that could learn to classify simple visual patterns. The New York Times proclaimed it would be capable of walking, talking, and reproducing itself. This was, perhaps, the first “AI hype cycle.” The mathematical foundation was inspired by the neuron model of McCulloch & Pitts (1943), who first showed that logical operations could be computed by simple threshold units.

The Perceptron Algorithm in Detail

The slides walk through the algorithm step by step using a real 1D dataset of red and green points. The key steps after data preparation are:

The Perceptron Learning — Step by Step Animation

The following animation plays through Slides 77–112 from the original lecture, showing the complete Perceptron learning sequence: from the actual data samples, through augmentation, label modification, and each iteration of the algorithm as the weight vector is updated step by step until convergence.

Data Preparation

Raw samples x are first augmented by appending 1: [x, 1]. Then they are modified to align labels: for Class A (y=+1), keep as is; for Class B (y=−1), flip the sign: y·[x, 1]. This clever transformation means a correctly classified sample always produces a positive dot product with w.

The Update Rule

For each misclassified sample yᵢ: If wᵀyᵢ < 0 (misclassified), then: w \leftarrow w + yᵢ \leftarrow add the misclassified sample to the weight vector! Repeat until all samples are correctly classified.

The slides then show 6 iterations on a real dataset. Starting from a random weight vector a₀ = (−0.75, 0.25), the algorithm progressively converges to a₆ = (2.44, −1.75), corresponding to the decision boundary 2.44x − 1.75 = 0.

Progression of the Weight Vector

Slide 119 shows the most visually striking result: the weight vector at each iteration plotted as an arrow in weight space. The vectors trace a path from a₀ to a₆, converging to the correct solution:

Each arrow represents the weight vector at one iteration of the Perceptron algorithm. The vectors rotate and grow, converging to the final solution a₆ = (2.44, −1.75) which correctly separates all training samples.

Historical Insight — The Perceptron Convergence Theorem

Novikoff (1963) proved the Perceptron Convergence Theorem: if the training data is linearly separable, the Perceptron algorithm is guaranteed to find a solution in a finite number of steps. This was a milestone — the first mathematical proof that a machine learning algorithm would work. However, Marvin Minsky and Seymour Papert published Perceptrons (1969) showing that the single-layer Perceptron cannot solve the XOR problem (not linearly separable). This book is widely credited with triggering the first “AI Winter” — a decade-long collapse of funding for neural network research. Progress resumed only with the invention of multi-layer networks and backpropagation in the 1980s.

Section 11

Important Concepts Not in the Slides

The slides provide an excellent foundation. However, a complete understanding of linear classifiers requires several additional concepts. Here they are, with their historical context and relevance to what was taught.

Missing Concept 1 — The Bias–Variance Tradeoff

The slides teach you how to fit a linear classifier to data, but not how well it should fit, or what happens when it fits too well. If a model memorizes training data perfectly (zero training error) but fails on new data, it is overfitting. A linear classifier’s limited expressiveness is actually a feature — it is unlikely to overfit. More flexible models can overfit easily.

Historical note: The bias-variance decomposition was formalized by Geman, Bienenstock & Doursat (1992), one of the foundational papers in statistical learning theory. It explains why simpler models sometimes outperform complex ones.

Missing Concept 2 — Non-Linearly Separable Data

What happens when no straight line can separate the two classes? The Perceptron never converges — it loops forever. The slides mention this will not work but do not address solutions. The key ideas are:

Feature engineering: Add polynomial or interaction features (x₁², x₁x₂, etc.) to make the data linearly separable in a higher-dimensional space.
Kernel methods: The kernel trick (Mercer, 1909; popularized by Boser, Guyon & Vapnik, 1992) implicitly maps data to infinite dimensions without explicit computation.
Multi-layer networks: Stack multiple linear classifiers with nonlinear activation functions — this creates neural networks capable of learning any boundary.

Missing Concept 3 — Logistic Regression & Probabilistic Output

The slides’ linear classifier gives a hard +1/−1 decision. But often we want probability: “there is a 73% chance this email is spam.” Logistic regression wraps the linear function g(x) = wᵀx inside a sigmoid function: σ(g) = 1 / (1 + e^−g), squashing it to [0, 1]. This gives a probability and uses cross-entropy loss instead of the Perceptron loss. Logistic regression (despite its name) is a classifier — still fundamentally linear — and remains one of the most widely used algorithms in medicine, epidemiology, and social science.

Historical note: The logistic function was introduced by Pierre François Verhulst in 1838 for population modeling. Its application to classification was developed by Joseph Berkson in the 1940s.

Missing Concept 4 — Regularization

The slides minimize J(w) without any constraint on how large w can be. In practice, very large weights cause instability and overfitting. Regularization adds a penalty on weight size to the loss function:

L2 regularization (Ridge): Add λ‖w‖² to the loss — penalizes large weights smoothly (Tikhonov, 1943)
L1 regularization (Lasso): Add λ‖w‖ to the loss — forces many weights to exactly zero, creating sparse models (Tibshirani, 1996)

Regularization is what makes machine learning work in practice on noisy real-world data. Every major model uses it.

Missing Concept 5 — The Support Vector Machine (SVM)

Among all possible lines that separate two classes, which is best? The Perceptron finds any separating hyperplane. The SVM finds the one that maximizes the margin — the gap between the boundary and the nearest points of each class. This “maximum margin” classifier was developed by Vapnik and Chervonenkis in 1963 and refined into its modern form by Boser, Guyon & Vapnik in 1992. SVMs were the dominant classification algorithm in the 1990s–2000s, until deep learning overtook them around 2012. They remain the gold standard for small, clean datasets.

Missing Concept 6 — Multi-Class Classification

The slides discuss only two-class (binary) classification. Real problems often have many classes: “Is this image a cat, dog, bird, or fish?” There are two standard strategies:

One-vs-Rest: Train one binary classifier per class (e.g., “cat vs not-cat”), then pick the class with highest confidence.
One-vs-One: Train one classifier for each pair of classes, then vote.
Softmax regression: A direct multi-class generalization of logistic regression.

Missing Concept 7 — Evaluation Metrics Beyond Accuracy

The slides measure goodness by “number of misclassified samples” — equivalent to accuracy. But accuracy is often misleading. If 99% of emails are not spam, a classifier that always says “not spam” has 99% accuracy but is useless. Better metrics include precision, recall, F1 score, ROC-AUC, and the confusion matrix. These are essential for responsible deployment of classifiers in medicine, criminal justice, and finance.

✦ ✦ ✦

Section 12

Legacy & Significance

We began with a simple line. We end at the frontier of artificial intelligence. Let us trace the full arc — from Frank Rosenblatt’s Mark I Perceptron to the AI systems reshaping civilization today.

The Family Tree of Modern AI

1943

McCulloch & Pitts Neuron

The first mathematical model of a neuron — a threshold unit that fires when weighted inputs exceed a value. The conceptual ancestor of every artificial neuron.

1958

Rosenblatt’s Perceptron

The first learning algorithm — the one studied in detail in these slides. The first machine that could be trained from data to classify patterns.

1969

Minsky & Papert — “Perceptrons”

Proved limitations of single-layer perceptrons. Triggered AI Winter — but also forced researchers to think about multi-layer architectures.

1986

Backpropagation (Rumelhart, Hinton & Williams)

Showed how to compute gradients through multi-layer networks. Revived neural network research. The gradient descent of the slides extended to deep architectures.

1998

LeNet — Convolutional Neural Networks (LeCun)

Applied multi-layer networks with shared weights to image recognition. Reading ZIP codes on mail — the first commercially deployed neural network.

2012

AlexNet — The Deep Learning Revolution

Krizhevsky, Sutskever & Hinton smashed image recognition records using GPUs and deep networks. The era of modern deep learning begins. All previous methods obsoleted in computer vision.

2017

“Attention Is All You Need” — Transformers

Vaswani et al. at Google introduced the Transformer architecture. Built from nothing but weighted dot products (linear operations) with attention mechanisms. GPT, BERT, and all modern LLMs descend from this paper.

2022–2026

Generative AI — ChatGPT, Gemini, Claude

Large Language Models capable of conversation, reasoning, and creativity. Each neuron in these models performs exactly the operation studied in this lecture: wᵀx + b.

What Linear Classifiers Gave to the World

🔬 The Scientific Method for Machines

Linear classifiers introduced the fundamental framework of machine learning: define a model with parameters, define a loss function, optimize iteratively. This framework is universal — it applies equally to linear classifiers, deep neural networks, and transformers. You have learned the grammar; the rest of AI is an elaboration of the same sentences.

🧮 The Role of Linear Algebra

The vector notation wᵀx introduced in these slides is the computational primitive of all AI. A Transformer with 175 billion parameters spends most of its computation time on matrix multiplications — which are nothing but simultaneous dot products. Every GPU is designed, at the hardware level, to compute these operations at maximum speed. Linear algebra is not a prerequisite for machine learning — it IS machine learning.

📐 The Geometry of Intelligence

The idea that data can be separated by geometric boundaries — and that learning is the discovery of those boundaries — reframes intelligence as a geometric phenomenon. High-dimensional geometry (hyperplanes, margins, projections) is not abstract mathematics. It is the operating principle of every AI system that has ever existed. The line drawn in these slides is the first hyperplane of artificial intelligence.

🌐 Real-World Impact — Applications Today

Linear classifiers directly underpin: spam filters (since the 1990s), credit scoring systems, medical diagnostic algorithms, sentiment analysis tools, and document classification systems used in law and government. They remain the first choice for practitioners on small, interpretable datasets where you need to explain decisions to regulators.

“The Perceptron, trained by the simple rule you studied in this lecture — wᵀx ≥ 0, classify as A; otherwise classify as B — is the ancestor of every large language model, every image recognition system, every recommendation engine, every self-driving car, and every AI that will ever exist. A line in 2D space became the organizing idea of civilization’s most transformative technology.” — Data Science Educational Commentary

A Message to Young Learners

What you have studied in this lecture is not a historical curiosity. It is the living foundation of one of the most powerful technologies humanity has ever built. The weight update rule w ← w + y·x — four symbols — encodes the learning mechanism used in machines that write poetry, diagnose cancer, translate languages, compose music, and hold conversations.

You are not studying the past. You are studying the deep structure of the present — and of the future you will inhabit and shape.

The line between two dots. That is where it all began.

Quick Reference

Applications Across Domains

Linear classifiers and their descendants appear across virtually every domain of human activity:

🏥Medicine — Disease diagnosis

💳Finance — Fraud detection

📧Tech — Spam filtering

🌾Agriculture — Crop disease ID

🔒Security — Intrusion detection

📱Social Media — Content moderation

🌍Climate — Weather classification

🎓Education — Essay scoring

🚗Automotive — Object detection

🧬Genomics — Gene expression

🛒Retail — Customer segmentation

⚖️Law — Document classification

Chapter 13 – Linear Classifiers

What Is Classification?

Spam Detection

Medical Diagnosis

Flower Species

Sentiment Analysis

Linear Functions as Boundaries

The Equation of a Line

A Worked Example

The Linear Classifier

The Decision Rule

g(X) > 0 → Class A

g(X) < 0 → Class B

g(X) = 0 → On Boundary

Higher Dimensions & Hyperplanes

Vector Notation

Why This Notation Is Brilliant

Computational Speed

Geometric Meaning

Generalization

The Augmented Vector Trick

Learning from Data

The Learning Loop

Initialize Randomly

Evaluate “Goodness”

Move to a Better Position

Repeat Until Satisfied

Loss Functions

Candidate 1: Count of Misclassified Points

Candidate 2: The Perceptron Loss

Geometric View of the Loss Landscape

Gradient Descent

Understanding the Learning Rate η

Overshooting

Slow Convergence

Goldilocks Zone

The Perceptron

From Biology to Mathematics

The Perceptron Algorithm in Detail

The Perceptron Learning — Step by Step Animation

Data Preparation

The Update Rule

Progression of the Weight Vector

Important Concepts Not in the Slides

Legacy & Significance

The Family Tree of Modern AI

What Linear Classifiers Gave to the World

🔬 The Scientific Method for Machines

🧮 The Role of Linear Algebra

📐 The Geometry of Intelligence

🌐 Real-World Impact — Applications Today

A Message to Young Learners

Applications Across Domains

Brief Description

iHub-Data