Chapter 14 – Gradient Descent Algorithm

Gradient Descent: How Machines Learn to Get Better

A comprehensive study guide

Gradient Descent:
How Machines Learn to Get Better

From Cauchy’s 1847 calculus to the algorithms powering modern artificial intelligence

Based on: Gradient Descent Algorithms — CK Raju, iHub-Data, IIIT Hyderabad, May 2026

Contents

What does it mean for a machine to learn?
Loss functions — measuring how wrong you are
Properties of good loss functions & convexity
The local minima problem & three solutions
The gradient descent update rule
The learning rate — getting it right
Learning rate schedules
Newton descent — optimal steps
Three variants of gradient descent
Comparing the three variants
Concepts beyond the slides: backpropagation, Adam & more
Historical timeline
Why all of this matters: the road to modern AI

What does it mean for a machine to learn?

Think of the last time you practised a skill — shooting a basketball, learning a new recipe, or solving a type of maths problem you had never seen before. You tried something, noticed how far off you were, and adjusted. With every attempt your brain moved toward a better version of the action. That cycle — try, measure, adjust, repeat — is learning.

Machines follow exactly the same logic. A model begins with a random guess at how to perform a task. It measures how wrong that guess is, figures out which direction would make things better, takes a small step in that direction, and repeats until it gets good enough. That process is called gradient descent — the single most important optimisation algorithm in all of machine learning.

“Gradient descent taught machines to improve — and in doing so, changed the world.”

The four big questions that drive this entire topic are: What does it mean for a machine to learn? How does it know if it is doing well or badly? How does it improve step by step? And what can go wrong, and how do we fix it? These slides answer each in turn.

Real-world example — predicting house rents

An app tries to predict the monthly rent of a flat in Bengaluru. On its first attempt it guesses ₹18,000; the actual rent is ₹25,000. The squared error = (25,000 − 18,000)² = 49,000,000. Gradient descent then nudges the app’s internal numbers a little so the next guess is closer. After thousands of such nudges the app becomes quite accurate.

Loss functions — measuring how wrong you are

Before a machine can improve, it needs a score that tells it how badly it is performing. That score is called a loss function (also called a cost function), written J(w). The smaller J(w), the better the model is performing. The machine adjusts its parameters w to minimise J(w).

J(w) measures how wrong the machine’s predictions are. Goal: find w* that minimises J(w).

Consider the house rental example: the model predicts ₹300k; the real price is ₹350k. The error = 350 − 300 = 50 (in thousands). Using squared-error loss, J = (error)² = 2,500. Squaring has two purposes: it makes all errors positive (over-predictions and under-predictions don’t cancel each other out), and it penalises large errors far more harshly than small ones — which matches our intuition that being wildly wrong is much worse than being slightly wrong.

The machine’s entire job is to find the value of w that sits at the lowest point of the J(w) curve. Everything in gradient descent is a systematic strategy for finding that lowest point.

Application domain	Typical task	Common loss function	Why it fits
Real-estate / finance	Predict rent, stock price, exam score	Mean Squared Error (MSE)	Penalises large deviations heavily; convex and easy to optimise
Email / content moderation	Spam detection, sentiment classification	Cross-Entropy Loss	Handles probabilistic class outputs naturally
Medical imaging	Disease classification (rare conditions)	Focal Loss	Handles severe class imbalance; focuses on hard examples
Image generation / GANs	Style transfer, super-resolution	Perceptual Loss	Compares high-level features, not raw pixels — matches human perception
Self-driving vehicles	Object detection bounding boxes	Huber Loss	Robust to outliers (occasional extreme errors in noisy sensor data)
Recommendation systems	Rank products or content for users	Pairwise Ranking Loss	Directly optimises relative ordering rather than absolute scores

Properties of good loss functions & convexity

Not every loss function is equally useful. A good loss function must possess three key properties:

〰

Continuity

No sudden jumps — small changes in w produce small changes in J. Without continuity, gradient descent takes unpredictable leaps and cannot navigate reliably.

∂

Differentiability

We must be able to calculate the slope (gradient) at every point. A jagged, non-smooth surface has points where no slope can be defined.

▽

Few or one local minima

Ideally, the only “valley” is the globally best solution. Multiple valleys mean gradient descent might settle into a suboptimal answer.

The gold standard: convex loss functions

A convex function is shaped like a bowl. Draw a straight line (called a chord) connecting any two points on the curve — that chord will lie entirely above the curve. This geometric property guarantees that only one valley exists. Gradient descent started from any point on a convex function is mathematically guaranteed to find the global minimum, regardless of starting position. Linear regression using MSE loss is convex — which is why it always converges cleanly.

Analogy — the smooth salad bowl

Place a marble anywhere inside a smooth, round salad bowl. No matter where you put it, it always rolls to the very centre. Convex loss functions behave identically — gradient descent always reaches the single, perfect minimum. Real neural-network loss surfaces, unfortunately, look more like a mountain range than a salad bowl.

Historical insight: the birth of loss functions

1805

Adrien-Marie Legendre (France) Published the method of least squares — the very first formal loss function in history — to fit astronomical data and correct observational errors in planetary orbit measurements. Carl Friedrich Gauss independently developed it by 1809. Minimising a squared-error loss was born from practical astronomy, not theoretical elegance.

The local minima problem & three solutions

Real neural networks do not have smooth, bowl-shaped loss surfaces. Their landscapes contain thousands of valleys, plateaus, and saddle points. Gradient descent risks getting trapped in a local minimum — a valley that looks like the bottom from inside, but is not the deepest valley globally.

Consequences of non-convex surfaces include: multiple local minima (many false valleys); slower convergence to the true global minimum; and the possibility of never finding the global minimum at all. The problem grows worse as the model gains more parameters wᵢ.

However, modern research reveals an important nuance: in very high-dimensional spaces (millions of parameters), most local minima are actually quite close in loss value to the global minimum. The real danger in practice turns out to be saddle points — where the gradient is zero but you are not at a minimum — and flat plateaus where the gradient is near zero and learning stalls completely.

Solution 1 — smart initialisation

Where you start (w₀) determines which valley you fall into. Three strategies help escape bad starts:

Avoid bad starts: initialise w with small random values drawn from a uniform or normal distribution centred around 0. Large initial values push the model into extreme regions of the loss surface.
Warm start: compute a rough approximate solution first (e.g. using a simpler model), then use that as the starting point for full gradient descent.
Multiple random restarts: run gradient descent several times from different random starting points and keep whichever run achieves the lowest final loss. Computationally expensive but effective.

Intuition for multiple restarts

Imagine dropping balls onto a bumpy landscape from many different positions. Some balls land near deep valleys and roll to the global bottom; others land on ridges and roll into shallow local traps. Multiple restarts increase the probability that at least one run finds the global minimum.

Solution 2 — adding momentum

Momentum adds a memory of the previous step’s direction to the current update — like a heavy rolling ball that carries its speed through small dips. With momentum coefficient γ (gamma), the update becomes:

Δw^t = −η × (∂J/∂w^t) + γ × Δw^t−1 w^t+1 = w^t + Δw^t

γ controls how much of the previous step direction is retained. A typical value is γ = 0.9, meaning 90% of the previous direction carries forward. Momentum helps in two ways: it accelerates descent along consistent directions, and it can roll through small local minima and cross flat plateaus that would trap standard gradient descent.

Physics analogy — bowling ball vs ping-pong ball

A ping-pong ball rolls down a bumpy slope and stops at every tiny dip. A bowling ball’s momentum carries it through minor obstacles and into deeper valleys. Gradient descent with momentum behaves like the bowling ball. This is also the direct computational analogue of Newton’s First Law of Motion — an object in motion stays in motion.

Solution 3 — regularisation

Regularisation modifies the loss function itself by adding a penalty term that discourages unnecessarily large weights. This smooths out the loss landscape, reducing the number and sharpness of local minima and making gradient descent more reliable.

L2 (Ridge): J'(w) = f(w) + λ‖w‖² ← discourages large weights (shrinks all) L1 (Lasso): J'(w) = f(w) + λ‖w‖ ← drives many weights to exactly zero

The parameter λ controls the strength of the penalty. L2 regularisation shrinks all weights proportionally — useful when all features contribute somewhat. L1 regularisation can force many weights to exactly zero, effectively performing automatic feature selection — particularly useful in text analysis where most words are irrelevant to a specific task.

Real-world analogy — a tax on complexity

Regularisation is like imposing a tax on model complexity. Just as a tax discourages excess consumption, a regularisation penalty discourages the model from using unnecessarily large parameters — preventing it from memorising noise in the training data instead of learning genuine patterns.

Technique	Mechanism	Typical setting	Best used when
Smart initialisation	Start from well-chosen or multiple random points	Normal(0, 0.01) or Xavier/He	Always — baseline good practice for any model
Momentum	Carry weighted direction memory across steps	γ = 0.9	Non-convex surfaces, plateaus, saddle points
L2 regularisation (Ridge)	Penalise sum of squared weights λ‖w‖²	λ = 0.001–0.01	Many features, all somewhat relevant; prevents huge weights
L1 regularisation (Lasso)	Penalise sum of absolute weights λ‖w‖	λ = 0.001–0.01	Sparse data; want automatic irrelevant feature removal

The gradient descent update rule

The gradient ∂J/∂w is the slope of the loss function at the current position in parameter space. It points in the direction of steepest increase. To reduce the loss we move in the opposite direction — downhill. This is precisely why we subtract the gradient, never add it.

w^t+1 = w^t − η × (∂J / ∂w) Where: w^t = current parameters (weights) ∂J/∂w = gradient: direction and steepness of slope (points uphill) η = learning rate: how big each step is − = we move downhill, opposite to the gradient

For a single-layer perceptron specifically, the gradient of the loss with respect to the weights simplifies to ∇J = −yᵢxᵢ, where yᵢ is the true label and xᵢ is the input vector.

A beautiful built-in property of this rule: it automatically slows down near the minimum. As w approaches the lowest point, the slope ∂J/∂w approaches zero — so each step shrinks naturally, without any extra logic. Near the bottom of the bowl, the algorithm takes small, precise steps.

When to stop?

Two practical stopping criteria exist in common use:

Threshold method

Stop when η × |∂J/∂w| < τ (tau). When the step size falls below threshold τ, the algorithm is essentially at the bottom and further updates produce negligible improvement.

Epoch limit

Stop after a fixed number of complete passes through the training dataset. Simpler to implement and widely used in practice — you set a budget and let training run.

1847

Augustin-Louis Cauchy (France) Published the method of steepest descent in Comptes rendus de l’Académie des Sciences. Cauchy was solving large systems of equations arising in astronomical calculations. His fundamental insight — always move in the direction of the negative gradient — remains the mathematical heart of every modern neural network training loop, unchanged across 177 years.

The learning rate — getting it right

The learning rate η is arguably the single most important hyperparameter you control when training a model. It determines the size of each downhill step. Get it wrong and the model either crawls to the answer or flies right past it.

Learning rate setting	What happens	Path shape	Practical consequence
Too small (e.g. η = 0.0001)	Tiny steps; correct direction but painfully slow convergence	Long, straight, gradual descent	May need thousands of extra epochs; wasteful computationally
Too large (e.g. η = 1.0)	Overshoots the minimum; oscillates back and forth across the valley	Zig-zag pattern across the bowl	Plateaus at a higher-than-optimal loss value; slow or no convergence
Very large (e.g. η = 10+)	Loss explodes toward infinity — the algorithm completely breaks	Loss curve shoots sharply upward	Training fails entirely; weights become NaN
“Goldilocks” — just right (η = 0.001–0.1)	Fast, smooth descent to a low loss value	Efficient curved path into the minimum	Good convergence in reasonable time; the ideal outcome

The practical tip from the slides: start with η = 0.01 as a default and adjust based on your training loss curve. A well-behaved curve falls quickly in early epochs and then levels off smoothly near a low value. A jagged, noisy, or rising curve signals a learning rate problem. The recommended starting range is 10⁻³ to 10⁻¹.

Application — image classification

When training ResNet-50 on the ImageNet dataset, practitioners typically start with η = 0.1, then reduce it by a factor of 10 at epochs 30, 60, and 90. This schedule is so reliable it has become a standard baseline cited in hundreds of research papers.

Learning rate schedules

The best-of-both-worlds strategy is to start training with a relatively high learning rate (to cover the loss landscape quickly and escape bad regions) and gradually reduce it as training progresses (for precision near the minimum). This is called a learning rate schedule.

Schedule	Formula	Behaviour	Typical application
Step decay	η_t = η₀ × 0.5^⌊t/k⌋	Halves the learning rate every k epochs; produces a staircase pattern in the rate curve	Classic deep learning, ResNet image classifiers, early-generation CNNs
Exponential decay	η_t = η₀ × e^(−λt)	Smooth continuous reduction; no sudden jumps	General-purpose training where smooth convergence matters
Cosine annealing	η_t = η_min + ½(η₀−η_min)(1+cos(πt/T))	Smoothly reduces following a cosine curve; can include periodic warm restarts to escape local minima	Transformer training, vision models, modern LLMs
Linear warmup + cosine decay	Low η → peak η (warmup) → cosine decay	Starts very small, rises to peak, then follows cosine downward for millions of steps	GPT, BERT, T5, and virtually all large language model training

Application — training GPT-style language models

Large language models use a warmup + cosine decay schedule. For the first few thousand steps the learning rate rises from near zero to its peak value (warmup phase — prevents instability when weights are far from any good solution). It then follows a cosine curve downward for millions of steps. This combination is used in GPT-3, GPT-4, Llama, and virtually every modern LLM.

Newton descent — computing the optimal step

Standard gradient descent requires you to guess (or schedule) a learning rate η. Newton’s method eliminates guessing by computing the mathematically optimal step size at every single point using the curvature of the loss function — not just its slope.

The algorithm works in three steps:

At the current position w, fit a quadratic curve (parabola) locally to the loss surface using both the first and second derivatives.
Jump directly to the minimum of that local parabola — this is the analytically exact optimal step.
Repeat until convergence.

w^t+1 = w^t − H⁻¹ × ∇J Where: ∇J = gradient vector (first derivatives) H = Hessian matrix (matrix of all second partial derivatives) H⁻¹= inverse of the Hessian — encodes curvature information

Newton descent converges in far fewer steps than gradient descent because it accounts for the curvature of the surface at every point — not merely the slope. The learning rate is automatically optimal at each step and requires no tuning. However, computing and inverting H is extremely expensive for large models.

Method	Derivative used	Steps to converge	Cost per step	Practical use
Standard gradient descent	First (gradient ∇J)	Many (hundreds to thousands)	Low — just compute gradients	All modern deep learning
Newton descent	Second (full Hessian H)	Very few (sometimes just a handful)	Extremely high — O(n³) for n params	Impractical for deep nets; used in small classical ML
Quasi-Newton (L-BFGS)	Approximates H⁻¹ without computing it	Moderate	Moderate — approximation is cheap	Classical ML, some fine-tuning; not suited for mini-batch
Adam (adaptive)	Approximates per-parameter curvature via gradient history	Moderate	Low — nearly as fast as gradient descent	Default for virtually all modern deep learning

Three variants of gradient descent

All three variants use the same core update equation. They differ only in how much data is used to compute the gradient before each weight update. This single design choice has profound consequences for speed, stability, memory usage, and GPU efficiency.

Variant 1 — stochastic gradient descent (SGD)

Rule: k = 1 (one sample per update). Weights are updated after every single training example. The data is shuffled at the start of each epoch so examples are seen in a different random order.

for each epoch: shuffle(training_data) for (X, y) in training_data: pred = predict(X, model) error = loss(y, pred) grad = differentiate(error) model = update(model, grad)

Pros of SGD

Very fast individual updates; built-in noise can help escape local minima and saddle points; works with streaming data or datasets too large to fit in memory; can start learning immediately.

Cons of SGD

Very noisy path through parameter space (zigzags a lot); hard to parallelise efficiently on GPUs (each update depends on the previous); may never converge to the exact minimum.

Variant 2 — mini-batch gradient descent

Rule: 1 < k < N (small batch per update). Weights are updated after processing a mini-batch of k examples. This is by far the most widely used approach in modern deep learning. Standard batch sizes: k = 32 to 256.

for each epoch: batches = split(training_data, k) for T_b in batches: error = 0 for (X, y) in T_b: error += loss(y, predict(X)) grad = differentiate(error) model = update(model, grad)

Pros of mini-batch

Highly GPU-efficient (vectorised operations process many examples simultaneously); smoother updates than SGD; excellent balance between computational speed and convergence stability; batch size k is tunable.

Cons of mini-batch

Batch size k is an additional hyperparameter requiring tuning; slightly more complex implementation than pure SGD; gradient estimate has some noise (less than SGD but more than batch).

Variant 3 — batch gradient descent

Rule: k = N (entire dataset per update). The gradient is computed over all N training examples before any weight update occurs. One update per epoch.

for each epoch: error = 0 for (X, y) in training_data: pred = predict(X, model) error += loss(y, pred) grad = differentiate(error) model = update(model, grad)

Pros of batch GD

Most stable, smoothest path through parameter space; mathematically guaranteed to converge for convex functions; gradient estimate is exact (no noise from sampling).

Cons of batch GD

Extremely slow for large datasets; the entire dataset must fit in memory simultaneously; only one weight update per epoch; computationally prohibitive for anything beyond small problems.

Practical note: Batch gradient descent is rarely used for large real-world datasets. It is useful for small problems (a few thousand examples) or as a theoretical baseline to understand convergence behaviour.

Comparing the three variants

The three variants make different trade-offs across seven dimensions that matter in practice. Visualising the path each method traces through parameter space (the θ₀, θ₁ space) makes their character immediately clear: batch descent moves in straight, deliberate steps; mini-batch moves in gentle arcs; SGD zigzags erratically but covers ground with great speed.

Dimension	SGD (k=1)	Mini-batch (k=32–256)	Batch (k=N)
Updates per epoch	N (one per example)	N/k (one per mini-batch)	1 (one per epoch)
Speed per step	Fastest — 1 example at a time	Moderate — k examples at a time	Slowest — all N examples each time
Path stability	Very noisy; erratic zigzag path	Moderately smooth	Very smooth; nearly straight
GPU utilisation	Poor — sequential; GPUs idle	Excellent — vectorised, GPUs fully used	Good but wasteful (large memory transfer)
Memory footprint	Minimal — 1 example in RAM	Small — 32–256 examples in RAM	Full dataset must fit in RAM
Escaping local minima	Good — noise helps exploration	Moderate — partial noise benefit	Poor — smooth path gets stuck
Convergence guarantee	No — noisy updates	Approximate	Yes — for convex functions
Typical use today	Online / streaming learning	Standard for all deep learning	Small classical ML; theoretical study

Application — the AlexNet breakthrough (2012)

Krizhevsky et al. used mini-batch SGD with momentum on two NVIDIA GPUs to train AlexNet on the ImageNet dataset (1.2 million images, 1,000 classes). Their model won the ImageNet Large Scale Visual Recognition Challenge by a margin so large it shocked the research community — achieving a top-5 error rate of 15.3% versus the runner-up’s 26.2%. The choice of mini-batch size, momentum, and learning rate schedule was central to their success. This moment is widely credited with launching the modern deep learning era.

Concepts beyond the slides added depth

Backpropagation — how gradients are computed inside deep networks

The slides explain what to do with the gradient once you have it. They do not explain how the gradient is computed inside a multi-layer neural network. That process is called backpropagation (or backprop). It applies the chain rule of calculus repeatedly, starting from the output layer (where the loss is measured) and working backward through every layer toward the input, computing how much each parameter contributed to the final error. Without backpropagation, gradient descent cannot be applied to deep networks at all.

1974–1986

Werbos (1974) / Rumelhart, Hinton & Williams (1986) Paul Werbos formally described backpropagation in his 1974 PhD thesis, largely unnoticed for over a decade. In 1986, Rumelhart, Hinton and Williams published a clear and accessible explanation in Nature, showing that backprop could train multi-layer networks to learn rich internal representations. This is widely considered one of the most important papers in the history of artificial intelligence.

Factory quality-control analogy

A factory produces a defective product. Backpropagation is the process of tracing backward through every stage of production to assign proportional blame: “Step 7 was responsible for 40% of the defect; step 3 for 35%…” Each stage then corrects itself in proportion to its share of responsibility. This is exactly how error signals (gradients) flow backward through every layer of a neural network.

Adaptive optimisers — Adam, AdaGrad, RMSProp

Standard gradient descent gives every parameter the same learning rate η. Adaptive optimisers automatically give each parameter its own learning rate, adjusted dynamically based on the history of that parameter’s gradient updates. Parameters updated rarely (e.g. rare words in a language model) receive larger steps; parameters updated constantly and consistently receive smaller, more careful steps.

Adam update — simplified: m = β₁ × m + (1−β₁) × ∇J [1st moment: running average of gradients] v = β₂ × v + (1−β₂) × (∇J)² [2nd moment: running average of squared gradients] m̂ = m / (1−β₁ᵗ) [bias correction] v̂ = v / (1−β₂ᵗ) [bias correction] w = w − η × m̂ / (√v̂ + ε) [parameter update] Typical values: β₁=0.9, β₂=0.999, ε=10⁻⁸, η=0.001

Optimiser	Year	Key innovation	Best suited for
SGD + Momentum	1960s–1986	Carry direction memory across gradient steps	Image classification with careful tuning; strong baseline
AdaGrad	2011	Per-parameter learning rate; accumulates all past squared gradients	Sparse data and NLP — rare features get large updates
RMSProp	2012 (Hinton, unpublished)	Like AdaGrad but discards old gradients via exponential decay; prevents rate from shrinking to zero	Recurrent networks, non-stationary problems, online learning
Adam	2014 (Kingma & Ba)	Combines momentum + RMSProp with bias correction for early training stability	Default for virtually all modern deep learning — LLMs, vision, speech
AdamW	2019 (Loshchilov & Hutter)	Adam with decoupled weight decay — fixes interaction between L2 regularisation and adaptive rates	Transformers, GPT, BERT — now preferred over Adam for LLMs

Overfitting and the validation loss

Gradient descent can sometimes optimise the training loss too well — the model memorises the training examples rather than learning general patterns. This is called overfitting. To detect it, practitioners always monitor a separate validation loss computed on data the model has never seen during training. When training loss continues falling but validation loss starts rising, overfitting has begun.

School exam analogy

A student who memorises every past exam paper word-for-word scores 100% on past papers (low training loss) but struggles on the actual exam with new questions (high validation loss). A student who genuinely understands the concepts performs well on both. Gradient descent must find the second type of student. Solutions: regularisation (L1/L2), dropout, data augmentation, and early stopping.

Vanishing and exploding gradients

In very deep networks, gradients computed by backpropagation can become either extremely small (vanishing — early layers receive near-zero gradient signals and barely learn) or extremely large (exploding — weights grow unstable and diverge). This was the primary barrier to training deep networks from the early 1990s through the mid-2000s, effectively halting deep learning research for nearly two decades. Solutions include: careful weight initialisation (Xavier / He initialisation), batch normalisation, gradient clipping, and residual connections (skip connections in ResNet-style architectures that allow gradients to flow directly through many layers unchanged).

Batch normalisation

Proposed by Ioffe and Szegedy in 2015, batch normalisation normalises the inputs to each layer during training, keeping them in a stable range. This dramatically smooths the loss landscape — imagine ironing out wrinkles in the terrain — allowing much higher learning rates and dramatically faster convergence. It also acts as a mild regulariser. Batch normalisation is now a standard component of virtually every modern deep neural network and is one of the most impactful techniques developed since backpropagation itself.

Historical timeline

1805

Legendre — Least Squares (France) First formal loss function invented to fit planetary orbit data. Direct mathematical ancestor of the MSE loss used in every regression model today. Carl Friedrich Gauss independently developed the same idea by 1809.

1847

Cauchy — Method of Steepest Descent (France) Published the algorithm now known as gradient descent in Comptes rendus de l’Académie des Sciences. Aimed at solving large linear systems in astronomy. The core idea — move in the direction of the negative gradient — is completely unchanged in modern deep learning.

1870s

Boltzmann & Maxwell — Energy Minimisation (Austria / UK) Statistical mechanics formalised the physics concept of systems naturally seeking minimum-energy states. A deep conceptual parallel to machine learning’s loss minimisation that influenced later thinking about learning as optimisation.

1952

Hestenes & Stiefel — Conjugate Gradient Method (USA) A smarter variant of steepest descent that avoids retracing the same directions, solving linear systems in far fewer steps. Still used in scientific computing and as the basis of some modern optimisation methods.

1960

Widrow & Hoff — LMS Algorithm (USA) The Least Mean Squares algorithm for adaptive electronic filters was effectively the first practical stochastic gradient descent. Weights updated one data sample at a time — the direct conceptual ancestor of SGD in neural networks.

1974

Werbos — Backpropagation (USA, PhD thesis) First formal mathematical derivation of backpropagation for neural networks. Largely unnoticed at the time — the neural network community was not ready for it.

1986

Rumelhart, Hinton & Williams — Backprop popularised (USA / UK) Published in Nature. Showed multi-layer networks could learn rich internal representations via gradient descent + backpropagation. Launched the first major wave of neural network research.

1998

LeCun et al. — LeNet-5 / MNIST (USA) Mini-batch SGD trained a convolutional neural network for handwritten digit recognition with remarkable accuracy. Proved deep learning was practically viable and established many conventions still in use today.

2011–2014

AdaGrad, RMSProp, Adam (USA) Adaptive learning rate methods emerged in rapid succession. Adam (2014, Kingma & Ba) combined momentum with per-parameter rate adaptation and became the default optimiser for deep learning almost immediately upon publication.

2012

Krizhevsky, Sutskever & Hinton — AlexNet (USA/Canada) Mini-batch SGD with momentum on two GPUs trained a deep convolutional network that won ImageNet by a stunning margin. Launched the modern deep learning era. Every major AI application today traces its lineage to this moment.

2015

Ioffe & Szegedy — Batch Normalisation (Google) Smoothed the loss landscape, enabling much deeper networks and dramatically higher learning rates. Became standard within months of publication and is now in virtually every deep neural network.

2017

Vaswani et al. — Transformer architecture (Google) The architecture behind GPT, BERT, Claude, and all modern large language models. Trained with Adam + learning rate warmup + cosine decay — ideas built on decades of gradient descent research.

2020s

Large Language Models — GPT, Gemini, Claude, Llama (global) Hundred-billion-parameter models trained with mini-batch gradient descent, backpropagation, AdamW, and cosine annealing across thousands of specialised GPUs for months at a time. The culmination of 177 years of gradient descent research.

Why all of this matters: the road to modern AI

It is worth pausing to appreciate what gradient descent actually represents: a 177-year-old idea from observational astronomy that turned out to be the secret ingredient of the artificial intelligence revolution.

For most of the twentieth century, the algorithm sat quietly in mathematics textbooks. The missing pieces were three things that arrived together around 2010: massive datasets, GPUs capable of the required computation, and the insight — already formalised by Rumelhart, Hinton and Williams in 1986 — that gradient descent could be applied through dozens of layers using backpropagation. When those three pieces converged, the results were rapid and dramatic.

Image recognition surpassed human accuracy. Speech recognition became reliable enough for consumer products. Language models learned to write, translate, summarise, and reason. Protein structure prediction — a problem that had stumped biology for 50 years — was solved by a neural network. All of it traces back to the same fundamental loop: compute the loss, compute the gradient, take a small step downhill, repeat.

Every time you use a voice assistant, receive a medical scan analysis, navigate using a maps application, see a content recommendation, or interact with an AI chatbot — gradient descent made it possible. The model behind each of those systems learned by doing exactly what these slides describe: measuring loss, computing gradients, and taking millions of small steps downhill.

The journey from Cauchy’s 1847 paper to today’s large language models is a straight, unbroken line through the ideas in this presentation: loss functions, gradients, learning rates, momentum, regularisation, mini-batch updates, backpropagation, and adaptive optimisers. Understanding gradient descent is not just understanding one algorithm — it is understanding the central computational idea of the modern world.

“A blindfolded person feeling for the lowest point of a hilly landscape — that simple image captures the algorithm that taught machines to learn, and in doing so, changed what machines can do for humanity.”

Recommended next steps

Resource	What you will gain	Level
3Blue1Brown — “Neural Networks” series (YouTube)	Beautiful, intuitive visual explanations of gradient descent, backpropagation, and neural networks — no prior knowledge needed	Beginner
Khan Academy — Multivariable Calculus	The mathematical foundation: partial derivatives, gradients, and the chain rule	Beginner
fast.ai — Practical Deep Learning for Coders (free)	Hands-on coding of real models using gradient descent and PyTorch	Intermediate
Stanford CS231n — lecture notes (free online)	Rigorous and complete treatment of CNNs, loss functions, and optimisation	Advanced
Kingma & Ba (2014) — “Adam: A Method for Stochastic Optimization”	The original Adam paper — clearly written and surprisingly accessible	Advanced
Rumelhart, Hinton & Williams (1986) — “Learning representations by back-propagating errors” (Nature)	The landmark backpropagation paper — the foundation of all modern deep learning	Advanced

Try it yourself: implement gradient descent on J(w) = w² in Python and watch w converge to 0 in just 10 lines of code. Then try J(w) = w⁴ − 3w² + 2 and observe what local minima do to your algorithm.