Chapter 14 – Gradient Descent Algorithm
Gradient Descent:
How Machines Learn to Get Better
- What does it mean for a machine to learn?
- Loss functions — measuring how wrong you are
- Properties of good loss functions & convexity
- The local minima problem & three solutions
- The gradient descent update rule
- The learning rate — getting it right
- Learning rate schedules
- Newton descent — optimal steps
- Three variants of gradient descent
- Comparing the three variants
- Concepts beyond the slides: backpropagation, Adam & more
- Historical timeline
- Why all of this matters: the road to modern AI
What does it mean for a machine to learn?
Think of the last time you practised a skill — shooting a basketball, learning a new recipe, or solving a type of maths problem you had never seen before. You tried something, noticed how far off you were, and adjusted. With every attempt your brain moved toward a better version of the action. That cycle — try, measure, adjust, repeat — is learning.
Machines follow exactly the same logic. A model begins with a random guess at how to perform a task. It measures how wrong that guess is, figures out which direction would make things better, takes a small step in that direction, and repeats until it gets good enough. That process is called gradient descent — the single most important optimisation algorithm in all of machine learning.
The four big questions that drive this entire topic are: What does it mean for a machine to learn? How does it know if it is doing well or badly? How does it improve step by step? And what can go wrong, and how do we fix it? These slides answer each in turn.
An app tries to predict the monthly rent of a flat in Bengaluru. On its first attempt it guesses ₹18,000; the actual rent is ₹25,000. The squared error = (25,000 − 18,000)² = 49,000,000. Gradient descent then nudges the app’s internal numbers a little so the next guess is closer. After thousands of such nudges the app becomes quite accurate.
Loss functions — measuring how wrong you are
Before a machine can improve, it needs a score that tells it how badly it is performing. That score is called a loss function (also called a cost function), written J(w). The smaller J(w), the better the model is performing. The machine adjusts its parameters w to minimise J(w).
Consider the house rental example: the model predicts ₹300k; the real price is ₹350k. The error = 350 − 300 = 50 (in thousands). Using squared-error loss, J = (error)² = 2,500. Squaring has two purposes: it makes all errors positive (over-predictions and under-predictions don’t cancel each other out), and it penalises large errors far more harshly than small ones — which matches our intuition that being wildly wrong is much worse than being slightly wrong.
The machine’s entire job is to find the value of w that sits at the lowest point of the J(w) curve. Everything in gradient descent is a systematic strategy for finding that lowest point.
| Application domain | Typical task | Common loss function | Why it fits |
|---|---|---|---|
| Real-estate / finance | Predict rent, stock price, exam score | Mean Squared Error (MSE) | Penalises large deviations heavily; convex and easy to optimise |
| Email / content moderation | Spam detection, sentiment classification | Cross-Entropy Loss | Handles probabilistic class outputs naturally |
| Medical imaging | Disease classification (rare conditions) | Focal Loss | Handles severe class imbalance; focuses on hard examples |
| Image generation / GANs | Style transfer, super-resolution | Perceptual Loss | Compares high-level features, not raw pixels — matches human perception |
| Self-driving vehicles | Object detection bounding boxes | Huber Loss | Robust to outliers (occasional extreme errors in noisy sensor data) |
| Recommendation systems | Rank products or content for users | Pairwise Ranking Loss | Directly optimises relative ordering rather than absolute scores |
Properties of good loss functions & convexity
Not every loss function is equally useful. A good loss function must possess three key properties:
The gold standard: convex loss functions
A convex function is shaped like a bowl. Draw a straight line (called a chord) connecting any two points on the curve — that chord will lie entirely above the curve. This geometric property guarantees that only one valley exists. Gradient descent started from any point on a convex function is mathematically guaranteed to find the global minimum, regardless of starting position. Linear regression using MSE loss is convex — which is why it always converges cleanly.
Place a marble anywhere inside a smooth, round salad bowl. No matter where you put it, it always rolls to the very centre. Convex loss functions behave identically — gradient descent always reaches the single, perfect minimum. Real neural-network loss surfaces, unfortunately, look more like a mountain range than a salad bowl.
Historical insight: the birth of loss functions
The local minima problem & three solutions
Real neural networks do not have smooth, bowl-shaped loss surfaces. Their landscapes contain thousands of valleys, plateaus, and saddle points. Gradient descent risks getting trapped in a local minimum — a valley that looks like the bottom from inside, but is not the deepest valley globally.
Consequences of non-convex surfaces include: multiple local minima (many false valleys); slower convergence to the true global minimum; and the possibility of never finding the global minimum at all. The problem grows worse as the model gains more parameters wᵢ.
However, modern research reveals an important nuance: in very high-dimensional spaces (millions of parameters), most local minima are actually quite close in loss value to the global minimum. The real danger in practice turns out to be saddle points — where the gradient is zero but you are not at a minimum — and flat plateaus where the gradient is near zero and learning stalls completely.
Solution 1 — smart initialisation
Where you start (w₀) determines which valley you fall into. Three strategies help escape bad starts:
- Avoid bad starts: initialise w with small random values drawn from a uniform or normal distribution centred around 0. Large initial values push the model into extreme regions of the loss surface.
- Warm start: compute a rough approximate solution first (e.g. using a simpler model), then use that as the starting point for full gradient descent.
- Multiple random restarts: run gradient descent several times from different random starting points and keep whichever run achieves the lowest final loss. Computationally expensive but effective.
Imagine dropping balls onto a bumpy landscape from many different positions. Some balls land near deep valleys and roll to the global bottom; others land on ridges and roll into shallow local traps. Multiple restarts increase the probability that at least one run finds the global minimum.
Solution 2 — adding momentum
Momentum adds a memory of the previous step’s direction to the current update — like a heavy rolling ball that carries its speed through small dips. With momentum coefficient γ (gamma), the update becomes:
γ controls how much of the previous step direction is retained. A typical value is γ = 0.9, meaning 90% of the previous direction carries forward. Momentum helps in two ways: it accelerates descent along consistent directions, and it can roll through small local minima and cross flat plateaus that would trap standard gradient descent.
A ping-pong ball rolls down a bumpy slope and stops at every tiny dip. A bowling ball’s momentum carries it through minor obstacles and into deeper valleys. Gradient descent with momentum behaves like the bowling ball. This is also the direct computational analogue of Newton’s First Law of Motion — an object in motion stays in motion.
Solution 3 — regularisation
Regularisation modifies the loss function itself by adding a penalty term that discourages unnecessarily large weights. This smooths out the loss landscape, reducing the number and sharpness of local minima and making gradient descent more reliable.
The parameter λ controls the strength of the penalty. L2 regularisation shrinks all weights proportionally — useful when all features contribute somewhat. L1 regularisation can force many weights to exactly zero, effectively performing automatic feature selection — particularly useful in text analysis where most words are irrelevant to a specific task.
Regularisation is like imposing a tax on model complexity. Just as a tax discourages excess consumption, a regularisation penalty discourages the model from using unnecessarily large parameters — preventing it from memorising noise in the training data instead of learning genuine patterns.
| Technique | Mechanism | Typical setting | Best used when |
|---|---|---|---|
| Smart initialisation | Start from well-chosen or multiple random points | Normal(0, 0.01) or Xavier/He | Always — baseline good practice for any model |
| Momentum | Carry weighted direction memory across steps | γ = 0.9 | Non-convex surfaces, plateaus, saddle points |
| L2 regularisation (Ridge) | Penalise sum of squared weights λ‖w‖² | λ = 0.001–0.01 | Many features, all somewhat relevant; prevents huge weights |
| L1 regularisation (Lasso) | Penalise sum of absolute weights λ‖w‖ | λ = 0.001–0.01 | Sparse data; want automatic irrelevant feature removal |
The gradient descent update rule
The gradient ∂J/∂w is the slope of the loss function at the current position in parameter space. It points in the direction of steepest increase. To reduce the loss we move in the opposite direction — downhill. This is precisely why we subtract the gradient, never add it.
For a single-layer perceptron specifically, the gradient of the loss with respect to the weights simplifies to ∇J = −yᵢxᵢ, where yᵢ is the true label and xᵢ is the input vector.
A beautiful built-in property of this rule: it automatically slows down near the minimum. As w approaches the lowest point, the slope ∂J/∂w approaches zero — so each step shrinks naturally, without any extra logic. Near the bottom of the bowl, the algorithm takes small, precise steps.
When to stop?
Two practical stopping criteria exist in common use:
Stop when η × |∂J/∂w| < τ (tau). When the step size falls below threshold τ, the algorithm is essentially at the bottom and further updates produce negligible improvement.
Stop after a fixed number of complete passes through the training dataset. Simpler to implement and widely used in practice — you set a budget and let training run.
The learning rate — getting it right
The learning rate η is arguably the single most important hyperparameter you control when training a model. It determines the size of each downhill step. Get it wrong and the model either crawls to the answer or flies right past it.
| Learning rate setting | What happens | Path shape | Practical consequence |
|---|---|---|---|
| Too small (e.g. η = 0.0001) | Tiny steps; correct direction but painfully slow convergence | Long, straight, gradual descent | May need thousands of extra epochs; wasteful computationally |
| Too large (e.g. η = 1.0) | Overshoots the minimum; oscillates back and forth across the valley | Zig-zag pattern across the bowl | Plateaus at a higher-than-optimal loss value; slow or no convergence |
| Very large (e.g. η = 10+) | Loss explodes toward infinity — the algorithm completely breaks | Loss curve shoots sharply upward | Training fails entirely; weights become NaN |
| “Goldilocks” — just right (η = 0.001–0.1) | Fast, smooth descent to a low loss value | Efficient curved path into the minimum | Good convergence in reasonable time; the ideal outcome |
The practical tip from the slides: start with η = 0.01 as a default and adjust based on your training loss curve. A well-behaved curve falls quickly in early epochs and then levels off smoothly near a low value. A jagged, noisy, or rising curve signals a learning rate problem. The recommended starting range is 10⁻³ to 10⁻¹.
When training ResNet-50 on the ImageNet dataset, practitioners typically start with η = 0.1, then reduce it by a factor of 10 at epochs 30, 60, and 90. This schedule is so reliable it has become a standard baseline cited in hundreds of research papers.
Learning rate schedules
The best-of-both-worlds strategy is to start training with a relatively high learning rate (to cover the loss landscape quickly and escape bad regions) and gradually reduce it as training progresses (for precision near the minimum). This is called a learning rate schedule.
| Schedule | Formula | Behaviour | Typical application |
|---|---|---|---|
| Step decay | ηt = η₀ × 0.5⌊t/k⌋ | Halves the learning rate every k epochs; produces a staircase pattern in the rate curve | Classic deep learning, ResNet image classifiers, early-generation CNNs |
| Exponential decay | ηt = η₀ × e(−λt) | Smooth continuous reduction; no sudden jumps | General-purpose training where smooth convergence matters |
| Cosine annealing | ηt = η_min + ½(η₀−η_min)(1+cos(πt/T)) | Smoothly reduces following a cosine curve; can include periodic warm restarts to escape local minima | Transformer training, vision models, modern LLMs |
| Linear warmup + cosine decay | Low η → peak η (warmup) → cosine decay | Starts very small, rises to peak, then follows cosine downward for millions of steps | GPT, BERT, T5, and virtually all large language model training |
Large language models use a warmup + cosine decay schedule. For the first few thousand steps the learning rate rises from near zero to its peak value (warmup phase — prevents instability when weights are far from any good solution). It then follows a cosine curve downward for millions of steps. This combination is used in GPT-3, GPT-4, Llama, and virtually every modern LLM.
Newton descent — computing the optimal step
Standard gradient descent requires you to guess (or schedule) a learning rate η. Newton’s method eliminates guessing by computing the mathematically optimal step size at every single point using the curvature of the loss function — not just its slope.
The algorithm works in three steps:
- At the current position w, fit a quadratic curve (parabola) locally to the loss surface using both the first and second derivatives.
- Jump directly to the minimum of that local parabola — this is the analytically exact optimal step.
- Repeat until convergence.
Newton descent converges in far fewer steps than gradient descent because it accounts for the curvature of the surface at every point — not merely the slope. The learning rate is automatically optimal at each step and requires no tuning. However, computing and inverting H is extremely expensive for large models.
| Method | Derivative used | Steps to converge | Cost per step | Practical use |
|---|---|---|---|---|
| Standard gradient descent | First (gradient ∇J) | Many (hundreds to thousands) | Low — just compute gradients | All modern deep learning |
| Newton descent | Second (full Hessian H) | Very few (sometimes just a handful) | Extremely high — O(n³) for n params | Impractical for deep nets; used in small classical ML |
| Quasi-Newton (L-BFGS) | Approximates H⁻¹ without computing it | Moderate | Moderate — approximation is cheap | Classical ML, some fine-tuning; not suited for mini-batch |
| Adam (adaptive) | Approximates per-parameter curvature via gradient history | Moderate | Low — nearly as fast as gradient descent | Default for virtually all modern deep learning |
Three variants of gradient descent
All three variants use the same core update equation. They differ only in how much data is used to compute the gradient before each weight update. This single design choice has profound consequences for speed, stability, memory usage, and GPU efficiency.
Variant 1 — stochastic gradient descent (SGD)
Rule: k = 1 (one sample per update). Weights are updated after every single training example. The data is shuffled at the start of each epoch so examples are seen in a different random order.
Very fast individual updates; built-in noise can help escape local minima and saddle points; works with streaming data or datasets too large to fit in memory; can start learning immediately.
Very noisy path through parameter space (zigzags a lot); hard to parallelise efficiently on GPUs (each update depends on the previous); may never converge to the exact minimum.
Variant 2 — mini-batch gradient descent
Rule: 1 < k < N (small batch per update). Weights are updated after processing a mini-batch of k examples. This is by far the most widely used approach in modern deep learning. Standard batch sizes: k = 32 to 256.
Highly GPU-efficient (vectorised operations process many examples simultaneously); smoother updates than SGD; excellent balance between computational speed and convergence stability; batch size k is tunable.
Batch size k is an additional hyperparameter requiring tuning; slightly more complex implementation than pure SGD; gradient estimate has some noise (less than SGD but more than batch).
Variant 3 — batch gradient descent
Rule: k = N (entire dataset per update). The gradient is computed over all N training examples before any weight update occurs. One update per epoch.
Most stable, smoothest path through parameter space; mathematically guaranteed to converge for convex functions; gradient estimate is exact (no noise from sampling).
Extremely slow for large datasets; the entire dataset must fit in memory simultaneously; only one weight update per epoch; computationally prohibitive for anything beyond small problems.
Practical note: Batch gradient descent is rarely used for large real-world datasets. It is useful for small problems (a few thousand examples) or as a theoretical baseline to understand convergence behaviour.
Comparing the three variants
The three variants make different trade-offs across seven dimensions that matter in practice. Visualising the path each method traces through parameter space (the θ₀, θ₁ space) makes their character immediately clear: batch descent moves in straight, deliberate steps; mini-batch moves in gentle arcs; SGD zigzags erratically but covers ground with great speed.
| Dimension | SGD (k=1) | Mini-batch (k=32–256) | Batch (k=N) |
|---|---|---|---|
| Updates per epoch | N (one per example) | N/k (one per mini-batch) | 1 (one per epoch) |
| Speed per step | Fastest — 1 example at a time | Moderate — k examples at a time | Slowest — all N examples each time |
| Path stability | Very noisy; erratic zigzag path | Moderately smooth | Very smooth; nearly straight |
| GPU utilisation | Poor — sequential; GPUs idle | Excellent — vectorised, GPUs fully used | Good but wasteful (large memory transfer) |
| Memory footprint | Minimal — 1 example in RAM | Small — 32–256 examples in RAM | Full dataset must fit in RAM |
| Escaping local minima | Good — noise helps exploration | Moderate — partial noise benefit | Poor — smooth path gets stuck |
| Convergence guarantee | No — noisy updates | Approximate | Yes — for convex functions |
| Typical use today | Online / streaming learning | Standard for all deep learning | Small classical ML; theoretical study |
Krizhevsky et al. used mini-batch SGD with momentum on two NVIDIA GPUs to train AlexNet on the ImageNet dataset (1.2 million images, 1,000 classes). Their model won the ImageNet Large Scale Visual Recognition Challenge by a margin so large it shocked the research community — achieving a top-5 error rate of 15.3% versus the runner-up’s 26.2%. The choice of mini-batch size, momentum, and learning rate schedule was central to their success. This moment is widely credited with launching the modern deep learning era.
Concepts beyond the slides added depth
Backpropagation — how gradients are computed inside deep networks
The slides explain what to do with the gradient once you have it. They do not explain how the gradient is computed inside a multi-layer neural network. That process is called backpropagation (or backprop). It applies the chain rule of calculus repeatedly, starting from the output layer (where the loss is measured) and working backward through every layer toward the input, computing how much each parameter contributed to the final error. Without backpropagation, gradient descent cannot be applied to deep networks at all.
A factory produces a defective product. Backpropagation is the process of tracing backward through every stage of production to assign proportional blame: “Step 7 was responsible for 40% of the defect; step 3 for 35%…” Each stage then corrects itself in proportion to its share of responsibility. This is exactly how error signals (gradients) flow backward through every layer of a neural network.
Adaptive optimisers — Adam, AdaGrad, RMSProp
Standard gradient descent gives every parameter the same learning rate η. Adaptive optimisers automatically give each parameter its own learning rate, adjusted dynamically based on the history of that parameter’s gradient updates. Parameters updated rarely (e.g. rare words in a language model) receive larger steps; parameters updated constantly and consistently receive smaller, more careful steps.
| Optimiser | Year | Key innovation | Best suited for |
|---|---|---|---|
| SGD + Momentum | 1960s–1986 | Carry direction memory across gradient steps | Image classification with careful tuning; strong baseline |
| AdaGrad | 2011 | Per-parameter learning rate; accumulates all past squared gradients | Sparse data and NLP — rare features get large updates |
| RMSProp | 2012 (Hinton, unpublished) | Like AdaGrad but discards old gradients via exponential decay; prevents rate from shrinking to zero | Recurrent networks, non-stationary problems, online learning |
| Adam | 2014 (Kingma & Ba) | Combines momentum + RMSProp with bias correction for early training stability | Default for virtually all modern deep learning — LLMs, vision, speech |
| AdamW | 2019 (Loshchilov & Hutter) | Adam with decoupled weight decay — fixes interaction between L2 regularisation and adaptive rates | Transformers, GPT, BERT — now preferred over Adam for LLMs |
Overfitting and the validation loss
Gradient descent can sometimes optimise the training loss too well — the model memorises the training examples rather than learning general patterns. This is called overfitting. To detect it, practitioners always monitor a separate validation loss computed on data the model has never seen during training. When training loss continues falling but validation loss starts rising, overfitting has begun.
A student who memorises every past exam paper word-for-word scores 100% on past papers (low training loss) but struggles on the actual exam with new questions (high validation loss). A student who genuinely understands the concepts performs well on both. Gradient descent must find the second type of student. Solutions: regularisation (L1/L2), dropout, data augmentation, and early stopping.
Vanishing and exploding gradients
In very deep networks, gradients computed by backpropagation can become either extremely small (vanishing — early layers receive near-zero gradient signals and barely learn) or extremely large (exploding — weights grow unstable and diverge). This was the primary barrier to training deep networks from the early 1990s through the mid-2000s, effectively halting deep learning research for nearly two decades. Solutions include: careful weight initialisation (Xavier / He initialisation), batch normalisation, gradient clipping, and residual connections (skip connections in ResNet-style architectures that allow gradients to flow directly through many layers unchanged).
Batch normalisation
Proposed by Ioffe and Szegedy in 2015, batch normalisation normalises the inputs to each layer during training, keeping them in a stable range. This dramatically smooths the loss landscape — imagine ironing out wrinkles in the terrain — allowing much higher learning rates and dramatically faster convergence. It also acts as a mild regulariser. Batch normalisation is now a standard component of virtually every modern deep neural network and is one of the most impactful techniques developed since backpropagation itself.
Historical timeline
Why all of this matters: the road to modern AI
It is worth pausing to appreciate what gradient descent actually represents: a 177-year-old idea from observational astronomy that turned out to be the secret ingredient of the artificial intelligence revolution.
For most of the twentieth century, the algorithm sat quietly in mathematics textbooks. The missing pieces were three things that arrived together around 2010: massive datasets, GPUs capable of the required computation, and the insight — already formalised by Rumelhart, Hinton and Williams in 1986 — that gradient descent could be applied through dozens of layers using backpropagation. When those three pieces converged, the results were rapid and dramatic.
Image recognition surpassed human accuracy. Speech recognition became reliable enough for consumer products. Language models learned to write, translate, summarise, and reason. Protein structure prediction — a problem that had stumped biology for 50 years — was solved by a neural network. All of it traces back to the same fundamental loop: compute the loss, compute the gradient, take a small step downhill, repeat.
Every time you use a voice assistant, receive a medical scan analysis, navigate using a maps application, see a content recommendation, or interact with an AI chatbot — gradient descent made it possible. The model behind each of those systems learned by doing exactly what these slides describe: measuring loss, computing gradients, and taking millions of small steps downhill.
The journey from Cauchy’s 1847 paper to today’s large language models is a straight, unbroken line through the ideas in this presentation: loss functions, gradients, learning rates, momentum, regularisation, mini-batch updates, backpropagation, and adaptive optimisers. Understanding gradient descent is not just understanding one algorithm — it is understanding the central computational idea of the modern world.
Recommended next steps
| Resource | What you will gain | Level |
|---|---|---|
| 3Blue1Brown — “Neural Networks” series (YouTube) | Beautiful, intuitive visual explanations of gradient descent, backpropagation, and neural networks — no prior knowledge needed | Beginner |
| Khan Academy — Multivariable Calculus | The mathematical foundation: partial derivatives, gradients, and the chain rule | Beginner |
| fast.ai — Practical Deep Learning for Coders (free) | Hands-on coding of real models using gradient descent and PyTorch | Intermediate |
| Stanford CS231n — lecture notes (free online) | Rigorous and complete treatment of CNNs, loss functions, and optimisation | Advanced |
| Kingma & Ba (2014) — “Adam: A Method for Stochastic Optimization” | The original Adam paper — clearly written and surprisingly accessible | Advanced |
| Rumelhart, Hinton & Williams (1986) — “Learning representations by back-propagating errors” (Nature) | The landmark backpropagation paper — the foundation of all modern deep learning | Advanced |
Try it yourself: implement gradient descent on J(w) = w² in Python and watch w converge to 0 in just 10 lines of code. Then try J(w) = w⁴ − 3w² + 2 and observe what local minima do to your algorithm.