Chapter 11
Student’s Guide · Machine Learning
How Do We Know If a Machine Learning Model Is Good?
A plain-English walkthrough of performance metrics — from cross-validation to the ROC (Receiver Operating Characteristic) curve
The Big Question
Imagine you train a dog to fetch sticks. The dog gets very good in your backyard — but what happens in an unfamiliar park? Machine learning (ML) models face exactly the same problem: they can memorise what they have seen, but the real test is how they do on new, unseen data.
Performance metrics are the measuring tools we use to answer: “Is the model actually learning, or just memorising?” Choosing the right metric matters enormously — measuring the wrong thing can give a false sense of success and lead to real harm when the model is deployed.
Splitting the Data
Before training, we divide our labelled data (data where we already know the right answers) into three buckets:
- Training set: the data the model learns from.
- Validation set: used while developing the model, to tune settings and compare candidates.
- Test set: locked away and used only once, at the very end, to estimate performance on new data.
Fig 1. The three-way split. The test set is locked away and only used once — at the very end.
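Here is a minimal sketch of that split, using scikit-learn's train_test_split. The dataset is a synthetic placeholder generated with make_classification, and the 60/20/20 proportions are just one common choice, not a rule.

```python
# A minimal sketch of the three-way split with scikit-learn.
# The dataset is synthetic; the 60/20/20 proportions are one common choice.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)

# Hold out 20% as the test set (locked away until the very end).
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# Split the remaining 80% into 60% training and 20% validation.
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)
```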
Cross-Validation — Supercharging the Validation Set
A single random split can be lucky or unlucky. What if all the easy examples end up in the test set? k-fold cross-validation (CV), where k is simply a chosen number of folds (typically 5 or 10), fixes this by rotating which chunk of data is held out for evaluation.
Fig 2. 5-fold cross-validation. The orange block rotates as the test set each round. The final error is the average: μₑ = (e₁+e₂+e₃+e₄+e₅) ÷ 5.
After all rounds are done, we average the five error scores. The average (μₑ) is a much fairer estimate than any single split. We also look at how spread out the scores are — the standard deviation (σₑ) — to see if the model is consistent.
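As a concrete sketch, scikit-learn's cross_val_score does the rotation and averaging for us. The logistic-regression model and the synthetic dataset below are placeholders for illustration, not part of the worked example.

```python
# Sketch: 5-fold cross-validation with scikit-learn.
# The model and the synthetic dataset are placeholders for illustration.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
model = LogisticRegression(max_iter=1000)

# cv=5 rotates the held-out fold five times, as in Fig 2.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"mean accuracy: {scores.mean():.3f}, std dev: {scores.std():.3f}")
```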
The Confusion Matrix — Breaking Down Errors
When a model makes a classification decision (e.g. “is this email spam or not?”), there are four possible outcomes:
| | Model says: No | Model says: Yes |
|---|---|---|
| Reality: No | ✔ True Negative (TN): correctly said “no” | ✘ False Positive (FP): false alarm |
| Reality: Yes | ✘ False Negative (FN): missed it | ✔ True Positive (TP): correctly said “yes” |
Fig 3. The confusion matrix. Green = correct; red = error.
Real example from the presentation
A disease classifier was tested on 150 patients (66 healthy, 84 sick):
| N = 150 | Predicted: healthy | Predicted: sick | Total |
|---|---|---|---|
| Actually healthy | TN = 61 | FP = 5 | 66 |
| Actually sick | FN = 7 | TP = 77 | 84 |
| Total | 68 | 82 | 150 |
Fig 4. Worked example: 12 errors out of 150 patients.
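If you want to reproduce those counts, scikit-learn's confusion_matrix does the bookkeeping. The label arrays below are constructed to match Fig 4, not taken from real patient records.

```python
# Reconstructing the worked example's confusion matrix with scikit-learn.
# The label arrays are built to match the counts in Fig 4, not real patient data.
import numpy as np
from sklearn.metrics import confusion_matrix

# 0 = healthy, 1 = sick
y_true = np.array([0] * 66 + [1] * 84)
y_pred = np.array([0] * 61 + [1] * 5       # healthy patients: 61 TN, 5 FP
                  + [0] * 7 + [1] * 77)    # sick patients: 7 FN, 77 TP

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 61 5 7 77
```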
Key Metrics — What Does Each Number Mean?
From the confusion matrix, we compute a family of scores. Each one highlights a different aspect of the model’s behaviour.
Accuracy
How often is the model right overall?
(TP + TN) ÷ total = (77+61)÷150 = 92%
Precision
Of everyone the model said “yes” to, how many actually were?
TP ÷ (TP+FP) = 77÷82 = 93.9%
Recall (True Positive Rate)
Of everyone who actually was sick, how many did the model catch?
TP ÷ (TP+FN) = 77÷84 = 91.7%
False Positive Rate
Of healthy people, how many were wrongly flagged as sick?
FP ÷ (FP+TN) = 5÷66 = 7.6%
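All four numbers can be checked with a few lines of arithmetic; the counts are taken straight from the worked example in Fig 4.

```python
# Checking the four scores from the worked example (counts from Fig 4).
tp, tn, fp, fn = 77, 61, 5, 7

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # 138/150 = 0.920
precision = tp / (tp + fp)                    # 77/82  ≈ 0.939
recall    = tp / (tp + fn)                    # 77/84  ≈ 0.917 (true positive rate)
fpr       = fp / (fp + tn)                    # 5/66   ≈ 0.076 (false positive rate)

print(f"accuracy {accuracy:.1%}, precision {precision:.1%}, "
      f"recall {recall:.1%}, FPR {fpr:.1%}")
```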
Which metric should you care about?
It depends entirely on what kind of mistake is more dangerous:
Missing sick patients (FN)
Far worse for a terminal disease. You want very high recall — catch every case, even if you raise a few false alarms.
False alarms (FP)
Worse for a spam filter or a security door. You want high precision — don’t block a legitimate email or lock out a genuine user.
The F1 Score — One Number to Summarise Both
Precision and recall often pull in opposite directions: being stricter raises precision but lowers recall. The F1 score blends them into a single balanced number using the harmonic mean (a type of average that punishes extreme values).
For our example: F1 = 2 × (0.939 × 0.917) ÷ (0.939 + 0.917) ≈ 0.928
The generalised version, called the Fβ score, lets you tilt the balance. Setting β = 2 weights recall twice as heavily — ideal for disease screening where missing a sick patient is catastrophic.
Fig 5. Precision, Recall, and F1 score for the worked example. F1 sits between the two.
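A small helper makes the F1 and Fβ formulas concrete. The precision and recall values are the ones computed above; f_beta is just an illustrative name, not a library function.

```python
# F1 and the general F-beta score; f_beta is an illustrative helper, not a library call.
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta > 1 tilts towards recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.939, 0.917                        # from the worked example
print(f"F1 = {f_beta(p, r):.3f}")          # ≈ 0.928, as in the text
print(f"F2 = {f_beta(p, r, beta=2):.3f}")  # beta = 2 weights recall more heavily
```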
The ROC Curve — A Picture of Trade-offs
Most ML models do not output a definite “yes” or “no”. They output a probability — like “80% chance this is spam”. We choose a threshold: above it, we say “yes”; below it, “no”.
As we lower the threshold (become more lenient), we catch more true positives — but also raise more false alarms. The Receiver Operating Characteristic (ROC) curve plots True Positive Rate (TPR) against False Positive Rate (FPR) as the threshold slides from strict to lenient.
Fig 6. ROC curves. A perfect model reaches the top-left corner (TPR=1, FPR=0). A random guess follows the diagonal. The area under the curve (AUROC — Area Under the Receiver Operating Characteristic curve) summarises performance: 1.0 = perfect, 0.5 = random.
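The curve is straightforward to compute with scikit-learn's roc_curve and roc_auc_score. The model and dataset below are synthetic placeholders.

```python
# Sketch: ROC curve points and AUROC with scikit-learn, on a synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]        # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, probs)  # one (FPR, TPR) point per threshold
print(f"AUROC = {roc_auc_score(y_test, probs):.3f}")  # 1.0 = perfect, 0.5 = random
```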
The Precision–Recall curve
For very imbalanced problems (e.g. fraud detection, where 99.9% of transactions are legitimate), the ROC curve can look deceptively good. The AUPRC (Area Under the Precision–Recall Curve) is a stricter and more honest measure in those cases.
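The same pattern works for the precision–recall curve. This sketch reuses y_test and probs from the ROC example above; average_precision_score is scikit-learn's usual estimate of the area under that curve.

```python
# Sketch: precision-recall curve and AUPRC, reusing y_test and probs from above.
from sklearn.metrics import precision_recall_curve, average_precision_score

precision, recall, thresholds = precision_recall_curve(y_test, probs)
print(f"AUPRC ≈ {average_precision_score(y_test, probs):.3f}")  # stricter under imbalance
```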
Metrics for Regression (Predicting Numbers)
When a model predicts a number — a house price, tomorrow’s temperature, a patient’s blood pressure — we use different measures based on the difference between the prediction and the truth (called the residual).
MAE — Mean Absolute Error
Average of how far off each prediction is, ignoring direction. Easy to understand; not thrown off by a few huge errors.
MSE / RMSE
Mean Squared Error squares the gaps, so big mistakes are penalised much more. RMSE (Root Mean Square Error) brings it back to the original units.
MAPE — Mean Absolute Percentage Error
Expresses errors as percentages. Intuitive (“our model is off by 5% on average”) but breaks when the true value is near zero.
R² (R-squared)
Proportion of the variation explained by the model. R² = 1 means a perfect fit; R² = 0 means no better than just guessing the average.
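Each of these is a one-liner in scikit-learn; the toy true values and predictions below are invented purely to illustrate the calls.

```python
# Sketch: regression metrics on a toy set of true values and predictions.
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_absolute_percentage_error, r2_score)

y_true = np.array([200.0, 150.0, 320.0, 275.0])   # e.g. house prices in $1000s
y_pred = np.array([210.0, 140.0, 300.0, 280.0])

mae  = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # back in the original units
mape = mean_absolute_percentage_error(y_true, y_pred)
r2   = r2_score(y_true, y_pred)
print(f"MAE {mae:.1f}, RMSE {rmse:.1f}, MAPE {mape:.1%}, R² {r2:.3f}")
```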
Class Imbalance — When One Group Is Much Rarer
Real-world datasets are often lopsided. In fraud detection, less than 1 in 1,000 transactions is fraudulent. In cancer screening, only a small fraction of scans show cancer. Models trained on such data tend to ignore the rare class entirely — and still score high on accuracy.
Fig 7. A typical fraud detection dataset. Predicting “not fraud” for every single transaction gives 99% accuracy — and misses all the fraud.
Better metrics for imbalanced data include the F1 score, the AUPRC, and the MCC (Matthews Correlation Coefficient), all of which look past raw accuracy (see Table 1).
Data-level fixes include SMOTE (Synthetic Minority Oversampling Technique) — a method that creates artificial new examples of the rare class to balance the dataset.
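Here is a minimal sketch of SMOTE, assuming the separate imbalanced-learn package (imblearn) is installed; the lopsided dataset is generated synthetically.

```python
# Minimal SMOTE sketch; assumes the imbalanced-learn package (imblearn) is installed.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# A deliberately lopsided dataset: roughly 99% negative, 1% positive.
X, y = make_classification(n_samples=5000, weights=[0.99], random_state=0)
print("before:", Counter(y))

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))   # the rare class is synthetically oversampled to parity
```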
Calibration — Are the Model’s Probabilities Honest?
A model that says “I’m 80% sure” should be right about 80% of the time — not 50%, not 99%. A model whose probabilities match reality is called well-calibrated.
Why does this matter? If a medical AI says a patient has a “30% chance of a heart attack” and that number is not grounded in reality, doctors cannot use it to make good decisions. Calibration is measured using a reliability diagram (also called a calibration plot) and a number called the Brier score.
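Both tools are available in scikit-learn: calibration_curve gives the points of a reliability diagram, and brier_score_loss gives the Brier score. The model and data below are synthetic placeholders.

```python
# Sketch: reliability-diagram points and the Brier score with scikit-learn.
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

# Observed frequency of "yes" within each predicted-probability bin.
frac_positive, mean_predicted = calibration_curve(y_test, probs, n_bins=10)
print(f"Brier score = {brier_score_loss(y_test, probs):.3f}")   # lower is better
```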
Modern and Specialised Metrics
As ML expands into new fields, bespoke metrics have emerged:
Search & recommendations
MAP (Mean Average Precision) and NDCG (Normalised Discounted Cumulative Gain) measure whether the most relevant results appear at the top of a ranked list.
Image generation
FID (Fréchet Inception Distance) measures how realistic AI-generated images look compared to real ones. Lower is better.
Text generation
BLEU and ROUGE compare machine-generated text to human-written reference answers — widely used for translation and summarisation.
Fairness
Demographic Parity and Equalized Odds check whether a model treats different groups (e.g. by race or gender) equally — a critical concern in hiring or lending decisions.
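Of the ranking metrics above, NDCG is the easiest to try directly, since scikit-learn ships an ndcg_score. The relevance labels and model scores below are made up purely for illustration.

```python
# Sketch: NDCG for one ranked list of five items, using scikit-learn's ndcg_score.
import numpy as np
from sklearn.metrics import ndcg_score

true_relevance = np.asarray([[3, 2, 3, 0, 1]])            # ground-truth relevance grades
model_scores   = np.asarray([[0.9, 0.8, 0.1, 0.6, 0.4]])  # the model's ranking scores
print(f"NDCG = {ndcg_score(true_relevance, model_scores):.3f}")  # 1.0 = perfect ordering
```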
The Practical Checklist — Seven Steps Every Time
- Understand the problem. What costs more: a false alarm or a missed case? This determines your primary metric.
- Check class balance. If one class is very rare, accuracy is misleading from the start.
- Use stratified k-fold cross-validation. Always report the mean and standard deviation of your metric — never just a single number.
- Report multiple metrics. Precision, recall, and F1 together tell a much fuller story than accuracy alone.
- Touch the test set only once. Every time you peek at test results to adjust your model, you are leaking information. Lock it away until the very end.
- Check calibration if your model outputs probabilities that will drive real decisions.
- Test for statistical significance. A 0.1% improvement in one experiment may be just noise. Use a statistical test before claiming you have truly improved the model.
Quick Reference
| Metric | What it measures | Use when… |
|---|---|---|
| Accuracy | Overall % correct | Classes are balanced |
| Precision | Quality of “yes” calls | False alarms are costly |
| Recall / TPR | Coverage of true positives | Missing cases is costly |
| F1 score | Balance of precision & recall | Both errors matter equally |
| AUROC | Overall classifier quality | Comparing models |
| AUPRC | Precision–recall trade-off | Class imbalance |
| MCC | Balanced single score | Imbalanced data |
| MAE / RMSE | Average prediction error | Regression tasks |
| Brier score | Probability accuracy | Calibration check |
Table 1. Common metrics at a glance.