Chapter 11

Performance Metrics in Machine Learning — A Student’s Guide


How Do We Know If a Machine Learning Model Is Good?

A plain-English walkthrough of performance metrics — from cross-validation to the ROC (Receiver Operating Characteristic) curve

The Big Question

Imagine you train a dog to fetch sticks. The dog gets very good at your backyard — but what happens in an unfamiliar park? Machine learning (ML) models face exactly the same problem: they can memorise what they have seen, but the real test is how they do on new, unseen data.

Performance metrics are the measuring tools we use to answer: “Is the model actually learning, or just memorising?” Choosing the right metric matters enormously — measuring the wrong thing can give a false sense of success and lead to real harm when the model is deployed.

Key idea Evaluating a model on the same data you used to train it is like grading a student on the very questions they practised with — it tells you nothing about real understanding.

Splitting the Data

Before training, we divide our labelled data (data where we already know the right answers) into three buckets:

[Figure: a rectangle labelled “All labelled data (n samples)” divided into three coloured sections — training (large, green: the model learns from this), validation (medium, orange: tune settings), and test (small, red: final score).]

Fig 1. The three-way split. The test set is locked away and only used once — at the very end.

Why not use everything for training? A model trained on all the data has no “unseen” data to test on. It is like giving a student the answers during the exam — the score is meaningless.
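The three-way split can be sketched in a few lines with scikit-learn. The 50-sample toy data and the 60/20/20 proportions below are illustrative choices, not prescriptions:

```python
# A sketch of the three-way split with scikit-learn's train_test_split.
# The toy data and the 60/20/20 proportions are illustrative assumptions.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # 50 toy samples, 2 features each
y = np.array([0, 1] * 25)          # toy labels

# First lock away the test set, then split the rest into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 30 10 10
```

Note the order: the test set is carved off first, so nothing done with the training or validation data can leak into it.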

Cross-Validation — Supercharging the Validation Set

A single random split can be lucky or unlucky. What if all the easy examples end up in the test set? k-fold cross-validation (CV) (where k is just a chosen number, typically 5 or 10) fixes this by rotating which chunk of data is the test set.

[Figure: five rows, one per round. In round i, fold i is highlighted in orange as the test set (error eᵢ) while the remaining four folds, in green, are used for training.]

Fig 2. 5-fold cross-validation. The orange block rotates as the test set each round. The final error is the average: μₑ = (e₁+e₂+e₃+e₄+e₅) ÷ 5.

After all rounds are done, we average the five error scores. The average (μₑ) is a much fairer estimate than any single split. We also look at how spread out the scores are — the standard deviation (σₑ) — to see if the model is consistent.
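A minimal sketch of this procedure with scikit-learn's cross_val_score; the iris dataset and the logistic-regression model are placeholder choices — any estimator works:

```python
# A minimal sketch of 5-fold cross-validation with scikit-learn.
# The iris data and logistic-regression model are placeholder choices.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

# Report the mean and the spread across the five rounds, never a single number.
print(f"accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```

Setting cv=5 is exactly the rotation shown in Fig 2; scores.mean() and scores.std() are μ and σ of the five per-fold results.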

Leave-one-out cross-validation (LOOCV) An extreme version where each single sample takes turns being the test set. Great when you have very little data, but very slow with large datasets.

A brief history: where did cross-validation come from?

1930s–1960s
Statisticians explored related resampling ideas, and Mosteller & Tukey described systematic data-splitting for assessing predictors.
1970s
Seymour Geisser and Mervyn Stone formalised what we now call cross-validation in landmark statistics papers (Stone, 1974; Geisser, 1975) — they wanted a fair way to compare prediction rules, and their rigorous treatment helped make it standard practice.
Today
Every major ML library (Python’s scikit-learn, R’s caret) has cross-validation built in as a one-line function call.

The Confusion Matrix — Breaking Down Errors

When a model makes a classification decision (e.g. “is this email spam or not?”), there are four possible outcomes:

                   Model says: No            Model says: Yes
Reality: No        ✔ True Negative (TN)      ✘ False Positive (FP)
                     Correctly said “no”       False alarm
Reality: Yes       ✘ False Negative (FN)     ✔ True Positive (TP)
                     Missed it                 Correctly said “yes”

Fig 3. The confusion matrix. Green = correct; red = error.

Real example from the presentation

A disease classifier was tested on 150 patients (66 healthy, 84 sick):

N = 150             Predicted: healthy   Predicted: sick   Total
Actually healthy    TN = 61              FP = 5             66
Actually sick       FN = 7               TP = 77            84
Total               68                   82                150

Fig 4. Worked example: 12 errors out of 150 patients.

Key Metrics — What Does Each Number Mean?

From the confusion matrix, we compute a family of scores. Each one highlights a different aspect of the model’s behaviour.

Accuracy

How often is the model right overall?
(TP + TN) ÷ total = (77+61)÷150 = 92%

Precision

Of everyone the model said “yes” to, how many actually were?
TP ÷ (TP+FP) = 77÷82 = 93.9%

Recall (True Positive Rate)

Of everyone who actually was sick, how many did the model catch?
TP ÷ (TP+FN) = 77÷84 = 91.7%

False Positive Rate

Of healthy people, how many were wrongly flagged as sick?
FP ÷ (FP+TN) = 5÷66 = 7.6%
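All four scores follow directly from the four cell counts; here is a quick sanity check of the worked example in plain Python:

```python
# Recomputing the four metrics straight from the worked example's cell counts.
TN, FP, FN, TP = 61, 5, 7, 77

accuracy  = (TP + TN) / (TP + TN + FP + FN)  # 138/150
precision = TP / (TP + FP)                   # 77/82
recall    = TP / (TP + FN)                   # 77/84
fpr       = FP / (FP + TN)                   # 5/66

print(f"accuracy={accuracy:.1%}  precision={precision:.1%}  recall={recall:.1%}  FPR={fpr:.1%}")
# accuracy=92.0%  precision=93.9%  recall=91.7%  FPR=7.6%
```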

Which metric should you care about?

It depends entirely on what kind of mistake is more dangerous:

Missing sick patients (FN)

Far worse for a terminal disease. You want very high recall — catch every case, even if you raise a few false alarms.

False alarms (FP)

Worse for a spam filter or a security door. You want high precision — don’t block a legitimate email or lock out a genuine user.

The accuracy trap Suppose 99% of emails are not spam. A lazy model that just says “not spam” every single time scores 99% accuracy — but catches zero spam. Accuracy alone is useless here. This is why we always look at recall and precision too.

The F1 Score — One Number to Summarise Both

Precision and recall often pull in opposite directions: being stricter raises precision but lowers recall. The F1 score blends them into a single balanced number using the harmonic mean (a type of average that punishes extreme values).

F1 = 2 × (Precision × Recall) ÷ (Precision + Recall)

For our example: F1 = 2 × (0.939 × 0.917) ÷ (0.939 + 0.917) ≈ 0.928

The generalised version, called the Fβ score, lets you tilt the balance. Setting β = 2 weights recall twice as heavily — ideal for disease screening where missing a sick patient is catastrophic.
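Both numbers are quick to verify; the sketch below uses the standard Fβ formula, Fβ = (1 + β²) × P × R ÷ (β² × P + R), which reduces to F1 when β = 1:

```python
# F1 and F2 from the worked example's precision and recall.
# F_beta = (1 + beta**2) * P * R / (beta**2 * P + R); beta = 1 gives F1.
P, R = 77 / 82, 77 / 84

f1 = 2 * P * R / (P + R)
f2 = 5 * P * R / (4 * P + R)  # beta = 2: recall weighted twice as heavily

print(f"F1 = {f1:.3f}, F2 = {f2:.3f}")  # F1 = 0.928, F2 = 0.921
```

Because recall (91.7%) is the smaller of the two scores here, F2 sits closer to it than F1 does — exactly the tilt a disease-screening application wants.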

Precision 93.9%, Recall 91.7%, F1 Score 92.8%

Fig 5. Precision, Recall, and F1 score for the worked example. F1 sits between the two.

The ROC Curve — A Picture of Trade-offs

Most ML models do not output a definite “yes” or “no”. They output a probability — like “80% chance this is spam”. We choose a threshold: above it, we say “yes”; below it, “no”.

As we lower the threshold (become more lenient), we catch more true positives — but also raise more false alarms. The Receiver Operating Characteristic (ROC) curve plots True Positive Rate (TPR) against False Positive Rate (FPR) as the threshold slides from strict to lenient.
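scikit-learn's roc_curve performs this threshold sweep for you; the twelve labels and scores below are invented purely for illustration:

```python
# scikit-learn's roc_curve performs the threshold sweep described above.
from sklearn.metrics import roc_curve, roc_auc_score

# Invented labels and predicted probabilities, for illustration only.
y_true  = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.35, 0.4, 0.45, 0.5, 0.6, 0.65, 0.7, 0.8, 0.9]

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) point per threshold
auroc = roc_auc_score(y_true, y_score)
print(f"AUROC = {auroc:.3f}")  # 32 of the 36 positive/negative pairs are ranked correctly
```

AUROC has a handy interpretation: it is the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one.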

A brief history of the ROC curve ROC analysis was invented during World War II by radar operators who needed to distinguish real aircraft from noise on their screens. The letters stand for “Receiver Operating Characteristic” — the receiver being the radar receiver. It entered medicine in the 1970s for diagnostic tests, and was adopted by machine learning researchers in the 1980s.
A perfect classifier reaches the top-left corner. Better classifiers curve more towards the top-left.

Fig 6. ROC curves. A perfect model reaches the top-left corner (TPR=1, FPR=0). A random guess follows the diagonal. The area under the curve (AUROC — Area Under the Receiver Operating Characteristic curve) summarises performance: 1.0 = perfect, 0.5 = random.

The Precision–Recall curve

For very imbalanced problems (e.g. fraud detection, where 99.9% of transactions are legitimate), the ROC curve can look deceptively good. The AUPRC (Area Under the Precision–Recall Curve) is a stricter and more honest measure in those cases.
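A sketch of the same idea with scikit-learn, using average precision as the AUPRC estimate; the labels and scores are again invented:

```python
# Average precision is scikit-learn's estimate of the area under the
# precision-recall curve. Labels and scores are invented for illustration.
from sklearn.metrics import precision_recall_curve, average_precision_score

y_true  = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]

precision, recall, _ = precision_recall_curve(y_true, y_score)
ap = average_precision_score(y_true, y_score)
print(f"AUPRC (average precision) = {ap:.3f}")
```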

Metrics for Regression (Predicting Numbers)

When a model predicts a number — a house price, tomorrow’s temperature, a patient’s blood pressure — we use different measures based on the difference between the prediction and the truth (called the residual).

MAE — Mean Absolute Error

Average of how far off each prediction is, ignoring direction. Easy to understand; not thrown off by a few huge errors.

MSE / RMSE

Mean Squared Error squares the gaps, so big mistakes are penalised much more. RMSE (Root Mean Square Error) brings it back to the original units.

MAPE — Mean Absolute Percentage Error

Expresses errors as percentages. Intuitive (“our model is off by 5% on average”) but breaks when the true value is near zero.

R² (R-squared)

Proportion of the variation explained by the model. R² = 1 means a perfect fit; R² = 0 means no better than just guessing the average.
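All four regression metrics on a tiny invented example, so the numbers can be checked by hand:

```python
# The four regression metrics on a tiny invented example.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([100.0, 150.0, 200.0, 250.0])  # true values
y_pred = np.array([110.0, 140.0, 195.0, 265.0])  # model's predictions

mae  = mean_absolute_error(y_true, y_pred)           # (10+10+5+15)/4 = 10.0
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # back in the original units
mape = np.mean(np.abs((y_true - y_pred) / y_true))   # as a fraction, not a %
r2   = r2_score(y_true, y_pred)                      # 1 - 450/12500 = 0.964

print(f"MAE={mae:.1f}  RMSE={rmse:.2f}  MAPE={mape:.1%}  R²={r2:.3f}")
```

Note how the single largest error (15) dominates RMSE more than MAE — that is the squaring at work.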

Class Imbalance — When One Group Is Much Rarer

Real-world datasets are often lopsided. In fraud detection, less than 1 in 1,000 transactions is fraudulent. In cancer screening, only a small fraction of scans show cancer. Models trained on such data tend to ignore the rare class entirely — and still score high on accuracy.

99% non-fraud, 1% fraud.

Fig 7. A typical fraud detection dataset. Predicting “not fraud” for every single transaction gives 99% accuracy — and misses all the fraud.

Better metrics for imbalanced data include:

- AUPRC (Area Under the Precision–Recall Curve)
- MCC (Matthews Correlation Coefficient)
- Balanced Accuracy = (TPR + TNR) ÷ 2
- Per-class F1

Data-level fixes include SMOTE (Synthetic Minority Oversampling Technique) — a method that creates artificial new examples of the rare class to balance the dataset.
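The accuracy trap from Fig 7 can be reproduced in a few lines; note how the imbalance-aware metrics immediately expose the lazy model:

```python
# The accuracy trap, reproduced: a "model" that always predicts the majority class.
from sklearn.metrics import accuracy_score, balanced_accuracy_score, matthews_corrcoef

y_true = [0] * 99 + [1]   # 99 legitimate transactions, 1 fraudulent
y_pred = [0] * 100        # lazy model: always says "not fraud"

print(accuracy_score(y_true, y_pred))           # 0.99 -- looks impressive
print(balanced_accuracy_score(y_true, y_pred))  # 0.5  -- no better than chance
print(matthews_corrcoef(y_true, y_pred))        # 0.0  -- no correlation at all
```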

Calibration — Are the Model’s Probabilities Honest?

A model that says “I’m 80% sure” should be right about 80% of the time — not 50%, not 99%. A model whose probabilities match reality is called well-calibrated.

Why does this matter? If a medical AI says a patient has a “30% chance of a heart attack” and that number is not grounded in reality, doctors cannot use it to make good decisions. Calibration is measured using a reliability diagram (also called a calibration plot) and a number called the Brier score.

Real-world importance Weather forecasters are among the best-calibrated probabilistic predictors in the world. When a meteorologist says “70% chance of rain”, it rains on about 70% of those days. ML models used in medicine, law, and finance are held to the same standard.
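Both measurements are available in scikit-learn; the probabilities and outcomes below are invented, and two bins are used only to keep the toy example readable:

```python
# Sketch: Brier score and reliability-diagram data for invented probabilities.
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

y_true = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.4, 0.55, 0.6, 0.7, 0.8, 0.9, 0.95]

# Brier score: mean squared gap between predicted probability and outcome
# (0 = perfect, lower is better).
brier = brier_score_loss(y_true, y_prob)
print(f"Brier score = {brier:.4f}")

# Reliability-diagram data: observed frequency vs mean predicted probability, per bin.
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=2)
print(frac_pos, mean_pred)
```

For a well-calibrated model, each frac_pos value is close to the matching mean_pred value, so the reliability diagram hugs the diagonal.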

Modern and Specialised Metrics

As ML expands into new fields, bespoke metrics have emerged:

Search & recommendations

MAP (Mean Average Precision) and NDCG (Normalised Discounted Cumulative Gain) measure whether the most relevant results appear at the top of a ranked list.

Image generation

FID (Fréchet Inception Distance) measures how realistic AI-generated images look compared to real ones. Lower is better.

Text generation

BLEU and ROUGE compare machine-generated text to human-written reference answers — widely used for translation and summarisation.

Fairness

Demographic Parity and Equalized Odds check whether a model treats different groups (e.g. by race or gender) equally — a critical concern in hiring or lending decisions.

The Practical Checklist — Seven Steps Every Time

  1. Understand the problem. What costs more: a false alarm or a missed case? This determines your primary metric.
  2. Check class balance. If one class is very rare, accuracy is misleading from the start.
  3. Use stratified k-fold cross-validation. Always report the mean and standard deviation of your metric — never just a single number.
  4. Report multiple metrics. Precision, recall, and F1 together tell a much fuller story than accuracy alone.
  5. Touch the test set only once. Every time you peek at test results to adjust your model, you are leaking information. Lock it away until the very end.
  6. Check calibration if your model outputs probabilities that will drive real decisions.
  7. Test for statistical significance. A 0.1% improvement in one experiment may be just noise. Use a statistical test before claiming you have truly improved the model.
Golden rule Use the metric that reflects the true cost of errors in your specific application — not the metric that makes your model look best.

Quick Reference

Metric         What it measures                 Use when…
Accuracy       Overall % correct                Classes are balanced
Precision      Quality of “yes” calls           False alarms are costly
Recall / TPR   Coverage of true positives       Missing cases is costly
F1 score       Balance of precision & recall    Both errors matter equally
AUROC          Overall classifier quality       Comparing models
AUPRC          Precision–recall trade-off       Class imbalance
MCC            Balanced single score            Imbalanced data
MAE / RMSE     Average prediction error         Regression tasks
Brier score    Probability accuracy             Calibration check

Table 1. Common metrics at a glance.

Report based on the lecture “Performance Metrics in Machine Learning” by CK Raju, PhD · HUB DATA · May 2026