PSYC4411

Models, Not Magic

Generalisation, Overfitting, and How ML Misleads

Week 3 · Semester 1, 2026

Current Advances in Psychological Methods & Analyses

Prof. Michael J. Richardson

School of Psychological Sciences
Faculty of Medicine, Health and Human Sciences
Macquarie University

michael.j.richardson@mq.edu.au

Today's Agenda

  1. Speaking ML — translating stats vocabulary into ML language
  2. What is a model? — demystifying the word
  3. Generalisation — the fundamental goal of ML
  4. Train/test splits — the simplest defence against self-deception
  5. Overfitting & underfitting — when models lie to you
  6. Bias–variance trade-off — the spectrum from too rigid to too flexible
  7. Cross-validation — a more reliable way to evaluate
  8. Baselines & metrics — how to know if your model actually works
  9. Regularisation preview — setting up Week 4

The Big Idea for This Week

A model that fits your data perfectly is almost certainly lying to you.

The real goal isn't to explain the data you have — it's to make predictions about data you haven't seen yet.

Understanding this distinction is what separates useful ML from self-deception.

Speaking ML

A translation guide from statistics to machine learning

Stats → ML: A Translation Guide

What you know (stats)                      What it's called in ML
IV / Predictor                         →   Feature
DV / Outcome                           →   Target (regression) / Label (classification)
Running a regression / fitting a model →   Training
Your sample                            →   Training data
Population you generalise to           →   Test data (held-out)
Coefficients / regression weights      →   Model parameters / weights
R², adjusted R²                        →   Evaluation metrics (on test data)

The Biggest Shift

Traditional Statistics

Fit model to your entire sample

Ask: "Is this effect real?"

Assess significance (p-values, confidence intervals)

Machine Learning

Fit model to part of your data

Ask: "Can I predict the rest?"

Evaluate on held-out data (R², MAE)

Same data, fundamentally different question. Statistics asks "is this real?" — ML asks "can I predict what happens next?"

Two Flavours of ML

Supervised Learning

You have a target variable (labelled data)

  • Regression — continuous target
    e.g., predicting depression score
  • Classification — categorical target
    e.g., predicting diagnosis group

Weeks 3–6 of this course

Unsupervised Learning

No target variable — looking for structure

  • Clustering — finding natural groups
    e.g., patient subtypes
  • Dimensionality reduction — simplifying
    e.g., PCA on questionnaire items

Weeks 7–8 of this course

Our Running Example

Remember the synthetic DASS dataset from Week 2?

3,000 fake participants · lifestyle variables · DASS-21 depression, anxiety, stress scores

Features (predictors): Sleep_hrs_night, Exercise_hrs_week, SocialMedia_hrs_week, ...

Target (outcome): DASS_Depression (0–42 scale)

This is supervised learning → regression

  • Known correlations: Sleep r = −0.37, Exercise r = −0.27 with depression
  • Students had highest depression (M = 20.3) vs employed full-time (M = 8.7)
  • Can we formalise these patterns into a model that predicts depression for new people?

You've Already Done (a version of) ML

If you've ever fitted a line of best fit or run a correlation, you've already done a version of machine learning.

You just called it statistics.

The difference isn't the maths — it's the question you ask.

Stats: "Is sleep significantly related to depression?"

ML: "If I know someone sleeps 5 hours a night, what depression score do I predict?"
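In code, the ML question is literally a prediction call. A minimal sketch with scikit-learn, using made-up sleep and depression numbers (not the course dataset — the slope and noise here are invented for illustration):

```python
# Fit a line to simulated sleep/depression data, then answer the ML question:
# "If someone sleeps 5 hours a night, what depression score do I predict?"
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
sleep_hrs = rng.uniform(4, 10, size=200).reshape(-1, 1)        # hours per night
depression = 30 - 2.5 * sleep_hrs.ravel() + rng.normal(0, 4, 200)  # fake scores

model = LinearRegression().fit(sleep_hrs, depression)
pred = model.predict([[5.0]])     # a new person who sleeps 5 hours
print(round(pred[0], 1))          # a predicted score in the high teens
```

Same maths as the regression you already know — the difference is that the output is a prediction for a new person, not a p-value.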

What Is a Model, Really?

Demystifying the word

Models Are Everywhere

A model is a simplified version of reality that captures a pattern.

  • "Sleep affects mood" — that's a model
  • A regression equation — that's a model
  • "Students are more stressed than working adults" — that's a model
  • A decision tree — that's a model

In stats, a model describes relationships in your sample.

In ML, a model makes predictions about new data — people it has never seen before.

Before a Model Can Learn: Preprocessing

Preprocessing = getting data ready for the model

  • Missing values — you saw these in Week 2 (~2–3% of lifestyle columns)
    • Remove rows? Fill in with the mean? Depends on the situation.
  • Encoding categories — converting text to numbers
    • Gender (Male, Female, Non-binary) → numerical codes the model can use
  • Scaling — putting variables on comparable ranges
    • Income in thousands vs. sleep in single digits — the model needs them comparable

We'll define these techniques properly in Week 4 when you put them into practice.
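A hedged sketch of those three steps on a toy DataFrame (the column names mimic the dataset, but the values are made up):

```python
# Minimal preprocessing: fill a missing value, encode a category, scale a column.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "Sleep_hrs_night": [7.0, None, 5.5, 8.0],              # one missing value
    "Gender": ["Male", "Female", "Non-binary", "Female"],  # text category
    "Income_k": [55, 72, 38, 90],                          # much larger scale
})

# 1. Missing values: fill with the column mean (one option among several)
df["Sleep_hrs_night"] = df["Sleep_hrs_night"].fillna(df["Sleep_hrs_night"].mean())
# 2. Encoding: one-hot encode the category into 0/1 columns
df = pd.get_dummies(df, columns=["Gender"], dtype=int)
# 3. Scaling: standardise income to mean 0, SD 1
df[["Income_k"]] = StandardScaler().fit_transform(df[["Income_k"]])
print(df)
```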

Generalisation

The fundamental goal of machine learning

The Core Goal: Generalisation

Generalisation = making accurate predictions about data the model has never seen.

  • A model that explains your sample perfectly but fails on new participants is worthless
  • We want models that capture real patterns — not noise specific to our particular sample
  • This is where most ML projects go wrong — optimising for training data instead of new data

The Exam Analogy

Memorising

Memorise every past exam paper word-for-word

Perfect marks on those specific papers

New questions? In trouble.

= Overfitting

Understanding

Understand the underlying concepts

Can answer questions you've never seen before

Adapts to new situations

= Generalising

Understanding the concepts (generalising) beats memorising the specifics (overfitting) every time.

You Already Know About Generalisation

In inferential statistics, you generalise from a sample to a population.

That's what p-values and confidence intervals are for — asking whether patterns in your sample likely exist in the broader population.

ML takes this further: instead of asking "is this effect real?" it asks "can I predict what will happen for a new person?"

This connects directly to Week 1's theme: prediction vs explanation
(Yarkoni & Westfall, 2017)

Train / Test Splits

The simplest defence against self-deception

Split Your Data Before You Start

Before building any model, split your data into two parts:

Training Set
~75% of data
Model learns from this
Test Set
~25% of data
Evaluate here

Like splitting your sample in half, running your analysis on one half, and checking whether the same pattern holds in the other — before publishing.
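In scikit-learn the split is a single call; this sketch uses placeholder data, with the 75/25 ratio from the slide:

```python
# Split 100 fake participants into a 75% training set and a 25% test set.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)   # 100 fake participants, one feature
y = np.arange(100)                  # fake target values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
print(len(X_train), len(X_test))    # 75 25
```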

The Cardinal Rule

Never peek at the test set.

If you look at it — even once — during model development, it stops being "new" data and your evaluation is contaminated.

  • Don't use test data to choose features
  • Don't use test data to tune settings
  • Don't use test data to compare models during development
  • Only touch the test set once, at the very end, for your final evaluation

Beyond the Basic Split

Train/test is the foundation — but there are more sophisticated strategies:

  • Validation set — a third split carved from training data, used to compare models and tune settings without touching the test set
    • Cross-validation (coming up next) achieves this without sacrificing data
  • Leave-One-Participant-Out (LOPO) — hold out one person at a time, train on everyone else
    • Essential when you have repeated measures — prevents data from the same person leaking across splits
  • Leave-One-Group-Out (LOGO) — hold out an entire site or group
    • Does a model trained at one hospital generalise to another?
  • Stratified splitting — ensures categories (e.g., diagnosis, gender) are represented proportionally in each split

We'll explore these specialised methods in later weeks when we work with more complex datasets.

Think About It

A researcher builds 20 different models and picks the one that scores best on the test set. They report that model's test performance as their result.

Why is this a problem?

How is it similar to p-hacking in traditional statistics?

Overfitting & Underfitting

When models lie to you

Overfitting

The model memorises the training data — including its noise and quirks. Performs brilliantly on training data but poorly on anything new.

  • Like the student who memorised past exam answers without understanding the material
  • Training R² = 0.95 → "Amazing!"
    • Test R² = 0.10 → "Disaster."
  • The model learned the noise, not the signal

Underfitting

The model is too rigid to capture real patterns. A straight line through clearly curved data.

  • Like the student who only read the chapter summaries
  • Training R² = 0.05 → "Hmm."
    • Test R² = 0.04 → "At least it's consistent... consistently bad."
  • The model missed the real patterns in the data

The Sweet Spot

[Figure: prediction error as a function of model complexity, from simple to complex. Training error falls steadily as complexity increases; test error is U-shaped — high in the underfitting region, lowest at the sweet spot, rising again in the overfitting region.]

A Concrete Example

Using the synthetic DASS dataset from Week 2:

Circular overfitting: If we include the individual DASS items (DASS_1, DASS_2, ... DASS_21) as features to predict DASS_Depression — the model "learns" the scoring formula. R² ≈ 1.0 on training data. But it hasn't learned anything about how lifestyle relates to depression.

Clinical danger: A model appears to predict suicide risk with 95% accuracy in the training sample — but drops to 55% on new patients. A clinician relying on that model has false confidence in predictions barely better than a coin flip.
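The training/test gap is easy to demonstrate. A sketch on synthetic curved data — degree 15 is an arbitrary, deliberately over-flexible choice, not a recommendation:

```python
# An over-flexible polynomial scores far better on training data than on
# held-out data: that gap is the overfitting.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=60).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(0, 1, 60)     # curved signal plus noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
wiggly = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
wiggly.fit(X_tr, y_tr)
train_r2 = wiggly.score(X_tr, y_tr)   # high — the model memorised the noise
test_r2 = wiggly.score(X_te, y_te)    # noticeably lower on new data
print(round(train_r2, 2), round(test_r2, 2))
```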

Think About It

Psychology traditionally uses relatively rigid models — t-tests, ANOVA, linear regression — that make strong assumptions.

ML offers far more flexible models.

When might a psychologist prefer a biased but stable model over a flexible but unstable one?

Think about: clinical decision-making, replication, and sample sizes.

The Bias–Variance Trade-off

The spectrum from too rigid to too flexible

Bias

Bias = how far off the model's predictions are on average

  • High bias → the model consistently misses the truth
    • A straight line fitted to a curved relationship
  • High bias = underfitting
    • The model is too rigid to capture the real pattern
  • Example: using only age to predict depression (ignoring sleep, exercise, social support...)
    • The model is systematically wrong because it's missing important information

Variance

Variance = how much predictions change when trained on different samples

  • High variance → the model is unstable
    • A wiggly curve that changes dramatically with each new dataset
  • High variance = overfitting
    • The model is fitting noise, not signal
  • Example: a complex model using all 44 variables on a sample of 50
    • Train it on a different 50 people and you get completely different results

The Trade-off

Reducing bias tends to increase variance, and vice versa. You can't minimise both at once.

High Bias

Low Variance

Underfitting

← The trade-off →

Low Bias

High Variance

Overfitting

Stats models you know:
t-tests, ANOVA, linear regression

High bias, low variance — strong assumptions, stable results, may miss patterns

Flexible ML models:
neural networks, random forests

Low bias, high variance — fewer assumptions, capture complexity, need more data

The Replication Crisis Connection

Complex analyses on small samples = high variance = findings that don't replicate.

  • Psychology's replication crisis is partly a bias–variance problem
    • Flexible analyses on small, noisy samples → results that look compelling but don't hold up
  • ML makes this trade-off explicit rather than hiding it behind a single p-value
    • You can see the gap between training and test performance — that gap is the overfitting
  • Cross-validation (coming up next) forces you to confront this directly

Cross-Validation

A more reliable way to evaluate

The Problem with a Single Split

  • A single train/test split gives you one estimate of performance
  • What if you got a lucky split? Or an unlucky one?
    • Maybe the easy-to-predict participants all ended up in your test set
  • Your result depends on which observations happened to land where
  • We need something more robust...

K-Fold Cross-Validation

  1. Split your data into K equal chunks (typically 5 or 10)
  2. Train on K−1 chunks, test on the remaining one
  3. Repeat K times, rotating which chunk is held out
  4. Average the K performance scores

Think of it like running your study K times with different random samples. Instead of one p-value, you get K estimates of performance — and you can see how much they vary.
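The four steps above map directly onto scikit-learn's `cross_val_score`; this sketch uses simulated data rather than the DASS file:

```python
# 5-fold cross-validation: five R² scores, one per held-out chunk, then averaged.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))                        # 200 fake participants
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1, 200)

scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(np.round(scores, 2))       # five R² values — check how much they vary
print(round(scores.mean(), 2))   # the averaged performance estimate
```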

5-Fold Cross-Validation

Fold 1:  [Test ] [Train] [Train] [Train] [Train]  → R²₁
Fold 2:  [Train] [Test ] [Train] [Train] [Train]  → R²₂
Fold 3:  [Train] [Train] [Test ] [Train] [Train]  → R²₃
Fold 4:  [Train] [Train] [Train] [Test ] [Train]  → R²₄
Fold 5:  [Train] [Train] [Train] [Train] [Test ]  → R²₅

Final score = Average of R²₁ through R²₅

Cross-Validation in Practice

  • K = 5 or K = 10 are the most common choices
    • K = 5 is faster; K = 10 gives slightly better estimates
  • If performance is consistent across folds → you can be more confident the model generalises
  • If performance swings wildly → something may be wrong
    • Too few observations? Data not representative? Model too complex?
  • Important: CV is for model development — comparing approaches, tuning settings
    • Still keep a final held-out test set you only touch at the very end

Reference: de Rooij & Weeda (2020) — "Cross-Validation: A Method Every Psychologist Should Know"

A Note on Random Seeds

Data splits are random — different rows land in training vs test each time you run the code.

  • Random seed = a number that makes randomness reproducible
    • Same seed → same split → same results every time
  • During development: fix your seed so results are stable while you build and compare models
  • Before reporting: try a few different seeds and check your conclusions still hold
    • If results change dramatically with a different seed → warning sign

This extends beyond splitting. Models like neural networks start from random initial weights — fixing those seeds makes your entire analysis reproducible. We'll revisit this in later weeks.

random_state=42   ← now you know this little argument is doing something important!
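A minimal demonstration of what the seed buys you (toy data; the seed values 42 and 7 are arbitrary):

```python
# Same seed → identical split, every run. Different seed → a different shuffle.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)    # 20 fake observations

a, _ = train_test_split(X, random_state=42)   # fixed seed
b, _ = train_test_split(X, random_state=42)   # same seed again
c, _ = train_test_split(X, random_state=7)    # different seed

print(np.array_equal(a, b))   # True — fully reproducible
print(np.array_equal(a, c))   # almost certainly False — a different shuffle
```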

Baselines & Metrics

How to know if your model actually works

Baseline Models: "Am I Actually Learning Anything?"

Before building anything complex, ask: what's the dumbest possible prediction?

Mean prediction baseline: Predict the average target value for everyone.

In our synthetic DASS dataset, mean depression ≈ 14. If we predict 14 for every person, MAE ≈ 9 points.

If your fancy model can't beat this, it hasn't learned anything useful.

  • Baselines keep you honest
  • R² = 0.15 sounds bad — until baseline is R² = 0.00 and theoretical max is ~0.30
    • Context matters enormously in behavioural science
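scikit-learn even ships the baseline as a model, `DummyRegressor`. A sketch with fabricated DASS-like scores (the real baseline numbers will differ):

```python
# The mean-prediction baseline: predict the training mean for everyone,
# then measure how far off that is. Any real model must beat this MAE.
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(42)
y_train = rng.normal(14, 10, 500).clip(0, 42)   # fake depression scores
y_test = rng.normal(14, 10, 200).clip(0, 42)

baseline = DummyRegressor(strategy="mean")       # always predicts the mean
baseline.fit(np.zeros((500, 1)), y_train)        # the features are ignored
mae = mean_absolute_error(y_test, baseline.predict(np.zeros((200, 1))))
print(round(mae, 1))                             # the bar your model must clear
```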

Think About It

A research team reports that their ML model predicts therapy outcomes with 72% accuracy.

Sounds impressive — but what if 70% of patients improve regardless of treatment?

Their model barely beats random guessing. Why do you think baseline comparisons are so rarely reported in published ML papers?

Evaluation Metrics for Regression

R²

Coefficient of determination

Proportion of variance explained

R² = 0.30 → model explains 30% of variation in depression scores

Familiar from stats — but now calculated on test data

MAE

Mean Absolute Error

Average prediction error, in same units as target

MAE = 4.2 → predictions off by ~4 points on the 0–42 DASS scale

Intuitive — easy to explain to non-technical audiences

RMSE

Root Mean Squared Error

Like MAE but penalises large errors more

RMSE > MAE → some predictions are very far off

Use when big mistakes are especially costly (e.g., clinical risk)

Rule of thumb: report at least R² (relative performance) and MAE (practical accuracy). Classification metrics (accuracy, precision, recall) come in Week 5.
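All three metrics are one call each in `sklearn.metrics`; the five scores below are invented for illustration:

```python
# R², MAE, and RMSE for the same five predictions.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([10, 22, 5, 30, 14])   # actual DASS-style scores
y_pred = np.array([12, 18, 9, 27, 15])   # a model's predictions

r2 = r2_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(round(r2, 2))     # 0.88 — proportion of variance explained
print(round(mae, 2))    # 2.8  — off by ~3 points on average
print(round(rmse, 2))   # 3.03 — slightly above MAE: a few larger misses
```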

What Counts as "Good"?

It depends on the domain. In behavioural science:

  • R² = 0.10 – 0.15 → small but meaningful
    • Predicting individual behaviour from a few survey items? That's real signal.
  • R² = 0.20 – 0.35 → quite good for individual-level prediction
    • Human behaviour is inherently noisy — we're not predicting physics.
  • R² > 0.50 → check for data leakage or circular features
    • Suspiciously good? Make sure you're not accidentally cheating.

We're predicting what real, complex people do — not the trajectory of a billiard ball. Context matters.

Regularisation

A brief preview for Week 4

Fighting Overfitting with Regularisation

Regularisation = adding a penalty that discourages the model from becoming too complex

Ridge (L2)

  • Shrinks all coefficients toward zero
  • Keeps all features, but makes them smaller
  • "Don't let any single predictor dominate"

Lasso (L1)

  • Can shrink coefficients all the way to zero
  • Effectively drops unimportant features
  • "Which predictors can I remove without losing much?"
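A preview sketch of the difference, on synthetic data where one feature nearly duplicates another and a third is pure noise (the alpha values are illustrative, not tuned):

```python
# Ridge shrinks every coefficient; Lasso can drive some all the way to zero.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + rng.normal(0, 0.05, 200)   # feature 3 ≈ copy of feature 1
y = 3 * X[:, 0] + rng.normal(0, 1, 200)        # only that direction matters

ridge_coef = Ridge(alpha=1.0).fit(X, y).coef_
lasso_coef = Lasso(alpha=0.1).fit(X, y).coef_
print(np.round(ridge_coef, 2))   # all nonzero, weight shared across the pair
print(np.round(lasso_coef, 2))   # at least one coefficient driven to exactly 0
```

Note how Ridge splits the weight across the two near-duplicate features, while Lasso tends to keep one and drop the other — which matters for the Think About It question below.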

Stats Connection: Stepwise Regression

You may have encountered stepwise regression — adding or removing predictors based on significance.

Lasso does something similar but more principled — it simultaneously fits the model and selects features in a single step, rather than testing one predictor at a time.

  • These are the models you'll build in Week 4's lab
  • You'll compare: baseline → linear regression → Ridge → Lasso
  • The AI will help you write the code — you need to understand what the models do and why

Think About It

If a Lasso regression drops a feature entirely (sets its coefficient to zero), does that mean the feature is truly unrelated to the outcome?

Or could it mean something else?

Think about what happens when two features are highly correlated with each other.

Common Misconceptions

  • "Higher R² is always better"
    • Not if it comes from overfitting. Training R² = 0.95 but test R² = 0.10 is a disaster.
  • "More features = better model"
    • Adding irrelevant features introduces noise. Sometimes less is more.
  • "R² = 0.25 means my model is bad"
    • In behavioural science, that's quite respectable for individual-level prediction.
  • "Cross-validation guarantees good performance"
    • It gives better estimates, but can still mislead with structured or clustered data.

The ML Pipeline — Putting It All Together

Data
Features + Target
Preprocess
Clean, Scale, Encode
Split
Train / Test + CV
Train
Fit Model
Evaluate
R², MAE, RMSE

This is the workflow you'll follow in every lab — Week 4 will be your first time through it.
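The five boxes above, as a runnable sketch on synthetic data (Ridge and the 75/25 split are illustrative choices, not the required lab settings):

```python
# Data → Preprocess → Split → Train → Evaluate, end to end.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Data: fake features and target
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = X @ np.array([2.0, -1.5, 0.0, 0.5]) + rng.normal(0, 1, 300)

# Split first; scaling lives inside the pipeline, so it is learned
# from the training fold only — no peeking at the test set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X_tr, y_tr)

# Evaluate on held-out data
test_r2 = model.score(X_te, y_te)
test_mae = mean_absolute_error(y_te, model.predict(X_te))
print(round(test_r2, 2), round(test_mae, 2))
```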

Getting Ready for Week 4

Next week you'll put all of this into practice:

  • Build regression models on a real DASS dataset
    • 39,775 real survey responses — not synthetic data this time
  • Use scikit-learn (sklearn) — the standard Python library for ML
    • Your AI assistant helps with the code — you focus on what and why
  • Compare: Baseline → Linear Regression → Ridge → Lasso
    • Which model generalises best to held-out data?
  • LLM skill focus: debugging
    • When code breaks, share the error with your AI and guide it to a fix

Your New Shared Language with AI

The vocabulary from this week becomes your tool for communicating with AI coding assistants:

"My features are TIPI Extraversion and Emotional Stability. My target is DASS Depression. I want to train a Ridge regression with 5-fold cross-validation and report MAE and R²."

The AI knows exactly what you mean. That's the power of speaking ML.

Before next week: Read the companion reading if you haven't already. Review the key terms. You don't need to memorise code — just the concepts.

Homework: Download the Week 4 dataset before class!

conda activate psyc4411-env
cd weeks/week-04-lab/data
python download_data.py

Key References

  • Yarkoni, T., & Westfall, J. (2017). Choosing prediction over explanation in psychology: Lessons from machine learning. Perspectives on Psychological Science.
  • de Rooij, M., & Weeda, W. (2020). Cross-validation: A method every psychologist should know. Advances in Methods and Practices in Psychological Science.

Full reading list: readings.md

Questions?

Next week: Your First ML Pipeline — Regression with scikit-learn

PSYC4411 · Macquarie University · 2026