PSYC4411

Models, Not Magic

Generalisation, Overfitting, and How ML Misleads

Week 3 · Semester 1, 2026

Current Advances in Psychological Methods & Analyses

Prof. Michael J. Richardson

School of Psychological Sciences
Faculty of Medicine, Health and Human Sciences
Macquarie University

michael.j.richardson@mq.edu.au

Today's Agenda

  1. Speaking ML — translating stats vocabulary into ML language
  2. What is a model? — demystifying the word
  3. Generalisation — the fundamental goal of ML
  4. Train/test splits — the simplest defence against self-deception
  5. Overfitting & underfitting — when models lie to you
  6. Bias–variance trade-off — the spectrum from too rigid to too flexible
  7. Cross-validation — a more reliable way to evaluate
  8. Baselines & metrics — how to know if your model actually works
  9. Regularisation preview — setting up Week 4

The Big Idea for This Week

A model that fits your data perfectly is almost certainly lying to you.

The real goal isn't to explain the data you have — it's to make predictions about data you haven't seen yet.

Understanding this distinction is what separates useful ML from self-deception.

Speaking ML

A translation guide from statistics to machine learning

Stats → ML: A Translation Guide

What you know (stats)                      What it's called in ML
IV / Predictor                         →   Feature
DV / Outcome                           →   Target (regression) / Label (classification)
Running a regression / fitting a model →   Training
Your sample                            →   Training data
Population you generalise to           →   Test data (held-out)
Coefficients / regression weights      →   Model parameters / weights
R², adjusted R²                        →   Evaluation metrics (on test data)

The Biggest Shift

Traditional Statistics

Fit model to your entire sample

Ask: "Is this effect real?"

Assess significance (p-values, confidence intervals)

Machine Learning

Fit model to part of your data

Ask: "Can I predict the rest?"

Evaluate on held-out data (R², MAE)

Same data, fundamentally different question. Statistics asks "is this real?" — ML asks "can I predict what happens next?"

Two Flavours of ML

Supervised Learning

You have a target variable (labelled data)

  • Regression — continuous target
    e.g., predicting depression score
  • Classification — categorical target
    e.g., predicting diagnosis group

Weeks 3–6 of this course

Unsupervised Learning

No target variable — looking for structure

  • Clustering — finding natural groups
    e.g., patient subtypes
  • Dimensionality reduction — simplifying
    e.g., PCA on questionnaire items

Weeks 7–8 of this course

Our Running Example

Remember the synthetic DASS dataset from Week 2?

3,000 fake participants · lifestyle variables · DASS-21 depression, anxiety, stress scores

Features (predictors): Sleep_hrs_night, Exercise_hrs_week, SocialMedia_hrs_week, ...

Target (outcome): DASS_Depression (0–42 scale)

This is supervised learning → regression

  • Known correlations: Sleep r = −0.37, Exercise r = −0.27 with depression
  • Students had highest depression (M = 20.3) vs employed full-time (M = 8.7)
  • Can we formalise these patterns into a model that predicts depression for new people?

You've Already Done (a version of) ML

If you've ever fitted a line of best fit or run a correlation, you've already done a version of machine learning.

You just called it statistics.

The difference isn't the maths — it's the question you ask.

Stats: "Is sleep significantly related to depression?"

ML: "If I know someone sleeps 5 hours a night, what depression score do I predict?"
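In code, the ML question is literally a prediction call. A minimal sketch with scikit-learn, using made-up sleep and depression numbers (not the course dataset — the slope and noise here are invented for illustration):

```python
# Fit a line to simulated sleep/depression data, then answer the ML question:
# "If someone sleeps 5 hours a night, what depression score do I predict?"
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
sleep_hrs = rng.uniform(4, 10, size=200).reshape(-1, 1)        # hours per night
depression = 30 - 2.5 * sleep_hrs.ravel() + rng.normal(0, 4, 200)  # fake scores

model = LinearRegression().fit(sleep_hrs, depression)
pred = model.predict([[5.0]])     # a new person who sleeps 5 hours
print(round(pred[0], 1))          # a predicted score in the high teens
```

Same maths as the regression you already know — the difference is that the output is a prediction for a new person, not a p-value.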

What Is a Model, Really?

Demystifying the word

Models Are Everywhere

A model is a simplified version of reality that captures a pattern.

  • "Sleep affects mood" — that's a model
  • A regression equation — that's a model
  • "Students are more stressed than working adults" — that's a model
  • A decision tree — that's a model

In stats, a model describes relationships in your sample.

In ML, a model makes predictions about new data — people it has never seen before.

Before a Model Can Learn: Preprocessing

Preprocessing = getting data ready for the model

  • Missing values — you saw these in Week 2 (~2–3% of lifestyle columns)
    • Remove rows? Fill in with the mean? Depends on the situation.
  • Encoding categories — converting text to numbers
    • Gender (Male, Female, Non-binary) → numerical codes the model can use
  • Scaling — putting variables on comparable ranges
    • Income in thousands vs. sleep in single digits — the model needs them comparable

We'll define these techniques properly in Week 4 when you put them into practice.
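A hedged sketch of those three steps on a toy DataFrame (the column names mimic the dataset, but the values are made up):

```python
# Minimal preprocessing: fill a missing value, encode a category, scale a column.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "Sleep_hrs_night": [7.0, None, 5.5, 8.0],              # one missing value
    "Gender": ["Male", "Female", "Non-binary", "Female"],  # text category
    "Income_k": [55, 72, 38, 90],                          # much larger scale
})

# 1. Missing values: fill with the column mean (one option among several)
df["Sleep_hrs_night"] = df["Sleep_hrs_night"].fillna(df["Sleep_hrs_night"].mean())
# 2. Encoding: one-hot encode the category into 0/1 columns
df = pd.get_dummies(df, columns=["Gender"], dtype=int)
# 3. Scaling: standardise income to mean 0, SD 1
df[["Income_k"]] = StandardScaler().fit_transform(df[["Income_k"]])
print(df)
```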

Generalisation

The fundamental goal of machine learning

The Core Goal: Generalisation

Generalisation = making accurate predictions about data the model has never seen.

  • A model that explains your sample perfectly but fails on new participants is worthless
  • We want models that capture real patterns — not noise specific to our particular sample
  • This is where most ML projects go wrong — optimising for training data instead of new data

The Exam Analogy

Memorising

Memorise every past exam paper word-for-word

Perfect marks on those specific papers

New questions? In trouble.

= Overfitting

Understanding

Understand the underlying concepts

Can answer questions you've never seen before

Adapts to new situations

= Generalising

Understanding the concepts (generalising) beats memorising the specifics (overfitting) every time.

You Already Know About Generalisation

In inferential statistics, you generalise from a sample to a population.

That's what p-values and confidence intervals are for — asking whether patterns in your sample likely exist in the broader population.

ML takes this further: instead of asking "is this effect real?" it asks "can I predict what will happen for a new person?"

This connects directly to Week 1's theme: prediction vs explanation
(Yarkoni & Westfall, 2017)

Train / Test Splits

The simplest defence against self-deception

Split Your Data Before You Start

Before building any model, split your data into two parts:

Training Set
~75% of data
Model learns from this
Test Set
~25% of data
Evaluate here

Like splitting your sample in half, running your analysis on one half, and checking whether the same pattern holds in the other — before publishing.
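In scikit-learn the split is a single call; this sketch uses placeholder data, with the 75/25 ratio from the slide:

```python
# Split 100 fake participants into a 75% training set and a 25% test set.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)   # 100 fake participants, one feature
y = np.arange(100)                  # fake target values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
print(len(X_train), len(X_test))    # 75 25
```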

The Cardinal Rule

Never peek at the test set.

If you look at it — even once — during model development, it stops being "new" data and your evaluation is contaminated.

  • Don't use test data to choose features
  • Don't use test data to tune settings
  • Don't use test data to compare models during development
  • Only touch the test set once, at the very end, for your final evaluation

Beyond the Basic Split

Train/test is the foundation — but there are more sophisticated strategies:

  • Validation set — a third split carved from training data, used to compare models and tune settings without touching the test set
    • Cross-validation (coming up next) achieves this without sacrificing data
  • Leave-One-Participant-Out (LOPO) — hold out one person at a time, train on everyone else
    • Essential when you have repeated measures — prevents data from the same person leaking across splits
  • Leave-One-Group-Out (LOGO) — hold out an entire site or group
    • Does a model trained at one hospital generalise to another?
  • Stratified splitting — ensures categories (e.g., diagnosis, gender) are represented proportionally in each split

We'll explore these specialised methods in later weeks when we work with more complex datasets.

Think About It

A researcher builds 20 different models and picks the one that scores best on the test set. They report that model's test performance as their result.

Why is this a problem?

How is it similar to p-hacking in traditional statistics?

Overfitting & Underfitting

When models lie to you

Overfitting

The model memorises the training data — including its noise and quirks. Performs brilliantly on training data but poorly on anything new.

  • Like the student who memorised past exam answers without understanding the material
  • Training R² = 0.95 → "Amazing!"
    • Test R² = 0.10 → "Disaster."
  • The model learned the noise, not the signal

Underfitting

The model is too rigid to capture real patterns. A straight line through clearly curved data.

  • Like the student who only read the chapter summaries
  • Training R² = 0.05 → "Hmm."
    • Test R² = 0.04 → "At least it's consistent... consistently bad."
  • The model missed the real patterns in the data

The Sweet Spot

[Figure: prediction error as a function of model complexity, from simple to complex. Training error falls steadily as complexity increases; test error is U-shaped — high in the underfitting region, lowest at the sweet spot, rising again in the overfitting region.]

A Concrete Example

Using the synthetic DASS dataset from Week 2:

Circular overfitting: If we include the individual DASS items (DASS_1, DASS_2, ... DASS_21) as features to predict DASS_Depression — the model "learns" the scoring formula. R² ≈ 1.0 on training data. But it hasn't learned anything about how lifestyle relates to depression.

Clinical danger: A model appears to predict suicide risk with 95% accuracy in the training sample — but drops to 55% on new patients. A clinician relying on that model has false confidence in predictions barely better than a coin flip.
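The training/test gap is easy to demonstrate. A sketch on synthetic curved data — degree 15 is an arbitrary, deliberately over-flexible choice, not a recommendation:

```python
# An over-flexible polynomial scores far better on training data than on
# held-out data: that gap is the overfitting.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=60).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(0, 1, 60)     # curved signal plus noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
wiggly = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
wiggly.fit(X_tr, y_tr)
train_r2 = wiggly.score(X_tr, y_tr)   # high — the model memorised the noise
test_r2 = wiggly.score(X_te, y_te)    # noticeably lower on new data
print(round(train_r2, 2), round(test_r2, 2))
```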

Think About It

Psychology traditionally uses relatively rigid models — t-tests, ANOVA, linear regression — that make strong assumptions.

ML offers far more flexible models.

When might a psychologist prefer a biased but stable model over a flexible but unstable one?

Think about: clinical decision-making, replication, and sample sizes.

The Bias–Variance Trade-off

The spectrum from too rigid to too flexible

Bias

Bias = how far off the model's predictions are on average

  • High bias → the model consistently misses the truth
    • A straight line fitted to a curved relationship
  • High bias = underfitting
    • The model is too rigid to capture the real pattern
  • Example: using only age to predict depression (ignoring sleep, exercise, social support...)
    • The model is systematically wrong because it's missing important information

Variance

Variance = how much predictions change when trained on different samples

  • High variance → the model is unstable
    • A wiggly curve that changes dramatically with each new dataset
  • High variance = overfitting
    • The model is fitting noise, not signal
  • Example: a complex model using all 44 variables on a sample of 50
    • Train it on a different 50 people and you get completely different results

The Trade-off

Reducing bias tends to increase variance, and vice versa. You can't minimise both at once.

High Bias

Low Variance

Underfitting

← The trade-off →

Low Bias

High Variance

Overfitting

Stats models you know:
t-tests, ANOVA, linear regression

High bias, low variance — strong assumptions, stable results, may miss patterns

Flexible ML models:
neural networks, random forests

Low bias, high variance — fewer assumptions, capture complexity, need more data

The Replication Crisis Connection

Complex analyses on small samples = high variance = findings that don't replicate.

  • Psychology's replication crisis is partly a bias–variance problem
    • Flexible analyses on small, noisy samples → results that look compelling but don't hold up
  • ML makes this trade-off explicit rather than hiding it behind a single p-value
    • You can see the gap between training and test performance — that gap is the overfitting
  • Cross-validation (coming up next) forces you to confront this directly

Cross-Validation

A more reliable way to evaluate

The Problem with a Single Split

  • A single train/test split gives you one estimate of performance
  • What if you got a lucky split? Or an unlucky one?
    • Maybe the easy-to-predict participants all ended up in your test set
  • Your result depends on which observations happened to land where
  • We need something more robust...

K-Fold Cross-Validation

  1. Split your data into K equal chunks (typically 5 or 10)
  2. Train on K−1 chunks, test on the remaining one
  3. Repeat K times, rotating which chunk is held out
  4. Average the K performance scores

Think of it like running your study K times with different random samples. Instead of one p-value, you get K estimates of performance — and you can see how much they vary.
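The four steps above map directly onto scikit-learn's `cross_val_score`; this sketch uses simulated data rather than the DASS file:

```python
# 5-fold cross-validation: five R² scores, one per held-out chunk, then averaged.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))                        # 200 fake participants
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1, 200)

scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(np.round(scores, 2))       # five R² values — check how much they vary
print(round(scores.mean(), 2))   # the averaged performance estimate
```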

5-Fold Cross-Validation

Fold 1:  [Test ] [Train] [Train] [Train] [Train]  → R²₁
Fold 2:  [Train] [Test ] [Train] [Train] [Train]  → R²₂
Fold 3:  [Train] [Train] [Test ] [Train] [Train]  → R²₃
Fold 4:  [Train] [Train] [Train] [Test ] [Train]  → R²₄
Fold 5:  [Train] [Train] [Train] [Train] [Test ]  → R²₅

Final score = Average of R²₁ through R²₅

Cross-Validation in Practice

  • K = 5 or K = 10 are the most common choices
    • K = 5 is faster; K = 10 gives slightly better estimates
  • If performance is consistent across folds → you can be more confident the model generalises
  • If performance swings wildly → something may be wrong
    • Too few observations? Data not representative? Model too complex?
  • Important: CV is for model development — comparing approaches, tuning settings
    • Still keep a final held-out test set you only touch at the very end

Reference: de Rooij & Weeda (2020) — "Cross-Validation: A Method Every Psychologist Should Know"

A Note on Random Seeds

Data splits are random — different rows land in training vs test each time you run the code.

  • Random seed = a number that makes randomness reproducible
    • Same seed → same split → same results every time
  • During development: fix your seed so results are stable while you build and compare models
  • Before reporting: try a few different seeds and check your conclusions still hold
    • If results change dramatically with a different seed → warning sign

This extends beyond splitting. Models like neural networks start from random initial weights — fixing those seeds makes your entire analysis reproducible. We'll revisit this in later weeks.

random_state=42   ← now you know this little argument is doing something important!
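A minimal demonstration of what the seed buys you (toy data; the seed values 42 and 7 are arbitrary):

```python
# Same seed → identical split, every run. Different seed → a different shuffle.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)    # 20 fake observations

a, _ = train_test_split(X, random_state=42)   # fixed seed
b, _ = train_test_split(X, random_state=42)   # same seed again
c, _ = train_test_split(X, random_state=7)    # different seed

print(np.array_equal(a, b))   # True — fully reproducible
print(np.array_equal(a, c))   # almost certainly False — a different shuffle
```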

Baselines & Metrics

How to know if your model actually works

Baseline Models: "Am I Actually Learning Anything?"

Before building anything complex, ask: what's the dumbest possible prediction?

Mean prediction baseline: Predict the average target value for everyone.

In our synthetic DASS dataset, mean depression ≈ 14. If we predict 14 for every person, MAE ≈ 9 points.

If your fancy model can't beat this, it hasn't learned anything useful.

  • Baselines keep you honest
  • R² = 0.15 sounds bad — until baseline is R² = 0.00 and theoretical max is ~0.30
    • Context matters enormously in behavioural science
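scikit-learn even ships the baseline as a model, `DummyRegressor`. A sketch with fabricated DASS-like scores (the real baseline numbers will differ):

```python
# The mean-prediction baseline: predict the training mean for everyone,
# then measure how far off that is. Any real model must beat this MAE.
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(42)
y_train = rng.normal(14, 10, 500).clip(0, 42)   # fake depression scores
y_test = rng.normal(14, 10, 200).clip(0, 42)

baseline = DummyRegressor(strategy="mean")       # always predicts the mean
baseline.fit(np.zeros((500, 1)), y_train)        # the features are ignored
mae = mean_absolute_error(y_test, baseline.predict(np.zeros((200, 1))))
print(round(mae, 1))                             # the bar your model must clear
```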

Think About It

A research team reports that their ML model predicts therapy outcomes with 72% accuracy.

Sounds impressive — but what if 70% of patients improve regardless of treatment?

Their model barely beats random guessing. Why do you think baseline comparisons are so rarely reported in published ML papers?

Evaluation Metrics for Regression

R²

Coefficient of determination

Proportion of variance explained

R² = 0.30 → model explains 30% of variation in depression scores

Familiar from stats — but now calculated on test data

MAE

Mean Absolute Error

Average prediction error, in same units as target

MAE = 4.2 → predictions off by ~4 points on the 0–42 DASS scale

Intuitive — easy to explain to non-technical audiences

RMSE

Root Mean Squared Error

Like MAE but penalises large errors more

RMSE > MAE → some predictions are very far off

Use when big mistakes are especially costly (e.g., clinical risk)

Rule of thumb: report at least R² (relative performance) and MAE (practical accuracy). Classification metrics (accuracy, precision, recall) come in Week 5.
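All three metrics are one call each in `sklearn.metrics`; the five scores below are invented for illustration:

```python
# R², MAE, and RMSE for the same five predictions.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([10, 22, 5, 30, 14])   # actual DASS-style scores
y_pred = np.array([12, 18, 9, 27, 15])   # a model's predictions

r2 = r2_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(round(r2, 2))     # 0.88 — proportion of variance explained
print(round(mae, 2))    # 2.8  — off by ~3 points on average
print(round(rmse, 2))   # 3.03 — slightly above MAE: a few larger misses
```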

What Counts as "Good"?

It depends on the domain. In behavioural science:

  • R² = 0.10 – 0.15 → small but meaningful
    • Predicting individual behaviour from a few survey items? That's real signal.
  • R² = 0.20 – 0.35 → quite good for individual-level prediction
    • Human behaviour is inherently noisy — we're not predicting physics.
  • R² > 0.50 → check for data leakage or circular features
    • Suspiciously good? Make sure you're not accidentally cheating.

We're predicting what real, complex people do — not the trajectory of a billiard ball. Context matters.

Regularisation

A brief preview for Week 4

Fighting Overfitting with Regularisation

Regularisation = adding a penalty that discourages the model from becoming too complex

Ridge (L2)

  • Shrinks all coefficients toward zero
  • Keeps all features, but makes them smaller
  • "Don't let any single predictor dominate"

Lasso (L1)

  • Can shrink coefficients all the way to zero
  • Effectively drops unimportant features
  • "Which predictors can I remove without losing much?"
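A preview sketch of the difference, on synthetic data where one feature nearly duplicates another and a third is pure noise (the alpha values are illustrative, not tuned):

```python
# Ridge shrinks every coefficient; Lasso can drive some all the way to zero.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + rng.normal(0, 0.05, 200)   # feature 3 ≈ copy of feature 1
y = 3 * X[:, 0] + rng.normal(0, 1, 200)        # only that direction matters

ridge_coef = Ridge(alpha=1.0).fit(X, y).coef_
lasso_coef = Lasso(alpha=0.1).fit(X, y).coef_
print(np.round(ridge_coef, 2))   # all nonzero, weight shared across the pair
print(np.round(lasso_coef, 2))   # at least one coefficient driven to exactly 0
```

Note how Ridge splits the weight across the two near-duplicate features, while Lasso tends to keep one and drop the other — which matters for the Think About It question below.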

Stats Connection: Stepwise Regression

You may have encountered stepwise regression — adding or removing predictors based on significance.

Lasso does something similar but more principled — it simultaneously fits the model and selects features in a single step, rather than testing one predictor at a time.

  • These are the models you'll build in Week 4's lab
  • You'll compare: baseline → linear regression → Ridge → Lasso
  • The AI will help you write the code — you need to understand what the models do and why

Think About It

If a Lasso regression drops a feature entirely (sets its coefficient to zero), does that mean the feature is truly unrelated to the outcome?

Or could it mean something else?

Think about what happens when two features are highly correlated with each other.

Common Misconceptions

  • "Higher R² is always better"
    • Not if it comes from overfitting. Training R² = 0.95 but test R² = 0.10 is a disaster.
  • "More features = better model"
    • Adding irrelevant features introduces noise. Sometimes less is more.
  • "R² = 0.25 means my model is bad"
    • In behavioural science, that's quite respectable for individual-level prediction.
  • "Cross-validation guarantees good performance"
    • It gives better estimates, but can still mislead with structured or clustered data.

The ML Pipeline — Putting It All Together

Data
Features + Target
Preprocess
Clean, Scale, Encode
Split
Train / Test + CV
Train
Fit Model
Evaluate
R², MAE, RMSE

This is the workflow you'll follow in every lab — Week 4 will be your first time through it.
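The five boxes above, as a runnable sketch on synthetic data (Ridge and the 75/25 split are illustrative choices, not the required lab settings):

```python
# Data → Preprocess → Split → Train → Evaluate, end to end.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Data: fake features and target
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = X @ np.array([2.0, -1.5, 0.0, 0.5]) + rng.normal(0, 1, 300)

# Split first; scaling lives inside the pipeline, so it is learned
# from the training fold only — no peeking at the test set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X_tr, y_tr)

# Evaluate on held-out data
test_r2 = model.score(X_te, y_te)
test_mae = mean_absolute_error(y_te, model.predict(X_te))
print(round(test_r2, 2), round(test_mae, 2))
```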

Getting Ready for Week 4

Next week you'll put all of this into practice:

  • Build regression models on a real DASS dataset
    • 39,775 real survey responses — not synthetic data this time
  • Use scikit-learn (sklearn) — the standard Python library for ML
    • Your AI assistant helps with the code — you focus on what and why
  • Compare: Baseline → Linear Regression → Ridge → Lasso
    • Which model generalises best to held-out data?
  • LLM skill focus: debugging
    • When code breaks, share the error with your AI and guide it to a fix

Your New Shared Language with AI

The vocabulary from this week becomes your tool for communicating with AI coding assistants:

"My features are TIPI Extraversion and Emotional Stability. My target is DASS Depression. I want to train a Ridge regression with 5-fold cross-validation and report MAE and R²."

The AI knows exactly what you mean. That's the power of speaking ML.

Before next week: Read the companion reading if you haven't already. Review the key terms. You don't need to memorise code — just the concepts.

Homework: Download the Week 4 dataset before class!

conda activate psyc4411-env
cd weeks/week-04-lab/data
python download_data.py

Key References

  • Yarkoni, T., & Westfall, J. (2017). Choosing prediction over explanation in psychology: Lessons from machine learning. Perspectives on Psychological Science.
  • de Rooij, M., & Weeda, W. (2020). Cross-validation: A method every psychologist should know. Advances in Methods and Practices in Psychological Science.

Full reading list: readings.md

Questions?

Next week: Your First ML Pipeline — Regression with scikit-learn

PSYC4411 · Macquarie University · 2026