Week 3 · Semester 1, 2026
Current Advances in Psychological Methods & Analyses
Prof. Michael J. Richardson
School of Psychological Sciences
Faculty of Medicine, Health and Human Sciences
Macquarie University
A model that fits your data perfectly is almost certainly lying to you.
The real goal isn't to explain the data you have — it's to make predictions about data you haven't seen yet.
Understanding this distinction is what separates useful ML from self-deception.
A translation guide from statistics to machine learning
| What you know (stats) | What it's called in ML |
|---|---|
| IV / Predictor | Feature |
| DV / Outcome | Target (regression) / Label (classification) |
| Running a regression / fitting a model | Training |
| Your sample | Training data |
| Population you generalise to | Test data (held-out) |
| Coefficients / regression weights | Model parameters / weights |
| R², adjusted R² | Evaluation metrics (on test data) |
The statistics approach:
- Fit model to your entire sample
- Ask: "Is this effect real?"
- Assess significance (p-values, confidence intervals)

The ML approach:
- Fit model to part of your data
- Ask: "Can I predict the rest?"
- Evaluate on held-out data (R², MAE)
Same data, fundamentally different question. Statistics asks "is this real?" — ML asks "can I predict what happens next?"
Supervised learning: you have a target variable (labelled data) · Weeks 3–6 of this course
Unsupervised learning: no target variable — looking for structure · Weeks 7–8 of this course
Remember the synthetic DASS dataset from Week 2?
3,000 fake participants · lifestyle variables · DASS-21 depression, anxiety, stress scores
Features (predictors): Sleep_hrs_night, Exercise_hrs_week, SocialMedia_hrs_week, ...
Target (outcome): DASS_Depression (0–42 scale)
This is supervised learning → regression
If you've ever fitted a line of best fit or run a correlation, you've already done a version of machine learning.
You just called it statistics.
The difference isn't the maths — it's the question you ask.
Stats: "Is sleep significantly related to depression?"
ML: "If I know someone sleeps 5 hours a night, what depression score do I predict?"
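That ML question can be asked directly in code. A minimal sketch with scikit-learn, using made-up synthetic data — the variable names, slope, and noise level here are illustrative, not the course dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
sleep_hrs = rng.uniform(4, 9, size=(200, 1))                       # hours of sleep per night
depression = 30 - 2.5 * sleep_hrs.ravel() + rng.normal(0, 3, 200)  # made-up relationship

# "Training" is just fitting the line of best fit
model = LinearRegression().fit(sleep_hrs, depression)

# The ML question: what score do we predict for someone sleeping 5 hours?
predicted = model.predict([[5.0]])[0]
```

The fitted object is the same regression a statistician would interpret via its slope and p-value; here it is simply used to generate a prediction for a new person.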
Demystifying the word
A model is a simplified version of reality that captures a pattern.
In stats, a model describes relationships in your sample.
In ML, a model makes predictions about new data — people it has never seen before.
Preprocessing = getting data ready for the model
We'll define these techniques properly in Week 4 when you put them into practice.
The fundamental goal of machine learning
Generalisation = making accurate predictions about data the model has never seen.
- Memorise every past exam paper word-for-word
- Perfect marks on those specific papers
- New questions? In trouble.
= Overfitting

- Understand the underlying concepts
- Can answer questions you've never seen before
- Adapts to new situations
= Generalising
Understanding the concepts (generalising) beats memorising the specifics (overfitting) every time.
In inferential statistics, you generalise from a sample to a population.
That's what p-values and confidence intervals are for — asking whether patterns in your sample likely exist in the broader population.
ML takes this further: instead of asking "is this effect real?" it asks "can I predict what will happen for a new person?"
This connects directly to Week 1's theme: prediction vs explanation
(Yarkoni & Westfall, 2017)
The simplest defence against self-deception
Before building any model, split your data into two parts: a training set, used to fit the model, and a test set, held back for evaluation.
Like splitting your sample in half, running your analysis on one half, and checking whether the same pattern holds in the other — before publishing.
Never peek at the test set.
If you look at it — even once — during model development, it stops being "new" data and your evaluation is contaminated.
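A minimal sketch of the split itself, assuming scikit-learn; X and y are stand-in synthetic data, not the course dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))                            # 3 made-up features
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 1.0, 300)

# Hold out 20% of rows; the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)  # fit on 80%
test_r2 = model.score(X_test, y_test)             # evaluate on the held-out 20%
```

Everything up to `.fit()` may touch only the training rows; the test rows appear exactly once, in the final scoring call.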
Train/test is the foundation — but there are more sophisticated strategies:
We'll explore these specialised methods in later weeks when we work with more complex datasets.
A researcher builds 20 different models and picks the one that scores best on the test set. They report that model's test performance as their result.
Why is this a problem?
How is it similar to p-hacking in traditional statistics?
When models lie to you
The model memorises the training data — including its noise and quirks. Performs brilliantly on training data but poorly on anything new.
The model is too rigid to capture real patterns. A straight line through clearly curved data.
Using the synthetic DASS dataset from Week 2:
Circular overfitting: If we include the individual DASS items (DASS_1, DASS_2, ... DASS_21) as features to predict DASS_Depression — the model "learns" the scoring formula. R² ≈ 1.0 on training data. But it hasn't learned anything about how lifestyle relates to depression.
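The circularity is easy to demonstrate. A hedged sketch with made-up questionnaire items — any items and any deterministic scoring formula would behave the same way:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
items = rng.integers(0, 4, size=(100, 7)).astype(float)  # 7 fake items, scored 0-3
total = items.sum(axis=1) * 2.0                          # total is just the scoring formula

# Predicting the total from its own items: the model "learns" the formula
r2 = LinearRegression().fit(items, total).score(items, total)
```

R² comes out at (essentially) 1.0, yet the model knows nothing about people — only arithmetic.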
Clinical danger: A model appears to predict suicide risk with 95% accuracy in the training sample — but drops to 55% on new patients. A clinician relying on that model has false confidence in predictions barely better than a coin flip.
Psychology traditionally uses relatively rigid models — t-tests, ANOVA, linear regression — that make strong assumptions.
ML offers far more flexible models.
When might a psychologist prefer a biased but stable model over a flexible but unstable one?
Think about: clinical decision-making, replication, and sample sizes.
The spectrum from too rigid to too flexible
Bias = how far off the model's predictions are on average
Variance = how much predictions change when trained on different samples
Reducing bias tends to increase variance, and vice versa. You can't minimise both at once.
Underfitting (high bias, low variance) ←→ Overfitting (low bias, high variance)
Stats models you know:
t-tests, ANOVA, linear regression
High bias, low variance — strong assumptions, stable results, may miss patterns
Flexible ML models:
neural networks, random forests
Low bias, high variance — fewer assumptions, capture complexity, need more data
Complex analyses on small samples = high variance = findings that don't replicate.
A more reliable way to evaluate
Think of it like running your study K times with different random samples. Instead of one p-value, you get K estimates of performance — and you can see how much they vary.
Final score = Average of R²₁ through R²₅
Reference: de Rooij & Weeda (2020) — "Cross-Validation: A Method Every Psychologist Should Know"
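The K-fold procedure can be sketched in a few lines, assuming scikit-learn and illustrative synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(250, 4))                            # made-up features
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(0, 1.0, 250)

# Five R² scores, one per held-out fold
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
mean_r2 = scores.mean()   # final score = average across the five folds
spread = scores.std()     # how much performance varies fold to fold
```

The spread across folds is the ML analogue of a sampling distribution: a large spread warns you that one lucky split could have misled you.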
Data splits are random — different rows land in training vs test each time you run the code.
This extends beyond splitting. Models like neural networks start from random initial weights — fixing those seeds makes your entire analysis reproducible. We'll revisit this in later weeks.
random_state=42 ← doing something important!
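A quick sketch of what fixing the seed buys you, assuming scikit-learn:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)

# Same seed → identical split, run after run
a_train, a_test = train_test_split(X, test_size=0.25, random_state=42)
b_train, b_test = train_test_split(X, test_size=0.25, random_state=42)

# No seed → the split can differ every time the code runs
c_train, c_test = train_test_split(X, test_size=0.25)

same = np.array_equal(a_test, b_test)  # True: the seeded splits match exactly
```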
How to know if your model actually works
Before building anything complex, ask: what's the dumbest possible prediction?
Mean prediction baseline: Predict the average target value for everyone.
In our synthetic DASS dataset, mean depression ≈ 14. If we predict 14 for every person, MAE ≈ 9 points.
If your fancy model can't beat this, it hasn't learned anything useful.
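A minimal sketch of a baseline comparison, assuming scikit-learn's `DummyRegressor`; the data and effect sizes are made up for illustration, not the DASS figures above:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.normal(size=(400, 3))
y = 14 + X @ np.array([3.0, -2.0, 1.0]) + rng.normal(0, 2.0, 400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# The dumbest possible model: predict the training mean for everyone
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
model = LinearRegression().fit(X_tr, y_tr)

baseline_mae = mean_absolute_error(y_te, baseline.predict(X_te))
model_mae = mean_absolute_error(y_te, model.predict(X_te))
```

The only number worth reporting is the gap: how much better than `baseline_mae` is `model_mae`?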
A research team reports that their ML model predicts therapy outcomes with 72% accuracy.
Sounds impressive — but what if 70% of patients improve regardless of treatment?
Their model barely beats random guessing. Why do you think baseline comparisons are so rarely reported in published ML papers?
Coefficient of determination
Proportion of variance explained
R² = 0.30 → model explains 30% of variation in depression scores
Familiar from stats — but now calculated on test data
Mean Absolute Error
Average prediction error, in same units as target
MAE = 4.2 → predictions off by ~4 points on the 0–42 DASS scale
Intuitive — easy to explain to non-technical audiences
Root Mean Squared Error
Like MAE but penalises large errors more
RMSE > MAE → some predictions are very far off
Use when big mistakes are especially costly (e.g., clinical risk)
Rule of thumb: report at least R² (relative performance) and MAE (practical accuracy). Classification metrics (accuracy, precision, recall) come in Week 5.
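All three metrics are one call each in scikit-learn. A sketch with made-up true and predicted values:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([10, 22, 14, 30, 8], dtype=float)   # hypothetical test-set targets
y_pred = np.array([12, 20, 15, 25, 9], dtype=float)   # hypothetical model predictions

r2 = r2_score(y_true, y_pred)                         # proportion of variance explained
mae = mean_absolute_error(y_true, y_pred)             # average error, in target units
rmse = np.sqrt(mean_squared_error(y_true, y_pred))    # penalises large errors more
```

Here one prediction is off by 5 points while the rest are close, so RMSE comes out larger than MAE — exactly the signature described above.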
It depends on the domain. In behavioural science:
We're predicting what real, complex people do — not the trajectory of a billiard ball. Context matters.
A brief preview for Week 4
Regularisation = adding a penalty that discourages the model from becoming too complex
You may have encountered stepwise regression — adding or removing predictors based on significance.
Lasso does something similar but more principled — it simultaneously fits the model and selects features in a single step, rather than testing one predictor at a time.
If a Lasso regression drops a feature entirely (sets its coefficient to zero), does that mean the feature is truly unrelated to the outcome?
Or could it mean something else?
Think about what happens when two features are highly correlated with each other.
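The correlated-features case is easy to simulate. A hedged sketch, assuming scikit-learn's `Lasso`, in which two features are near-copies of each other and both genuinely relate to the outcome:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(0, 0.05, 300)      # nearly a copy of x1
x3 = rng.normal(size=300)               # genuinely unrelated to y
X = np.column_stack([x1, x2, x3])
y = 2 * x1 + 2 * x2 + rng.normal(0, 1.0, 300)   # both copies matter equally

coefs = Lasso(alpha=0.5).fit(X, y).coef_
# The penalty typically drives one of the correlated pair to exactly zero,
# even though it is strongly related to y; the irrelevant x3 is zeroed too.
n_zeroed = int(np.sum(np.abs(coefs) < 1e-8))
```

A zero coefficient therefore means "not needed given the other features", not "unrelated to the outcome".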
This is the workflow you'll follow in every lab — Week 4 will be your first time through it.
Next week you'll put all of this into practice:
scikit-learn (sklearn) — the standard Python library for ML
The vocabulary from this week becomes your tool for communicating with AI coding assistants:
"My features are TIPI Extraversion and Emotional Stability. My target is DASS Depression. I want to train a Ridge regression with 5-fold cross-validation and report MAE and R²."
The AI knows exactly what you mean. That's the power of speaking ML.
Before next week: Read the companion reading if you haven't already. Review the key terms. You don't need to memorise code — just the concepts.
Homework: Download the Week 4 dataset before class!
conda activate psyc4411-env
cd weeks/week-04-lab/data
python download_data.py
Full reading list: readings.md
Next week: Your First ML Pipeline — Regression with scikit-learn
PSYC4411 · Macquarie University · 2026