Week 5 · Semester 1, 2026
Numbers to categories
Predict a number
“Depression score = 14.3”
Metric: How far off? (MAE, R²)
Predict a category
“Elevated depression: Yes/No”
Metric: Which errors? (more complex)
When you classify people, the type of mistake matters as much as the number of mistakes.
Week 6 target: Classify participants as having elevated depression (PHQ-9 ≥ 5) vs minimal symptoms (PHQ-9 < 5) using real COVID-era survey data.
Probabilities, not just labels
Linear regression draws a straight line. Logistic regression fits an S-shaped curve.
The curve squashes any input to a value between 0 and 1 — we interpret this as a probability.
The probability output is powerful: it lets us adjust how cautious the model is.
Logistic regression has been a workhorse of behavioural research for decades.
Same model, different goals. Traditional statistics asks: “Which predictors matter, and are they significant?” ML asks: “How well does it work on new data?”
Four types of prediction
In screening, false negatives are usually the more dangerous error — we miss someone who needs help.
Here's a real confusion matrix from Python — this is what you'll produce in Week 6:
18 missed cases — people with elevated depression that the model didn't catch. Are you comfortable with that?
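The real matrix comes from the Week 6 data; here is a tiny hand-made version showing what the four cells count (the labels below are invented for illustration):

```python
from collections import Counter

# Made-up labels: 1 = elevated depression, 0 = minimal symptoms
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

# Each (actual, predicted) pair lands in one of four cells
cells = Counter(zip(y_true, y_pred))
tn, fp = cells[(0, 0)], cells[(0, 1)]
fn, tp = cells[(1, 0)], cells[(1, 1)]

print(f"TN={tn} FP={fp}")  # TN=3 FP=1
print(f"FN={fn} TP={tp}")  # FN=1 TP=3  <- FN are the missed cases
```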
Imagine two depression screening tools for a university:
Tool A: Catches 95% of students with elevated depression, but flags 40% of students who are fine.
Tool B: Catches 70% of students with elevated depression, but rarely flags anyone incorrectly.
Which would you choose for a walk-in counselling service?
What about for emergency crisis detection?
Beyond accuracy
Accuracy = proportion of predictions that were correct. Seems like the obvious metric.
The trap: Imagine 95% of your sample does not have elevated depression. A model that predicts “not elevated” for everyone achieves 95% accuracy — without learning anything.
This is class imbalance, and it is why accuracy alone can mislead: when classes are skewed, always report metrics such as F1 or AUC alongside accuracy.
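The trap is easy to demonstrate with a do-nothing classifier on a made-up imbalanced sample:

```python
# Hypothetical sample: 95 of 100 participants are "not elevated" (0)
y_true = [0] * 95 + [1] * 5

# A "model" that predicts "not elevated" for everyone
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.95 -- yet it catches none of the 5 real cases
```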
Of everyone flagged as elevated, how many actually were?
High precision = few false alarms
Of everyone who actually had elevated depression, how many did we catch?
High recall = few missed cases
Harmonic mean of precision and recall — a single number that balances both.
Of everyone who was actually minimal, how many did we correctly identify?
The mirror image of recall: the same calculation, applied to the minimal (negative) class.
When you flag someone, you're almost always right
But you might miss people who need help
You catch almost everyone who needs help
But you also flag many who are fine
For a fixed model, you can't maximise both at once: making it more aggressive at catching cases inevitably increases false alarms.
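The four definitions above, computed from hypothetical confusion-matrix counts (chosen for illustration, not the Week 6 results):

```python
# Hypothetical counts: true/false positives, false/true negatives
tp, fp, fn, tn = 30, 10, 18, 142

precision = tp / (tp + fp)    # of everyone flagged, how many were elevated?
recall = tp / (tp + fn)       # of everyone elevated, how many did we catch?
specificity = tn / (tn + fp)  # of everyone minimal, how many did we clear?
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"precision={precision:.2f}  recall={recall:.3f}  "
      f"specificity={specificity:.3f}  f1={f1:.3f}")
```

Note how the harmonic mean pulls F1 toward the weaker of precision and recall, so neither can be quietly sacrificed.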
0.5 is a choice, not a law
Most classifiers output a probability. The threshold for converting that to a label is your decision.
Lower threshold (e.g., 0.3): cast a wider net. More true positives, more false alarms.
Default threshold (0.5): balanced approach. Equal treatment of both errors.
Higher threshold (e.g., 0.7): more conservative. Fewer false alarms, more missed cases.
Missing someone who needs help is worse than a false alarm.
→ Lower the threshold (e.g., 0.3)
→ Prioritise recall
False positives waste expensive resources.
→ Raise the threshold (e.g., 0.7)
→ Prioritise precision
The right threshold depends on the real-world consequences of each type of error.
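The same probabilities produce different labels as the threshold moves. A sketch with made-up predicted probabilities:

```python
# Made-up predicted probabilities for six participants
probs = [0.15, 0.35, 0.45, 0.55, 0.65, 0.85]

def classify(probs, threshold):
    """Convert probabilities to labels: 1 = flag as elevated."""
    return [int(p >= threshold) for p in probs]

print(classify(probs, 0.3))  # [0, 1, 1, 1, 1, 1] -- wider net
print(classify(probs, 0.5))  # [0, 0, 0, 1, 1, 1] -- default
print(classify(probs, 0.7))  # [0, 0, 0, 0, 0, 1] -- conservative
```

Nothing about the model changes between the three lines; only the decision rule does.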
You've built a classifier that predicts which first-year students are likely to drop out of university.
Low threshold (0.3): many students flagged, extra support spread thin.
High threshold (0.7): only the most at-risk get support, but some are missed.
What factors should guide your choice?
If-then rules for prediction
A decision tree asks a series of yes/no questions about your features, creating a flowchart:
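A hand-written sketch of such a flowchart. The features and cut-offs below are invented for illustration; a real tree learns its splits from the training data:

```python
def predict(sleep_hours, stress_score):
    """A tiny hand-made 'tree': each if-statement is one yes/no split."""
    if stress_score >= 7:        # first split
        if sleep_hours < 6:      # second split
            return "elevated"
        return "minimal"
    return "minimal"

print(predict(sleep_hours=5, stress_score=8))  # elevated
print(predict(sleep_hours=8, stress_score=3))  # minimal
```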
The downside is serious: an unrestricted tree keeps splitting until it perfectly memorises the training data. Like a clinician building an ever-more-elaborate checklist that encodes quirks of specific patients rather than general patterns.
Many trees are better than one
One tree is interpretable but unstable. What about many trees?
Like asking 100 slightly different experts for their opinion and going with the consensus.
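The consensus step is just a majority vote over the trees' predictions. A minimal sketch with made-up votes:

```python
from collections import Counter

# Made-up predictions from five 'trees' for one participant (1 = elevated)
votes = [1, 1, 0, 1, 0]

# The forest predicts whatever most trees predict
consensus = Counter(votes).most_common(1)[0][0]
print(consensus)  # 1 -- three of five trees say "elevated"
```

Because each tree sees slightly different data, their individual mistakes tend to cancel out in the vote.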
Random forests rank which features contributed most across all trees:
Feature importance can also guide feature selection — drop low-importance features to simplify the model.
Your logistic regression achieves 82% accuracy on a depression classification task.
A decision tree achieves only 76% accuracy.
A colleague says you should use the logistic regression because it's more accurate.
Can you think of a situation where you might prefer the less accurate tree?
When algorithms affect people
An algorithm used in US courts to predict whether defendants would reoffend.
ProPublica found: Black defendants were almost twice as likely to be incorrectly labelled as high risk (false positives) compared to white defendants — even when controlling for prior criminal history.
A widely used algorithm identified patients who would benefit from extra care.
The algorithm used healthcare costs as a proxy for health needs. But Black patients historically had less access to healthcare (lower costs) → the algorithm systematically under-identified Black patients who needed care.
The algorithm wasn't “racist” in its code — it learned from biased data that reflected historical inequities.
Chouldechova (2017) proved mathematically that when base rates differ between groups, you cannot simultaneously achieve:
Equal false positive rates across groups
Equal false negative rates across groups
Equal predictive values across groups
This isn't a technical problem to solve — it's a values question about which type of fairness matters most in each context.
You've built a depression classifier that achieves 85% accuracy overall.
When you check across demographic groups: 90% accurate for one group but only 72% for another.
Should you deploy this model?
What questions would you want to ask before deciding?
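One concrete first step is to compute the metric separately per group rather than only overall. A sketch with made-up (group, actual, predicted) records:

```python
# Made-up records: (group, actual label, predicted label)
records = [
    ("A", 1, 1), ("A", 0, 0), ("A", 0, 0), ("A", 1, 1), ("A", 0, 1),
    ("B", 1, 0), ("B", 0, 0), ("B", 1, 0), ("B", 0, 1), ("B", 1, 1),
]

# Overall accuracy can hide a large gap between groups
group_acc = {}
for group in ("A", "B"):
    rows = [(a, p) for g, a, p in records if g == group]
    group_acc[group] = sum(a == p for a, p in rows) / len(rows)

print(group_acc)  # {'A': 0.8, 'B': 0.4}
```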
Same pipeline as regression — different metrics and different model types.
Your second challenge lab
Week 2: Prompting · Week 4: Debugging · Week 6: Refactoring
Weak prompt
“Clean up my code.”
Strong prompt
“Refactor this pipeline to: (1) separate data loading from modelling, (2) add assertions to verify data shape after each merge, (3) create a reusable function for fitting and evaluating a model, (4) add docstrings, (5) add comments explaining why each step is done. Prioritise readability over cleverness.”
Refactoring = making code cleaner and more maintainable without changing what it does.
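A toy example of the idea (hypothetical code, not the lab pipeline): the refactored version must give identical results.

```python
# Before: terse names, no documentation, manual loop
def m(xs):
    t = 0
    for x in xs:
        t = t + x
    return t / len(xs)

# After: same behaviour, clearer name, docstring, idiomatic body
def mean(values):
    """Return the arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)

assert m([1, 2, 3]) == mean([1, 2, 3])  # refactoring must not change output
```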
conda activate psyc4411-env
cd weeks/week-06-lab/data
python download_data.py
Downloads ~22 MB of CSV files. If you've already done this for a Week 4 bonus challenge, the files will already be there.
Full reading list: readings.md
Next week: Build a Defensible Classifier — Trees, Ensembles, and Real Data
PSYC4411 · Macquarie University · 2026