Week 7 · Semester 1, 2026
When there’s no “right answer”
Weeks 2–6
You have a target variable
“Predict depression score”
“Classify elevated vs. minimal”
Clear right answers → clear metrics
This week
No target variable
“Are there subgroups?”
“What are the latent dimensions?”
No right answers → stability matters
Simplify complex data into fewer dimensions while keeping the important patterns.
42 questionnaire items → 3 components
Discover groups of similar observations without being told the groups in advance.
“Are there depression subtypes?”
If you’ve run a factor analysis in statistics, you’ve already done a version of unsupervised ML.
Clustering algorithms will ALWAYS find groups — even in random noise.
This week’s core skill: distinguishing discovery from fabrication.
DASS-42 dataset
“Can personality predict depression scores?”
Target: DASS_Depression
Metrics: MAE, R²
Same dataset
“Are there subgroups of distress?”
No target variable
Metrics: silhouette, stability
Seeing the big picture
Dimensionality reduction finds a simpler view that preserves the important patterns — like summarising a long conversation.
Analogy: Photographing a coffee mug from every angle. Most photos look similar. PCA finds the few angles that show the most different views.
42 questionnaire items → 3 components that capture about 55% of all variation. PCA doesn’t throw away items; it finds a simpler view that preserves the most important patterns.
How much variance does each component capture?
DASS-42: 3 components capture ~55% of all variation. 42 dimensions → 3.
To describe someone’s distress pattern, you’d need all 42 item scores. That’s a lot of numbers.
One single number — their score on PC1 — tells you nearly half the story. It’s like a master volume dial for distress.
“Variance explained” = how much of the total spread in scores this dimension captures.
45% from one component is unusually high — it means distress is more “one big thing” than “42 separate things.”
If everyone who scores high on one distress symptom also scores high on the others, a single “general distress” factor will capture most of the variation. That’s exactly what PC1 is.
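A minimal sketch of how "variance explained" is computed, assuming scikit-learn is installed. The data are simulated (12 items driven by one general factor), not the real DASS-42 responses, and the sample size and noise level are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Simulated stand-in for questionnaire data (NOT the real DASS-42):
# 500 people, 12 items, each item = general distress + item-specific noise.
n_people, n_items = 500, 12
general_distress = rng.normal(size=(n_people, 1))
items = general_distress + rng.normal(scale=1.0, size=(n_people, n_items))

# Standardise each item so no item dominates because of its scale.
items_z = (items - items.mean(axis=0)) / items.std(axis=0)

pca = PCA(n_components=3)
scores = pca.fit_transform(items_z)  # one row of 3 component scores per person

# Share of the total spread captured by each component.
variance_explained = pca.explained_variance_ratio_
for i, share in enumerate(variance_explained, start=1):
    print(f"PC{i}: {share:.1%} of variance")
print(f"Total for 3 components: {variance_explained.sum():.1%}")
```

Because every simulated item shares the same general factor, PC1 should take a large share and the later components very little, which is the "one big thing" pattern described above.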
PC1: 45% of variance
High loadings on all DASS items
People who score high on one symptom tend to score high on everything
PC2: 7% of variance
Depression items load positive, Anxiety items load negative
What distinguishes depression from anxiety
PC3: 4% of variance
Stress items load positive, Anxiety items load negative
What distinguishes stress from anxiety
The DASS three-factor structure emerges — but PC1 (general distress) dominates everything else.
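A hedged sketch of how to inspect loadings, again on simulated data rather than the DASS-42. It assumes three hypothetical subscales of 4 items each, all sharing a general factor; the subscale sizes and factor weights are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_people = 1000
subscale_names = ["Depression", "Anxiety", "Stress"]  # labels for the simulated blocks

# Every item = general distress + a subscale-specific factor + item noise.
general = rng.normal(size=(n_people, 1))
blocks = []
for _ in subscale_names:
    specific = rng.normal(size=(n_people, 1))
    noise = rng.normal(size=(n_people, 4))  # 4 simulated items per subscale
    blocks.append(general + 0.6 * specific + noise)
items = np.hstack(blocks)
items_z = (items - items.mean(axis=0)) / items.std(axis=0)

pca = PCA(n_components=3).fit(items_z)

# components_ holds the item weights for each component (rows = components).
# Note: these are unit-length weights, not loadings rescaled by the eigenvalue.
loadings = pca.components_
pc1 = loadings[0]
for name, start in zip(subscale_names, range(0, 12, 4)):
    print(f"{name} items on PC1:", np.round(pc1[start:start + 4], 2))
```

In this simulation all PC1 weights share one sign, the signature of a general factor. The overall sign of a component is arbitrary, so only the pattern of same versus opposite signs is interpretable.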
A researcher runs PCA on 42 DASS items and finds 3 components. They call these “Depression,” “Anxiety,” and “Stress” and say they’ve “discovered” the three-factor structure.
But those are the three subscales the questionnaire was designed to measure.
Have they discovered structure — or rediscovered something built into the questionnaire from the start?
Non-linear visualisation
UMAP (Uniform Manifold Approximation and Projection) finds non-linear structure and creates 2D maps where similar points cluster together.
PCA is trustworthy but blurry — distances are real. UMAP is vivid but can mislead — cluster gaps and sizes may be artefacts. Use UMAP for exploration, PCA for measurement.
PCA
Groups overlap, boundaries blurry
Distances are meaningful
Reproducible (no randomness)
Variance explained is quantifiable
Best for: measurement, inference, reporting
UMAP
Groups look clean and well-separated
Distances between clusters are NOT meaningful
Cluster sizes and shapes can be artefacts
Results change with different settings
Best for: exploration, hypothesis generation
UMAP is a sketch, not a photograph. A map for exploration, not a measurement tool for inference.
Three things UMAP can distort: distances, densities, and cluster boundaries.
Only local structure is preserved: nearby points stay nearby. Everything else is up for distortion.
UMAP is a sketch, not a photograph. Use it for exploration, not confirmation.
Finding groups in data
Points within a cluster are similar to each other
Points in different clusters are dissimilar
Remember: clustering algorithms will always find groups, even in random data. The question is whether the groups are real.
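A small demonstration of that warning, assuming scikit-learn and using its `KMeans` as one example of a clustering algorithm. The data are pure random noise from a single Gaussian, so by construction there are no real groups.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)

# Pure noise: 300 points from ONE Gaussian blob, so no true subgroups exist.
noise = rng.normal(size=(300, 2))

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(noise)

# The algorithm still hands back 3 "clusters", each with members.
cluster_sizes = np.bincount(labels, minlength=3)
noise_silhouette = silhouette_score(noise, labels)
print("Cluster sizes:", cluster_sizes)
print(f"Silhouette on noise: {noise_silhouette:.2f}")
```

The silhouette printed here is a useful baseline: a score on real data is only interesting if it clearly beats what the same procedure produces on structureless data.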
Centroid-based
Define clusters around central points. Fast and intuitive.
You choose k in advance
Agglomerative
Build a tree of nested groups. See the full hierarchy.
No need to pre-specify k
Density-based
Find dense regions. Handles odd shapes and noise.
Identifies “noise” points
Place Centroids
Randomly place k central points in the data space
Assign Points
Each data point joins its nearest centroid
Move Centroids
Each centroid moves to the mean of its assigned points
Repeat
Steps 2–3 until nothing changes (convergence)
Fast, intuitive, and usually the first method to try. But assumes roughly spherical, equal-sized clusters.
The data points stay put — only the centroids move. The algorithm converges when nobody changes cluster. Different random starting positions can give different results, which is why stability checks matter.
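The four steps above can be sketched in NumPy. This is an illustrative implementation, not the one used in the lab; starting the centroids at randomly chosen data points is an assumption (one common way to "randomly place" them), and the two-blob data are simulated.

```python
import numpy as np

def kmeans(points, k, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    # Step 1: place centroids (here: at k randomly chosen data points).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.full(len(points), -1)
    for _ in range(max_iter):
        # Step 2: each point joins its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 4: stop when no point changes cluster (convergence).
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: each centroid moves to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):  # an empty cluster keeps its old centroid
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

# Two simulated, well-separated groups of 50 points each.
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])

labels, centroids = kmeans(points, k=2)
print("Centroids:\n", np.round(centroids, 2))
```

Changing `seed` changes the starting centroids. On clean data like this the result is the same; on messier data it may not be, which is the motivation for stability checks.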
Plot within-cluster distance vs. k
Look for the “elbow” — where adding more clusters stops helping much
How well does each point fit its cluster vs. the nearest other cluster?
Range: −1 (bad) to +1 (perfect)
Neither tool gives a definitive answer. The “right” number of clusters is a judgement call — guided by diagnostics and by whether the clusters make psychological sense.
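A sketch of how both diagnostics could be computed, assuming scikit-learn. The data are simulated with three well-separated groups (centres chosen by hand for illustration), so the answer is known in advance; real questionnaire data will rarely be this clear.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Simulated data with 3 groups (illustrative, not DASS data).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [6, 0], [3, 5]],
    cluster_std=1.0,
    random_state=0,
)

inertias, silhouettes = {}, {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_  # within-cluster sum of squares, for the elbow plot
    silhouettes[k] = silhouette_score(X, model.labels_)
    print(f"k={k}: inertia={inertias[k]:.0f}, silhouette={silhouettes[k]:.3f}")

best_k = max(silhouettes, key=silhouettes.get)
print("Highest silhouette at k =", best_k)
```

Plotting `inertias` against k gives the elbow plot. The highest silhouette is a suggestion, not a verdict: the chosen k still has to be stable and make psychological sense.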
| k | Silhouette |
|---|---|
| 2 | 0.240 |
| 3 | 0.150 |
| 4 | 0.145 |
| 5 | 0.132 |
k = 2 wins — but even 0.24 is modest
Cluster 0 (N = 16,908): Low distress
Depression M = 11.8, Anxiety M = 8.6
Cluster 1 (N = 17,668): High distress
Depression M = 29.7, Anxiety M = 22.8
Advantage: See the full hierarchy and its nested structure (maybe 2 broad groups, each containing 2 subgroups). Cut at any height to get any number of clusters.
Weakness: Different linkage methods (single, complete, Ward’s) give different trees, which is another source of instability.
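A hedged sketch of building one tree and cutting it at two levels, assuming SciPy's `linkage` and `fcluster`. The data are simulated as 2 broad groups that each contain 2 subgroups; the centres and spread are illustrative choices.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)

# Simulated: 2 broad groups (x near 0 vs. x near 10), each with 2 subgroups.
centers = np.array([[0, 0], [0, 3], [10, 0], [10, 3]])
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(25, 2)) for c in centers])

# Build the tree once with Ward's linkage...
tree = linkage(X, method="ward")

# ...then cut it at different levels to get different numbers of clusters.
two_groups = fcluster(tree, t=2, criterion="maxclust")
four_groups = fcluster(tree, t=4, criterion="maxclust")
print("Sizes with 2 clusters:", np.bincount(two_groups)[1:])
print("Sizes with 4 clusters:", np.bincount(four_groups)[1:])

# A different linkage method builds a different tree from the same data.
single_tree = linkage(X, method="single")
single_four = fcluster(single_tree, t=4, criterion="maxclust")
print("Sizes with 4 clusters (single linkage):", np.bincount(single_four)[1:])
```

Because both cuts come from the same tree, every one of the four clusters sits entirely inside one of the two broad clusters. Comparing the Ward and single-linkage sizes on your own data is a quick way to see how much the linkage choice matters.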
Strengths: No need to specify k. Finds odd shapes. Identifies outliers.
Weaknesses: Sensitive to parameters. Struggles with varying density.
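A minimal sketch of density-based clustering, using scikit-learn's `DBSCAN` as one example of the approach. The crescent-shaped data, the two added outliers, and the `eps` and `min_samples` values are all illustrative assumptions; different parameter values can give quite different results.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved crescents: an "odd shape" for centroid-based methods.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Add two points far away from everything else.
outliers = np.array([[3.0, 3.0], [-3.0, -3.0]])
X = np.vstack([X, outliers])

# No k is specified; eps (neighbourhood radius) and min_samples are set instead.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_

# DBSCAN marks points that belong to no dense region with the label -1.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print("Clusters found:", n_clusters)
print("Points labelled as noise:", n_noise)
```

Rerunning with a larger or smaller `eps` shows the parameter sensitivity listed under weaknesses.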
Can you trust your clusters?
If you run the analysis again, do you get the same clusters?
A cluster solution that changes every time you look at it isn’t a discovery — it’s noise.
k = 2 solution across 5 random seeds:
| Seed | Points Changed | % Changed |
|---|---|---|
| 42 (reference) | — | — |
| 123 | 116 | 0.34% |
| 456 | 65 | 0.19% |
| 789 | 54 | 0.16% |
| 2026 | 2 | 0.01% |
This solution is very stable: fewer than 0.35% of participants change cluster. But k = 2 is also a simple solution (low vs. high distress). More nuanced solutions (k = 4, 5) are likely to be less stable and need the same check.
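A sketch of a seed-stability check like the one in the table, assuming scikit-learn and SciPy. The data are simulated (two overlapping groups), and `points_changed` is a hypothetical helper written for this example. Cluster numbers are arbitrary across runs, so labels are matched between runs before counting changes.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def points_changed(reference, other, k):
    """Count points whose cluster differs, after matching labels between runs."""
    # overlap[r, o] = number of points in reference cluster r and other cluster o.
    overlap = np.zeros((k, k), dtype=int)
    for r, o in zip(reference, other):
        overlap[r, o] += 1
    # Pair up clusters one-to-one so that total overlap is as large as possible.
    rows, cols = linear_sum_assignment(-overlap)
    return len(reference) - overlap[rows, cols].sum()

# Simulated data: two overlapping groups (illustrative, not DASS data).
X, _ = make_blobs(n_samples=500, centers=[[0, 0], [3, 0]], cluster_std=1.0, random_state=0)

k = 2
seeds = [42, 123, 456, 789, 2026]
reference = KMeans(n_clusters=k, n_init=1, random_state=seeds[0]).fit_predict(X)

changed = {}
for seed in seeds[1:]:
    labels = KMeans(n_clusters=k, n_init=1, random_state=seed).fit_predict(X)
    changed[seed] = points_changed(reference, labels, k)
    print(f"seed {seed}: {changed[seed]} points changed ({changed[seed] / len(X):.2%})")
```

Setting `n_init=1` makes each run depend on a single random start, so the check is not hidden by scikit-learn's own restarts. Repeating the loop with a larger `k` shows how stability changes as solutions become more fine-grained.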
| Score Range | Interpretation |
|---|---|
| 0.70 – 1.00 | Strong structure (rare in psychology) |
| 0.50 – 0.70 | Reasonable structure |
| 0.25 – 0.50 | Weak structure (common in psychology) |
| < 0.25 | No meaningful structure |
Our k = 2 DASS solution: silhouette = 0.24, just under the 0.25 boundary between weak structure and no meaningful structure. The clusters exist, but they’re not strongly separated.
You cluster participants into 3 groups based on their depression and anxiety patterns. At a conference, someone asks: “Did you check whether those clusters are stable?”
You rerun with a different random seed and get 4 groups with different boundaries.
What should you do next? What does this tell you about the “reality” of the subgroups?
Categories vs. dimensions
Some are useful: Diagnoses guide treatment. Categories simplify communication.
Some lack evidence: Learning styles have been extensively debunked.
But are these categories real — or convenient fictions?
Categorical view
You have depression or you don’t
Discrete types, clear boundaries
Dimensional view
Everyone sits somewhere on a severity continuum
No natural boundaries
Haslam et al. (2020): Meta-analysis of 317 taxometric findings. Findings supporting dimensional models outnumber those supporting categorical models 5:1 across psychology. Most psychopathology is dimensional.
Reification: naming a cluster makes it feel real, even if it barely survives a random seed change.
Names are powerful. Use them after stability checks, not before.
A third perspective: symptoms cause each other rather than being caused by a latent disorder.
Depression isn’t a hidden disease entity — it’s a self-reinforcing network of mutually activating symptoms. This challenges both the categorical and dimensional frameworks.
Mental health researchers have debated for decades whether conditions like depression are discrete “types” or points on a continuum.
Could clustering analysis settle this debate?
What would you need to see in the data to be convinced that discrete types genuinely exist?
“PCA discovers hidden factors.”
PCA finds linear combinations that maximise variance. Whether these correspond to meaningful psychological constructs depends on interpretation, not mathematics.
“More clusters = better model.”
Adding clusters always improves fit on the training data. The question is whether additional clusters are stable and meaningful.
“UMAP shows the true structure.”
UMAP is optimised for visualisation. It can distort distances, densities, and cluster boundaries. Treat it as a sketch, not a photograph.
PCA, UMAP, and clustering on real data
Monday 27 April is a public holiday — no class next week. Complete the Week 8 lab in your assigned groups in your own time.
You may still have the data from Week 4. If not: cd weeks/week-08-lab/data && python download_data.py
In research, your analysis is only as good as your ability to explain it.
Weak prompt
“Explain my code.”
Strong prompt
“I ran PCA on 42 DASS items from 34,576 participants. 3-component solution explaining 55% of variance. Write a methods paragraph for a psychology journal. APA style. Include sample size, measures, software.”
The AI’s draft is a starting point. You must verify every number and method name matches what you actually did.
Week 8: PCA, UMAP & Clustering Lab