PSYC4411

Find Structure, Don't Fabricate It

Week 8 Challenge Lab

PCA, UMAP & Clustering on Real Mental Health Data

Monday 27 April is a public holiday — complete this lab in your assigned groups in your own time

Today's Challenge

Phase 1: Dimensionality Reduction
- PCA on 42 DASS items — how many components?
- UMAP visualisation coloured by depression severity
Phase 2: Clustering
- k-Means on Big Five + DASS subscales (8 features)
- Profile your clusters — what do they mean psychologically?
Stability & Documentation
- Test whether your clusters survive a different random seed
- Write a methods paragraph — then verify the AI got it right
Prepare a summary slide
- Scree plot, UMAP, cluster profiles, stability check, methods paragraph correction
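
The "how many components?" question in Phase 1 can be sketched as follows. This is a minimal illustration on synthetic data, not the starter notebook's code: the simulated items, the sample size of 1,000, and all variable names are assumptions made for the example. It prints the per-component and cumulative variance that a scree plot would display.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 42 DASS items (hypothetical data, not the real survey):
# one shared "general distress" factor plus item-specific noise.
rng = np.random.default_rng(0)
n_people, n_items = 1000, 42
general = rng.normal(size=(n_people, 1))
items = 0.7 * general + rng.normal(size=(n_people, n_items))

# Standardise the items so each contributes equally, then fit a full PCA.
z = StandardScaler().fit_transform(items)
pca = PCA().fit(z)

explained = pca.explained_variance_ratio_
cumulative = np.cumsum(explained)

# Print the numbers a scree plot would show for the first five components.
for i in range(5):
    print(f"PC{i + 1}: {explained[i]:.1%} (cumulative {cumulative[i]:.1%})")
```

On the real data, look for the "elbow" where additional components stop adding much variance, and compare what you see with the target numbers for this lab.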

The Dataset: DASS-42 (from Week 4)

Same data, different question
- Week 4: “Can personality predict depression?”
- Week 8: “Are there subgroups in how people experience distress?”
After cleaning: ~34,500 respondents
PCA input: 42 DASS items (distress symptoms)
Clustering input: 5 Big Five + 3 DASS subscales = 8 features
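
A rough sketch of clustering on the 8 features and then profiling the clusters is shown below. The column names (`O`, `C`, `E`, `A`, `N`, `depression`, `anxiety`, `stress`) and the synthetic two-group data are assumptions for illustration; the starter notebook may name its columns differently.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical column names; check the starter notebook for the real ones.
features = ["O", "C", "E", "A", "N", "depression", "anxiety", "stress"]

# Synthetic stand-in data: two loose groups that differ on N and the DASS subscales.
rng = np.random.default_rng(0)
low = rng.normal(0.0, 1.0, size=(300, 8))
high = rng.normal(0.0, 1.0, size=(300, 8))
high[:, 4:] += 2.0  # shift N, depression, anxiety and stress upward
df = pd.DataFrame(np.vstack([low, high]), columns=features)

# Standardise so features on different scales contribute equally to distances.
z = StandardScaler().fit_transform(df[features])

# Fit k-means with a fixed seed and several initialisations.
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(z)
df["cluster"] = km.labels_

# Profile: mean z-score of each feature within each cluster, plus cluster sizes.
profile = pd.DataFrame(z, columns=features).groupby(df["cluster"]).mean()
print(profile.round(2))
print(df["cluster"].value_counts().sort_index())
```

Reading the profile table row by row (which features sit above or below zero in each cluster) is what lets you say what a cluster means psychologically.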

Data prep is done for you! The starter notebook loads, cleans, and scores the data in cells 1–5. You start the analysis from cell 6.

New this week: the VCL fake-word filter removes careless responders, that is, participants who claim to know words that don’t exist.
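
The idea behind the filter can be sketched in a few lines. The starter notebook already does this for you, and its exact rule may differ. The sketch assumes the vocabulary checklist columns are named `VCL1` to `VCL16` (1 = ticked) and that `VCL6`, `VCL9` and `VCL12` are the non-existent words; confirm both against the dataset codebook.

```python
import pandas as pd

# Assumption: VCL6, VCL9 and VCL12 are the fake words (check the codebook).
FAKE_WORDS = ["VCL6", "VCL9", "VCL12"]

# Tiny hypothetical sample: respondents 1 and 3 ticked at least one fake word.
df = pd.DataFrame({
    "respondent": [0, 1, 2, 3],
    "VCL6": [0, 1, 0, 0],
    "VCL9": [0, 0, 0, 1],
    "VCL12": [0, 0, 0, 1],
})

# Flag anyone who claims to know at least one fake word, then drop them.
careless = df[FAKE_WORDS].sum(axis=1) > 0
clean = df.loc[~careless].reset_index(drop=True)

print(f"Removed {careless.sum()} of {len(df)} respondents")
```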

New LLM Skill: Documentation

Week 2: Prompting · Week 4: Debugging · Week 6: Refactoring · Week 8: Documentation

Weak

“Explain my code.”

Strong

“I ran PCA on 42 DASS items from 34,500 participants. Write a methods paragraph for a psychology journal (APA style). Include: sample size, measures, preprocessing, number of components, variance explained, clustering details, evaluation metrics. Be specific about software.”

The AI’s draft is a starting point. You MUST verify every number and method name matches your actual analysis. Documentation that doesn’t match is worse than no documentation.
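
One way to make the number check systematic is sketched below. The values in `actual` and the text in `draft` are hypothetical placeholders; replace them with numbers read from your own notebook output and the paragraph your AI produced. The check only catches numbers that never appear in the draft, so method names and software versions still need to be read by a human.

```python
import re

# Hypothetical values: replace with numbers from your own analysis output.
actual = {
    "sample size": "34,500",
    "number of DASS items": "42",
    "number of clusters": "2",
}

# Hypothetical AI draft containing one wrong number (3 clusters instead of 2).
draft = (
    "Principal component analysis was run on the 42 DASS items "
    "(N = 34,500). k-means clustering identified 3 clusters."
)

# Extract every number, allowing thousands separators and decimals.
numbers_in_draft = set(re.findall(r"\d+(?:,\d{3})*(?:\.\d+)?", draft))

# List every actual value that the draft never states.
missing = [name for name, value in actual.items() if value not in numbers_in_draft]
print("Check these against your analysis:", missing)
```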

Getting Started

Steps

- Meet with your assigned group (in person or online)
- Open starter.ipynb or starter.py
- Run cells 1–5 (data is loaded, cleaned & scored)
- Ask your AI to plan the analysis before it writes any code
- Phase 1: PCA → loadings → UMAP
- Phase 2: Cluster → profile → stability
- Write the methods paragraph & prepare the summary slide
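
Choosing k in Phase 2 can be sketched by comparing silhouette scores across candidate values. The data here are two synthetic, well-separated blobs standing in for the 8 standardised features, so the scores will be much higher than anything you should expect from the real survey.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 8 clustering features (hypothetical data).
rng = np.random.default_rng(1)
group_a = rng.normal(0.0, 1.0, size=(400, 8))
group_b = rng.normal(3.0, 1.0, size=(400, 8))
z = StandardScaler().fit_transform(np.vstack([group_a, group_b]))

# Fit k-means for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(z)
    scores[k] = silhouette_score(z, labels)
    print(f"k={k}: silhouette = {scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print("Best k by silhouette:", best_k)
```

The silhouette computation compares every pair of points, so on roughly 34,500 respondents it can be slow; `silhouette_score` accepts a `sample_size` argument (with `random_state`) to score a random subsample instead.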

Target Numbers

PCA: 3 clear components, ~55% of variance explained in total

PC1: ~45% of variance (general distress factor)

Best silhouette: k=2 (~0.24)

Stability: very high for k=2 (<1% change across random seeds)
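
A seed-stability check can be sketched as below. It assumes "change" means the share of respondents whose cluster assignment differs between two runs, which is one reasonable reading rather than the lab's stated definition; the adjusted Rand index is reported alongside it as a second, label-independent measure. The data are synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the standardised clustering features (hypothetical data).
rng = np.random.default_rng(2)
z = np.vstack([
    rng.normal(0.0, 1.0, size=(400, 8)),
    rng.normal(3.0, 1.0, size=(400, 8)),
])

# Same data, same k, two different random seeds.
labels_a = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(z)
labels_b = KMeans(n_clusters=2, n_init=10, random_state=99).fit_predict(z)

# Adjusted Rand index: 1.0 means identical partitions, regardless of label names.
ari = adjusted_rand_score(labels_a, labels_b)

# Share of changed assignments. Cluster labels are arbitrary, so for k=2 we
# also consider the swapped labelling and keep whichever agrees more.
agree = np.mean(labels_a == labels_b)
changed = 1.0 - max(agree, 1.0 - agree)

print(f"ARI = {ari:.3f}, assignments changed = {changed:.1%}")
```

The label-swap trick only works for k=2; for larger k, rely on the adjusted Rand index or match clusters by their centroids before comparing labels.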