Week 11 · Monday 18 May 2026
In Week 10 you trained an MLP — a stack of fully-connected neurons that takes a flat vector in and produces a prediction.
When the architecture matches the structure of the data, fewer parameters do more. The network can learn useful features instead of needing them engineered by hand.
CNNs, RNNs, and the shape of behavioural data
Rows × columns. Order doesn’t matter. MLP works fine.
Spatial structure. Nearby pixels matter. CNN.
Temporal structure. Order matters. RNN/LSTM — or Transformer.
Same underlying maths (weights, activations, gradient descent) — different wiring to match the data.
A small learnable filter slides across the image. At each position it produces one number — how strongly its pattern is present there. Hundreds of filters per layer, all learned from data.
a vertical-edge detector
Bright = strong response. This filter responds strongly to vertical edges where the image goes from dark to light. Other filters (curves, textures, colour blobs) work the same way — they all get learned from data.
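The sliding-filter idea can be sketched in a few lines of NumPy. This is an illustration only: the 3×3 filter values and the toy image are hand-written assumptions, whereas a real CNN learns its filter values from data and uses optimised library routines rather than Python loops.

```python
import numpy as np

def slide_filter(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the response map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    response = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            response[i, j] = np.sum(patch * kernel)  # one number per position
    return response

# Toy 5x6 image: dark (0) on the left, bright (1) on the right.
image = np.array([[0, 0, 0, 1, 1, 1]] * 5, dtype=float)

# Hand-written vertical-edge filter (illustrative; a CNN would learn its own values).
vertical_edge = np.array([[-1, 0, 1],
                          [-1, 0, 1],
                          [-1, 0, 1]], dtype=float)

response = slide_filter(image, vertical_edge)
print(response)
```

The response is zero over the flat dark and flat bright regions and largest where the filter straddles the dark-to-light boundary.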
Nobody designs these features. The network discovers them. This is representation learning (Week 9) applied to images.
CNNs were niche for decades. Then Krizhevsky, Sutskever & Hinton (2012) cut the top-5 error rate on ImageNet (1.2M training images, 1,000 categories) from about 26% for the next-best entry to about 15%.
Within five years CNNs dominated every image task — and started spreading to medical imaging, video, and behavioural science.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. NeurIPS. Reprinted in Communications of the ACM, 60(6), 2017. doi.org/10.1145/3065386
Marker-less pose estimation: CNNs trained to localise body landmarks in video frames. No physical markers, no calibrated rigs — just a phone camera.
Animal & human pose tracking, the foundational tool in behavioural neuroscience.
Real-time multi-person 2D pose for humans. Workhorse of gesture and gait research.
Google’s on-device pose, face, hand tracking. Runs in a browser, runs on a phone.
Free, open-source, no marker set-up.
Top-down: detect people first, then find keypoints inside each box.
Bottom-up: find all keypoints first, then group them into people.
Same end product (keypoints per person) — very different inside.
Barrett et al. (2019) — thorough critique of inferring discrete emotion categories from facial movements.
Action units are measurable. “Anger” as a face is contested.
FACS Action Units (Ekman & Friesen, 1978). Figure from Barrett et al. (2019).
One video clip in → one or more behavioural codes out. The CNN replaces (some of) the human rater. Speed and scale increase by orders of magnitude.
Pose estimation has made behavioural research dramatically cheaper and faster — a single laptop can now do what once needed an expensive motion-capture lab.
What does this mean for the kinds of studies that get done? Does it widen access to good methods, change which questions get asked, or both?
Suppose you want to study cognitive load: when participants are mentally taxed, do their body movements change?
This is the integrated workflow: deep learning extracts features that classical ML then uses. Most applied behavioural pipelines look like this.
Workload effects show up in every modality — performance, physiology, eye, and pose. Pose tracks workload as well as traditional physiological measures.
Patil et al. (under review) — Pose Estimation for Cognitive Workload Classification.
Phone-camera video. No markers.
DeepLabCut / MediaPipe extracts keypoints.
Velocity, smoothness, postural sway, blink rate.
Random forest, logistic regression (Week 6).
Cognitive load state per trial.
The CNN replaces the marker-tracking hardware. The downstream ML is the same code you wrote in Week 6.
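The feature-extraction step of this workflow might look roughly like the sketch below. Everything here is an assumption for illustration: the `(frames, joints, 2)` array layout, the 17-joint skeleton, the 30 fps frame rate, the synthetic data, and the specific feature definitions (mean speed, mean squared jerk as a smoothness proxy, spread of the body centroid as sway). It is not the method of the cited study, and blink rate is left out because it would come from face landmarks rather than body keypoints.

```python
import numpy as np

def pose_features(keypoints, fps):
    """Summarise one trial of keypoints with shape (frames, joints, 2)."""
    dt = 1.0 / fps
    velocity = np.diff(keypoints, axis=0) / dt          # (frames-1, joints, 2)
    speed = np.linalg.norm(velocity, axis=2)            # (frames-1, joints)
    jerk = np.diff(keypoints, n=3, axis=0) / dt ** 3    # third derivative of position
    centroid = keypoints.mean(axis=1)                   # (frames, 2) body centre per frame
    return {
        "mean_speed": float(speed.mean()),
        "mean_sq_jerk": float((jerk ** 2).sum(axis=2).mean()),  # lower = smoother
        "sway": float(centroid.std(axis=0).sum()),      # spread of the body centre
    }

rng = np.random.default_rng(0)
frames, joints, fps = 90, 17, 30

# Synthetic stand-ins for pose-estimation output: one still trial, one fidgety trial.
still = np.tile(rng.uniform(0, 1, size=(1, joints, 2)), (frames, 1, 1))
fidgety = still + rng.normal(0, 0.01, size=still.shape)

print(pose_features(still, fps))
print(pose_features(fidgety, fps))
```

Each trial becomes one row of numbers, which is the tabular input a random forest or logistic regression expects.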
Real game footage → CNN detects and tracks every player’s pose, in real time, no markers.
Multi-person 2D pose tracking on broadcast video. Same machinery, scaled from one webcam to a whole field of players.
Same machinery as the case study — running in your browser, on your webcam, in real time.
Webcam stream → CNN extracts body / face keypoints → overlay drawn on top.
Watching the CNN tag joints in real time is a useful sanity check: you can see which parts the model finds easy or hard.
A research team uses a CNN to score facial expressions in clinical interviews, replacing two human raters. The model agrees with humans 85% of the time. They publish faster, sample more participants, and improve statistical power.
What is lost? What is gained? When would you trust the model over the humans — and does the answer change if the model was trained on a population that doesn’t look like yours?
For sequences — speech, EEG, gestures — the network needs memory. The hidden state is that memory.
One machine. Applied step by step. The hidden state threads through every step, carrying a running summary of everything seen.
The same network repeated — with a hidden state threading through all the steps.
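A minimal NumPy sketch of this loop, assuming a plain tanh RNN with randomly initialised (untrained) weights and made-up sizes. The point is only that one set of weights is reused at every step and that `h` is the running summary.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4   # illustrative sizes

# One set of weights, reused at every time step.
W_xh = rng.normal(0, 0.5, size=(hidden_size, input_size))
W_hh = rng.normal(0, 0.5, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """Combine the current input with the previous hidden state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

sequence = rng.normal(size=(5, input_size))   # 5 time steps of random input
h = np.zeros(hidden_size)                     # empty memory at the start
states = []
for x_t in sequence:
    h = rnn_step(x_t, h)                      # same function, new input
    states.append(h)

print(np.round(np.array(states), 3))
```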
Conceptually elegant. Practically, vanilla RNNs have a problem.
When the network looks back across many steps, the learning signal shrinks toward zero — the RNN can’t learn long-range dependencies. It forgets.
Or the opposite — the signal grows uncontrollably and training becomes unstable. The network blows up.
Vanilla RNNs work fine on short sequences (10–20 steps) but struggle past that. We need gated memory.
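A rough way to see both failure modes: looking back T steps multiplies the learning signal by roughly the same kind of factor T times. The sketch below uses a single scalar factor as a stand-in for that repeated multiplication (an assumed simplification; in a real RNN the factor is a matrix that changes at every step).

```python
def signal_after(steps, factor):
    """Size of a learning signal after being scaled by `factor` at each step back."""
    signal = 1.0
    for _ in range(steps):
        signal *= factor
    return signal

for steps in (10, 20, 100):
    shrinking = signal_after(steps, 0.8)   # factor below 1: the signal vanishes
    growing = signal_after(steps, 1.2)     # factor above 1: the signal explodes
    print(steps, shrinking, growing)
```

At 10 to 20 steps both signals are still usable numbers; at 100 steps one is effectively zero and the other is enormous.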
Hochreiter & Schmidhuber (1997): a memory cell with learned gates running along a cell-state highway. The standard modern LSTM has three gates: forget, input, and output.
All three gates are learned from data. The network discovers what to remember, forget, share — and the cell-state highway lets gradients flow back through long sequences without vanishing.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation. doi.org/10.1162/neco.1997.9.8.1735
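One step of an LSTM cell can be sketched as follows. This assumes the now-standard formulation with forget, input and output gates, a single stacked weight matrix `W`, and random untrained weights. The sizes and variable names are illustrative, and in practice you would use a library layer rather than write this by hand.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev, x_t] to the four stacked pre-activations."""
    n = h_prev.size
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0 * n:1 * n])      # forget gate: what to erase from the cell
    i = sigmoid(z[1 * n:2 * n])      # input gate: what new information to store
    o = sigmoid(z[2 * n:3 * n])      # output gate: what to share as the hidden state
    g = np.tanh(z[3 * n:4 * n])      # candidate values to write
    c_t = f * c_prev + i * g         # cell-state highway: gated, additive update
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W = rng.normal(0, 0.5, size=(4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for x_t in rng.normal(size=(6, input_size)):   # 6 time steps of random input
    h, c = lstm_step(x_t, h, c, W, b)

print(np.round(h, 3), np.round(c, 3))
```

The line `c_t = f * c_prev + i * g` is the highway: when the forget gate stays near 1, the cell state (and the gradient flowing back along it) passes through a step almost unchanged.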
Until around 2020, all major speech systems were RNN/LSTM-based. Now used in clinical voice analysis, child language research.
Sequence classifiers for brain–computer interface, sleep staging, seizure detection. LSTMs still dominate medical signals.
Classify movements over time, predict next movements. Fall prediction in aged care, intent prediction in BCI.
Auletta et al. (2023) — LSTM + SHAP predicted players’ next target before their conscious intent.
Auletta, F., Kallen, R. W., di Bernardo, M., & Richardson, M. J. (2023). Predicting and understanding human action decisions. Scientific Reports. doi.org/10.1038/s41598-023-31807-1
Auletta et al. (2023) — an LSTM that predicts which target a player will pursue next, sometimes before they consciously decide.
(a) Two players around a multi-touch table herding targets · (b) Playback & classification interface
Confusion matrices: expert vs novice pairs, at τ_hor = 16 and 32 (longer horizon).
Results: 90–97% accuracy across both skill levels and both horizons. The LSTM’s prediction often precedes the player’s reported conscious decision — movement gives away intent before the chooser knows.
The architecture that has eaten the world over the past five years is the transformer.
Full treatment of attention is coming up — it’s the engine inside every LLM.
Vaswani et al. (2017). Attention is all you need. NeurIPS. arxiv.org/abs/1706.03762
Behind the curtain of ChatGPT
Three ingredients, one outcome.
The architecture from earlier in the lecture. Self-attention, stacked many times.
Given the text so far, predict the next token. That’s the entire training objective.
Billions of parameters, trillions of training tokens, thousands of GPUs.
That’s it. ChatGPT is “next token, but really big”. Everything else — helpfulness, refusal, formatting — comes from the post-training we’ll cover in a few slides.
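To make "predict the next token" concrete, here is a deliberately tiny stand-in: a bigram model that counts which token follows which in a made-up corpus. It is an assumption-laden toy. It looks at only one previous token and uses counts, whereas an LLM conditions on the whole context through a transformer and learns by gradient descent. The objective, though, is the same kind of thing: a probability for each possible next token.

```python
from collections import Counter, defaultdict

# Made-up toy corpus, already split into tokens.
corpus = "the cat sat on the mat . the cat ate . the dog sat on the rug .".split()

# Count which token follows which.
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def next_token_probs(token):
    """Estimated probability of each token that has followed `token` in the corpus."""
    counts = following[token]
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}

print(next_token_probs("the"))
print(next_token_probs("sat"))
```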
An LLM doesn’t process letters or whole words — it processes tokens, which are typically subwords.
Common words such as “the” are a single token.
psychology → psych + ology
Why does this matter? Tokens are the discrete units that get mapped to embeddings. The model never sees the text directly: it sees a sequence of integers.
Input sentence:
After tokenisation:
Each chip is one token → one integer ID:
From here on, the LLM is doing maths on integers, not letters.
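The text-to-integers step can be sketched with a toy tokeniser. The vocabulary, the integer IDs, the example sentence and the greedy longest-match rule are all assumptions made for illustration. Production tokenisers learn vocabularies of tens of thousands of subwords from data and use different splitting algorithms.

```python
# Hypothetical toy vocabulary; the IDs are arbitrary.
vocab = {"the": 0, "psych": 1, "ology": 2, "of": 3, "sleep": 4, " ": 5}

def tokenise(text):
    """Greedy longest-match split of `text` into pieces from `vocab`."""
    tokens = []
    pos = 0
    while pos < len(text):
        for end in range(len(text), pos, -1):      # try the longest piece first
            piece = text[pos:end]
            if piece in vocab:
                tokens.append(piece)
                pos = end
                break
        else:
            raise ValueError(f"no token covers {text[pos]!r}")
    return tokens

tokens = tokenise("the psychology of sleep")
ids = [vocab[t] for t in tokens]
print(tokens)   # ['the', ' ', 'psych', 'ology', ' ', 'of', ' ', 'sleep']
print(ids)      # [0, 5, 1, 2, 5, 3, 5, 4]
```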
Each token ID is mapped to a dense vector of numbers — typically 768 to 4,096 dimensions long.
cat [0.12, -0.43, 0.88, …, 0.05]
dog [0.18, -0.39, 0.85, …, 0.07]
Tuesday [-0.72, 0.31, 0.04, …, -0.89]
Vectors are learned so that similar tokens end up close together in space.
“cat” sits near “dog”. Far from “Tuesday”. “depressed” sits near “sad”, “lonely”, “anhedonic”. Far from “rollerblade”.
The closeness of vectors is the model’s representation of meaning. Nothing is stored about the letters — only about which other tokens this one tends to appear near.
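"Close together in space" is usually measured with cosine similarity. A small sketch, using made-up 4-dimensional vectors as stand-ins (real embeddings have hundreds or thousands of learned dimensions, and these values are not taken from any model):

```python
import numpy as np

# Hypothetical 4-dimensional embeddings, invented for illustration.
embeddings = {
    "cat":     np.array([0.9, 0.8, 0.1, 0.0]),
    "dog":     np.array([0.8, 0.9, 0.2, 0.1]),
    "Tuesday": np.array([0.0, 0.1, 0.9, 0.8]),
}

def cosine(a, b):
    """Cosine similarity: 1 = same direction, 0 = unrelated directions."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embeddings["cat"], embeddings["dog"]))       # close to 1
print(cosine(embeddings["cat"], embeddings["Tuesday"]))   # close to 0
```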
If meaning lives in geometry, you can do arithmetic on concepts.
Paris − France + Italy ≈ Rome
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arxiv.org/abs/1301.3781
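The arithmetic can be tried on a toy example. The 2-D vectors below are hand-built assumptions, arranged so that one axis roughly encodes "which country" and the other "is a capital city". Learned embeddings are far messier, which is why the real result is only approximate.

```python
import numpy as np

# Hypothetical 2-D embeddings: axis 0 ~ "which country", axis 1 ~ "is a capital city".
emb = {
    "France": np.array([1.0, 0.0]), "Paris": np.array([1.0, 1.0]),
    "Italy":  np.array([2.0, 0.0]), "Rome":  np.array([2.0, 1.0]),
    "Japan":  np.array([3.0, 0.0]), "Tokyo": np.array([3.0, 1.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Return the word whose vector is closest to emb[a] - emb[b] + emb[c]."""
    target = emb[a] - emb[b] + emb[c]
    candidates = [w for w in emb if w not in (a, b, c)]   # exclude the query words
    return max(candidates, key=lambda w: cosine(emb[w], target))

print(analogy("Paris", "France", "Italy"))   # Rome
```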
Pool the token embeddings for a whole interview transcript and you get one vector per document.
Now you can do quantitative things to qualitative data:
Quantitative thematic analysis — group similar interviews automatically.
Find responses similar to a target answer, even if they share no exact words.
Measure shifts in how someone describes their experience across sessions or years.
Pre-label transcript chunks for human review. Speeds qualitative analysis dramatically.
See Demszky et al. (2023) for a thorough review of LLMs in psychology. Nature Reviews Psychology. doi.org/10.1038/s44159-023-00241-5
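A minimal sketch of pooling and similarity search, assuming mean pooling and a tiny hand-made table of 3-D token vectors. The vectors, the example responses and the whitespace tokenisation are all hypothetical; a real pipeline would take embeddings from a trained model.

```python
import numpy as np

# Hypothetical 3-D token embeddings, invented for illustration.
token_vectors = {
    "tired": np.array([0.9, 0.1, 0.0]), "exhausted": np.array([0.8, 0.2, 0.1]),
    "sleep": np.array([0.7, 0.0, 0.2]), "happy": np.array([0.0, 0.9, 0.1]),
    "joyful": np.array([0.1, 0.8, 0.0]), "friends": np.array([0.1, 0.6, 0.5]),
}

def embed_document(text):
    """Mean-pool the vectors of the known tokens in `text` into one document vector."""
    vectors = [token_vectors[t] for t in text.lower().split() if t in token_vectors]
    return np.mean(vectors, axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

responses = ["exhausted no sleep", "joyful with friends", "always tired"]
query = embed_document("tired")

# Rank responses by similarity to the query, most similar first.
ranked = sorted(responses, key=lambda r: cosine(embed_document(r), query), reverse=True)
print(ranked)
```

Note that "exhausted no sleep" ranks well above "joyful with friends" even though it shares no word with the query "tired".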
Hover over any word and see its nearest neighbours in 3D embedding space.
A surprisingly effective intuition pump for the “meaning is geometry” idea.
projector.tensorflow.org
Try the Word2Vec 10K demo. Type “depression” into the search box. Look at what comes up.
If an LLM “knows” that depression is related to insomnia because their embeddings sit near each other in a high-dimensional space, is that the same kind of knowing that a clinical psychologist has?
What does the model not have access to that the clinician does — lived experience, embodied affect, the relational context of a session — and does it matter for what we use the model for?
For each token in the input, the model asks:
“Which other tokens should I pay attention to?”
Line thickness = attention weight. Some tokens attend strongly to one or two others; some spread their attention more evenly.
Each token is projected into three vectors:
Query: “What am I looking for?”
Key: “What do I offer?”
Value: “What do I carry?”
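These three projections combine into scaled dot-product self-attention, sketched below for a single head. The sizes and the random untrained weights are assumptions, and the sketch omits details that real models add, such as multiple heads and the causal mask that stops a GPT-style model from attending to later tokens.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)    # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over token vectors X (tokens x dim)."""
    Q = X @ W_q                               # queries: what each token is looking for
    K = X @ W_k                               # keys: what each token offers
    V = X @ W_v                               # values: what each token carries
    scores = Q @ K.T / np.sqrt(K.shape[1])    # how well each query matches each key
    weights = softmax(scores)                 # attention weights; each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 4, 8, 5           # illustrative sizes
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print(np.round(weights, 2))
```

Row i of `weights` is the answer to "which other tokens should token i pay attention to?", and the output for token i is the correspondingly weighted mix of value vectors.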
Each block builds a richer representation. The output of block N is the input to block N+1.
ChatGPT is not one training run. It is three stages, stacked.
Pre-training: billions of pages of internet text. Predict the next token. No human labels.
Instruction tuning: fine-tune on (question, helpful answer) pairs written by humans.
RLHF: humans rank outputs. The model learns to produce what humans prefer.
After pre-training alone, the model is excellent at predicting plausible text but not at following instructions. Ask a raw pre-trained GPT “What is the capital of France?” and it might continue with “What is the capital of Germany?”.
“What ChatGPT knows” → pre-training.
“How ChatGPT behaves” → instruction tuning + RLHF.
Fine-tune on (question, helpful answer) pairs written by humans.
The model learns the form of being a helpful assistant.
Reinforcement Learning from Human Feedback.
Humans rank pairs of model outputs. Model trained to produce outputs humans prefer.
Ouyang et al. (2022). Training language models to follow instructions with human feedback. arxiv.org/abs/2203.02155
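Human rankings are typically turned into a training signal through a pairwise loss on a reward model: the score of the preferred output should exceed the score of the rejected one. The sketch below shows that loss for a single pair. The reward values are invented numbers, and the full pipeline (training the reward model, then optimising the language model against it) is not shown.

```python
import math

def preference_loss(reward_preferred, reward_rejected):
    """Pairwise loss: small when the preferred output already scores higher."""
    margin = reward_preferred - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))   # -log(sigmoid(margin))

# Hypothetical reward scores for two outputs to the same prompt.
print(preference_loss(2.0, -1.0))   # reward model agrees with the human: low loss
print(preference_loss(-1.0, 2.0))   # reward model disagrees: high loss
```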
Coding help, summarising, drafting, brainstorming, semantic search, qualitative coding support.
Verifiable factual recall without a tool. Anything where the cost of being wrong is high and you can’t independently verify.
Text and image share one context window: the model sees both and reasons about them together.
Reads screenshots, diagrams, scanned documents, charts.
Long context (1M+ tokens), strong document and chart understanding.
Vision + reasoning, screenshot interpretation, document parsing.
Comparable open-weight models you can run locally — essential when data can’t leave your machine.
Auto-code behavioural images, score facial expressions, parse scanned questionnaires, accessibility tools.
Same risks as text LLMs — hallucinations, bias, confident wrong answers on edge cases. Verify before trusting.
A vision-language model can now look at a photo of a participant’s face and produce a paragraph describing their probable mood, attention, and demographics — in seconds, at scale.
What kinds of psychological studies does this enable? What kinds of mistakes would worry you most? Would you use it to generate participant labels for training another model, or only as one signal among several?
Regression, regularisation, classification, trees, random forests.
Clustering, PCA, UMAP — finding structure without labels.
From perceptron to MLP to a working PyTorch model on real EEG data.
CNNs, RNNs, transformers, embeddings — the map of everything-else.
You started this course with no coding and no ML background. You finish it with a working understanding of every major area of modern data science — and the LLM skills to keep going.
weeks/week-12-viva-review/
Skim each week’s companion reading. Walk through the slide deck’s key figures.
For lab weeks, re-open your notebooks — the act of running a cell brings the concept back.
Use the AI as your study partner. Ask it to quiz you.
PSYC4411 · Week 11
Thank you — see you for the viva in Week 12.
Beyond Tabular · 18 May 2026