
Enhancing Caption Accessibility
Design Framework for Emotive Captioning Systems
— The Challenge
How might captions communicate tone and emotion — without increasing cognitive load?
Standard closed captions accurately transcribe speech, but they flatten emotional nuance. Tone, sarcasm, urgency, and personality are often reduced to plain text, leaving Deaf and Hard-of-Hearing (DHH) viewers to infer meaning from incomplete cues.
I explored how typography, positioning, and system-level customization could restore emotional meaning to captions while preserving clarity, legibility, and user control.
"Standard captions fail to describe the personality of a sound. There's a whole world in how something is said — and right now, captions pretend it doesn't exist."
— Takeaways
Learning as I Built
This project began as a research investigation and evolved into a speculative system redesign. I had to move beyond viewing captions as a formatting layer and instead treat them as an interface that shapes participation.
The deeper I went, the clearer the tension became: expressiveness improves emotional clarity — but risks increasing cognitive load. Designing within that tension became the core of the project.


How I Started — Mapping Emotional Dimensions
The design process began with a thorough review of academic literature on affective captioning, speech visualisation, and accessible media design. Several key sources shaped the direction of the project.
Research by de Lacerda Pataca et al. (2024) used the circumplex model of emotion — a framework that maps emotional states across two axes — as a foundation for designing caption styles.
The two axes are:
Valence: how positive or negative an emotion feels (e.g., joy vs. fear)
Arousal: how energetic or calm the emotion is (e.g., rage vs. sadness)
User studies in that research found that colour was the most intuitive typographic variable for communicating valence: warm reds and oranges for negative emotions, cooler greens and blues for positive ones. Size and font weight were most effective for conveying arousal: larger, bolder text read as louder and more intense.
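As a rough illustration of that mapping, the sketch below turns a valence/arousal pair into caption styling. The `CaptionStyle` fields, the two hue values, and the scaling ranges are illustrative assumptions, not values taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptionStyle:
    hue_degrees: float  # HSL hue; assumed 15 = warm red-orange, 200 = cool blue
    saturation: float   # 0.0 = neutral (uncoloured), 1.0 = fully coloured
    font_scale: float   # multiplier on the base caption size
    font_weight: int    # CSS-style weight: 400 regular, 700 bold


def clamp(value, low, high):
    """Keep value inside the closed range [low, high]."""
    return max(low, min(high, value))


def style_for_emotion(valence, arousal):
    """Map a circumplex point (both axes in [-1, 1]) to a caption style."""
    v = clamp(valence, -1.0, 1.0)
    a = clamp(arousal, -1.0, 1.0)

    # Valence drives colour: warm hue for negative, cool hue for positive.
    # Colour strength grows with distance from neutral valence.
    hue = 15.0 if v < 0 else 200.0
    saturation = abs(v)

    # Arousal drives size and weight: higher arousal reads as larger and bolder.
    # Calm speech shrinks slightly but never drops below regular weight.
    font_scale = 1.0 + 0.25 * a
    font_weight = 400 + int(max(a, 0.0) * 3 + 0.5) * 100

    return CaptionStyle(hue, saturation, font_scale, font_weight)
```

Under these assumptions, a line tagged as rage (`style_for_emotion(-0.8, 0.9)`) comes out warm-coloured, enlarged, and bold, while a neutral line keeps the regular size and weight with no colour applied.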

However, the same research revealed a critical counterpoint: all four of the top-performing affective caption styles scored significantly lower on ease of reading than a plain, unstyled baseline. Expressive captions are emotionally richer but cognitively heavier.

A 2007 study by Lee et al. added another layer of complexity: Deaf viewers largely disliked graphical caption enhancements, preferring to read emotion from actors' faces and body language, while Hard-of-Hearing viewers found the same enhancements genuinely useful. This divergence made it clear that no single design could serve all users.
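One way to act on that divergence is to let each viewer choose how much expressive styling they see. The sketch below is a hypothetical illustration, not a tested design: the `intensity` setting, the `CaptionStyle` fields, and the `PLAIN` baseline values are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptionStyle:
    saturation: float  # 0.0 = uncoloured, 1.0 = fully coloured
    font_scale: float  # multiplier on the base caption size
    font_weight: int   # CSS-style weight: 400 regular, 700 bold


# Assumed plain baseline: no colour, normal size, regular weight.
PLAIN = CaptionStyle(saturation=0.0, font_scale=1.0, font_weight=400)


def apply_intensity(expressive, intensity):
    """Blend an expressive style toward the plain baseline.

    intensity 0.0 gives plain captions; 1.0 gives the full expressive style.
    """
    t = max(0.0, min(1.0, intensity))

    def lerp(plain_value, styled_value):
        return plain_value + (styled_value - plain_value) * t

    # Snap the blended weight to the nearest multiple of 100.
    weight = int(lerp(PLAIN.font_weight, expressive.font_weight) / 100.0 + 0.5) * 100

    return CaptionStyle(
        saturation=lerp(PLAIN.saturation, expressive.saturation),
        font_scale=lerp(PLAIN.font_scale, expressive.font_scale),
        font_weight=weight,
    )
```

A viewer who prefers to read emotion from faces could set the intensity to 0.0 and get the plain baseline, while a viewer who finds the enhancements useful could raise it toward 1.0.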
Finally, a BBC R&D study on dynamic subtitles — captions positioned near the on-screen speaker rather than in a fixed bar — found that speaker-proximate positioning reduced the time viewers spent reading and increased immersion, with gaze patterns more closely resembling those of viewers watching without captions at all.
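To make speaker-proximate positioning concrete, here is a minimal sketch of one possible placement rule: centre the caption under the speaker, flip it above when there is no room below, and keep it inside the frame. The `Box` type, the `place_caption` function, and the 12-pixel gap are assumptions for illustration; the study summarised above does not prescribe this rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x: float       # left edge, in pixels
    y: float       # top edge, in pixels
    width: float
    height: float


def place_caption(speaker, caption_w, caption_h, frame_w, frame_h, gap=12.0):
    """Return a caption box positioned near the on-screen speaker."""
    # Centre the caption horizontally on the speaker.
    x = speaker.x + speaker.width / 2.0 - caption_w / 2.0
    # Default position: just below the speaker.
    y = speaker.y + speaker.height + gap

    # If the caption would run off the bottom, place it above the speaker.
    if y + caption_h > frame_h:
        y = speaker.y - gap - caption_h

    # Clamp so the whole caption stays inside the frame.
    x = max(0.0, min(x, frame_w - caption_w))
    y = max(0.0, min(y, frame_h - caption_h))
    return Box(x, y, caption_w, caption_h)
```

The speaker box would come from whatever face or speaker tracking is available. A production system would also need to avoid covering faces and to limit how often captions jump between positions, which this sketch does not attempt.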
Interaction Model

Where This Goes Next
This concept opens several directions for further development, each addressing a different dimension of the problem.
Co-designing with DHH communities
The most critical next step is direct involvement of DHH users in the design process. The research literature makes clear that academic studies can identify general preferences, but lived experience surfaces needs that controlled studies miss. A genuine co-design process — involving Deaf users, Hard-of-Hearing users, late-deafened adults, and users with varying levels of residual hearing — would stress-test every design assumption and likely surface solutions that have not yet been considered.
AI-powered emotion detection
The current concept assumes that emotional metadata is applied manually or through post-production tagging. A compelling future direction is the use of machine learning models trained to detect emotional valence and arousal in speech in real time — allowing affective captions to be generated automatically for live broadcasts, video calls, and user-generated content where manual tagging is not feasible. This would dramatically expand the reach of the system beyond professionally produced content.
Extending beyond film and television
The emotional stakes of missing affective cues are arguably even higher in interpersonal communication than in entertainment. Video conferencing platforms, social media video, and real-time communication tools are all spaces where tone carries enormous weight — and where DHH users are currently underserved. The principles developed here could be applied to a Zoom caption system, an Instagram Reels caption layer, or a live event captioning service.
Longitudinal usability research
Short-term user testing can assess preference and immediate legibility, but cannot tell us whether affective captions reduce fatigue over long viewing sessions, whether users adapt and become faster at processing expressive text over time, or whether the system improves comprehension and enjoyment in a measurable way. Long-term research would be essential before any production deployment.

