Designing Emotionally Expressive Captions

Balancing Affective Communication with Accessibility

Enhancing Caption Accessibility

Design Framework for Emotive Captioning Systems

— The Challenge

How might captions communicate tone and emotion — without increasing cognitive load?

Standard closed captions accurately transcribe speech, but they flatten emotional nuance. Tone, sarcasm, urgency, and personality are often reduced to plain text, leaving Deaf and Hard-of-Hearing viewers to infer meaning from incomplete cues.

I explored how typography, positioning, and system-level customization could restore emotional meaning to captions while preserving clarity, legibility, and user control.

"Standard captions fail to describe the personality of a sound. There's a whole world in how something is said — and right now, captions pretend it doesn't exist."

— Takeaways

Learning as I Built

This project began as a research investigation and evolved into a speculative system redesign. I had to move beyond viewing captions as a formatting layer and instead treat them as an interface that shapes participation.

The deeper I went, the clearer the tension became: expressiveness improves emotional clarity — but risks increasing cognitive load. Designing within that tension became the core of the project.

Research

What are people saying? What does the research say?

Design

Designing closed captions, subtitles, and the settings used to configure them on streaming platforms

Implementation

Implementing the design in a mockup for Netflix

Design Process

I wanna go where culture is!

With your work ethic, you should go to city college.

How I Started — Mapping Emotional Dimensions

The design process began with a thorough review of academic literature on affective captioning, speech visualisation, and accessible media design. Several key sources shaped the direction of the project.

Research by de Lacerda Pataca et al. (2024) used the circumplex model of emotion — a framework that maps emotional states across two axes — as a foundation for designing caption styles.

The two axes are:

Valence: how positive or negative an emotion feels (e.g., joy vs. fear)

Arousal: how energetic or calm the emotion is (e.g., rage vs. sadness)

User studies found that colour was the most intuitive typographic variable for communicating valence — warm reds and oranges for negative emotions, cooler greens and blues for positive ones. Size and font weight were most effective for conveying arousal — larger, bolder text read as louder and more intense.
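The mapping described above can be sketched in code. This is a minimal illustration, not the method used in the cited research: the `CaptionStyle` type, the `style_for_emotion` function, the hex colours, and every threshold are hypothetical choices made for the example. It assumes valence and arousal arrive as numbers between -1 and 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptionStyle:
    color: str         # hex colour for the caption text
    font_scale: float  # multiplier applied to the base font size
    font_weight: int   # CSS-style weight: 400 = regular, 700 = bold


def clamp(value: float, low: float, high: float) -> float:
    """Keep a value inside the closed range [low, high]."""
    return max(low, min(high, value))


def style_for_emotion(valence: float, arousal: float) -> CaptionStyle:
    """Map a point on the circumplex (both axes in [-1, 1]) to a caption style.

    All colours and cut-off values are illustrative assumptions.
    """
    valence = clamp(valence, -1.0, 1.0)
    arousal = clamp(arousal, -1.0, 1.0)

    # Valence drives colour: warm for negative, cool for positive.
    if valence < -0.5:
        color = "#E5484D"  # red
    elif valence < -0.1:
        color = "#F2994A"  # orange
    elif valence <= 0.1:
        color = "#FFFFFF"  # neutral: plain white text
    elif valence <= 0.5:
        color = "#6FCF97"  # green
    else:
        color = "#56CCF2"  # blue

    # Arousal drives size and weight: higher energy reads larger and bolder.
    font_scale = round(1.0 + 0.25 * arousal, 2)  # ranges from 0.75 to 1.25
    font_weight = 700 if arousal > 0.3 else 400

    return CaptionStyle(color, font_scale, font_weight)
```

For example, `style_for_emotion(-0.8, 1.0)` (an angry shout) yields red, bold text at 1.25 times the base size, while `style_for_emotion(0.0, 0.0)` falls back to plain white text at the default size and weight.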

However, the same research revealed a critical counterpoint: all four of the top-performing affective caption styles scored significantly lower on ease of reading than a plain, unstyled baseline. Expressive captions are emotionally richer but cognitively heavier.

A 2007 study by Lee et al. added another layer of complexity: Deaf viewers largely disliked graphical caption enhancements, preferring to read emotion from actors' faces and body language, while Hard-of-Hearing viewers found the same enhancements genuinely useful. This divergence made it clear that no single design could serve all users.

Finally, a BBC R&D study on dynamic subtitles — captions positioned near the on-screen speaker rather than in a fixed bar — found that speaker-proximate positioning reduced the time viewers spent reading and increased immersion, with gaze patterns more closely resembling those of viewers watching without captions at all.
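To make speaker-proximate positioning concrete, here is a rough sketch of one way a caption could be placed near a speaker. It is not the BBC's method. It assumes the speaker's on-screen bounding box is already known (for example, from manual tagging), and the `caption_position` function, its parameters, and the 16-pixel margin are all hypothetical.

```python
def caption_position(speaker_box, caption_size, frame_size, margin=16):
    """Return the top-left (x, y) for a caption placed just below a speaker.

    speaker_box  -- (x, y, width, height) of the speaker, in pixels (assumed known)
    caption_size -- (width, height) of the rendered caption, in pixels
    frame_size   -- (width, height) of the video frame, in pixels
    margin       -- gap kept from the speaker and the frame edges (assumed value)
    """
    sx, sy, sw, sh = speaker_box
    cw, ch = caption_size
    fw, fh = frame_size

    # Centre the caption horizontally on the speaker, just below their box.
    x = sx + sw / 2 - cw / 2
    y = sy + sh + margin

    # Clamp so the caption never leaves the visible frame.
    x = max(margin, min(x, fw - cw - margin))
    y = max(margin, min(y, fh - ch - margin))
    return (x, y)
```

With a 1920 by 1080 frame, a speaker box of `(800, 200, 200, 300)` and a 400 by 60 caption, the caption lands at `(700, 516)`, centred under the speaker. A speaker near the bottom-right corner is clamped back inside the frame rather than clipped.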

Interaction Model

Static vs Dynamic vs Adaptive

I explored three potential layers of caption interaction:

Baseline Mode

High-contrast, minimal styling, static positioning.

Expressive Mode

Subtle colour coding and weight modulation to signal emotion.

Adaptive Mode

User-configurable profiles with adjustable intensity, context presets (film, meeting, lecture), and optional dynamic positioning.

The goal wasn’t maximal expressiveness — it was controlled expressiveness.
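One way to picture controlled expressiveness is as a single intensity value that blends between the baseline style and the fully expressive style. The sketch below is an assumption-laden illustration rather than the project's implementation: the `CaptionProfile` type, the preset intensities for film, meeting, and lecture, and the fixed 0.5 intensity for Expressive Mode are all invented for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical default intensities per context preset
# (0.0 = baseline styling, 1.0 = fully expressive styling).
CONTEXT_PRESETS = {"film": 0.8, "meeting": 0.4, "lecture": 0.2}


@dataclass
class CaptionProfile:
    mode: str = "baseline"             # "baseline", "expressive", or "adaptive"
    context: str = "film"              # key into CONTEXT_PRESETS
    intensity: Optional[float] = None  # user override, used in adaptive mode
    dynamic_positioning: bool = False  # opt-in speaker-proximate placement

    def effective_intensity(self) -> float:
        """Resolve how strongly expressive styling should be applied."""
        if self.mode == "baseline":
            return 0.0
        if self.mode == "expressive":
            return 0.5  # assumed fixed, deliberately subtle level
        # Adaptive: the user's own setting wins, else the context preset.
        if self.intensity is not None:
            value = self.intensity
        else:
            value = CONTEXT_PRESETS[self.context]
        return max(0.0, min(1.0, value))


def scaled_font_size(base_px: float, expressive_scale: float,
                     profile: CaptionProfile) -> float:
    """Blend between the baseline size and the fully expressive size."""
    t = profile.effective_intensity()
    return base_px * (1.0 + t * (expressive_scale - 1.0))
```

A default `CaptionProfile()` leaves a 20 px caption at 20 px no matter how emotional the line is, while `CaptionProfile(mode="adaptive", intensity=1.0)` lets a line tagged with a 1.5 scale grow to 30 px. The user, not the system, decides where on that range they sit.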

“Every day I wake up and hope you're de*d!”

Dynamic Persons

Size

Background

Compact

Standard

Large

X-Large

Subtitle style for Main Characters

CC Style Setting

- Standard

- Emotional Base

- Custom Style

- High Contrast

— Future Considerations

Where This Goes Next

This concept opens several directions for further development, each addressing a different dimension of the problem.

Co-designing with DHH communities

The most critical next step is direct involvement of DHH users in the design process. The research literature makes clear that academic studies can identify general preferences, but lived experience surfaces needs that controlled studies miss. A genuine co-design process — involving Deaf users, Hard-of-Hearing users, late-deafened adults, and users with varying levels of residual hearing — would stress-test every design assumption and likely surface solutions that have not yet been considered.

AI-powered emotion detection

The current concept assumes that emotional metadata is applied manually or through post-production tagging. A compelling future direction is the use of machine learning models trained to detect emotional valence and arousal in speech in real time — allowing affective captions to be generated automatically for live broadcasts, video calls, and user-generated content where manual tagging is not feasible. This would dramatically expand the reach of the system beyond professionally produced content.

Extending beyond film and television

The emotional stakes of missing affective cues are arguably even higher in interpersonal communication than in entertainment. Video conferencing platforms, social media video, and real-time communication tools are all spaces where tone carries enormous weight — and where DHH users are currently underserved. The principles developed here could be applied to a Zoom caption system, an Instagram Reels caption layer, or a live event captioning service.

Longitudinal usability research

Short-term user testing can assess preference and immediate legibility, but cannot tell us whether affective captions reduce fatigue over long viewing sessions, whether users adapt and become faster at processing expressive text over time, or whether the system improves comprehension and enjoyment in a measurable way. Long-term research would be essential before any production deployment.