3.4 Findings of Study 1
To varying degrees, interviewees related facing hardships when using automatic captioning systems to communicate with hearing peers. While some issues stem from the current limitations of automatic speech recognition software, the lack of any depiction of prosody and emotions emerged as a cause of captions’ dull ambiguity. Because interviewees face this on a nearly daily basis, communication becomes an uphill battle, exacting significant cognitive and emotional tolls.
Strategies to alleviate these ambiguities are diverse and include reliance on multimodal signals such as facial expressions, body language, and general engagement. Interviewees indicated that using these cues is not a straightforward, lossless process, and they were therefore favorable towards the promise of captions that depict prosody and emotions. There was nuance to this preference: given how multifaceted these features can be and how needs vary across settings, different contexts call for different solutions. Where needed, quotes were edited for clarity and conciseness.
3.4.1 Theme 1: Captions’ dull ambiguity.
Captions’ imperfections are felt in different ways. Automatic speech recognition capabilities in live captions have gotten better in recent years, but they still leave a lot to be desired. Agatha: ‘Sometimes it’s really, really, slow. Someone speaks, and when a few seconds later the caption finally appears the speaker is already on the next topic.’ Eliah: ‘Often, when people speak with accents captions will have a lot of mistakes.’ Otto: ‘They’re horrible, missing context and words. It takes a lot of work to understand exactly what they’re saying.’
Beyond how latency and imprecision can make the linguistic content hard to understand, there are also consequences related to how a shift in mood can go unnoticed. Alex tells us of a time when there was a quick shift from a casual to a serious topic that wasn’t apparent right away, leading to them ‘jumping in at the wrong time and causing my hearing colleagues to look down at me like I’m bad at reading people, which I’m not.’
Human-made transcriptions of pre-recorded media allow for greater accuracy, but even when the written words perfectly match their spoken counterparts, something still seems missing. A common occurrence is failing to understand whether comments are serious or not. Alex says that since ‘comedy relies heavily on tone, hearing people can understand immediately when something is a joke, but my friend, who is also dhh, has a hard time because they’re missing that tone.’ Erin tells of a time when ‘someone was telling a story that had a specific inside joke, and I had no idea what was going on because it was connected to the tone.’ Otto: ‘Sometimes I’ll realize it’s a joke after they look at me and ask whether I understood it, and I was like “oh, I thought it was serious because the captions seemed serious.” ’
Participants complained about the monotonous, droning quality of inexpressive captions. Alex says that because of this they find it hard to focus on captions: ‘I can find it easy to zone out because speech is not really... emphasized?’ Erin finds the contrast between captions and signing hard since she ‘grew up accustomed to some use of body language, so it is hard to just watch and read captions all of the time.’ Otto: ‘Facial and body language will show a lot of context, while captions are bland.’
All of this gives captioned speech an unapproachable ambiguity that disproportionately affects dhh individuals. This is particularly true with dimensions of communication that are already inherently ambiguous, such as moods and emotions. Otto thinks that this disconnection is analogous to texting, which ‘tends to be devoid of emotion. It’s better to interact with the person, to see their real raw emotion, while texting hides it, making it hard to be emotionally transparent.’ For Agatha, when reading captions she tends to miss ‘meaning or feeling behind the words.’ To deal with this, she usually has ‘to read the full paragraph of what was said or have the picture of the speaker’s face, but even then there’s a delay in understanding.’
3.4.2 Theme 2: Communication as an uphill battle.
Working and studying among hearing peers, our interviewees relate recurrent feelings of isolation. The frequent shortcomings of captioning systems fall almost exclusively on their shoulders, leaving them forced to either speak up or face missing out on what’s going on. Ira told us of how in meetings her peers can at times urge her to ‘use captioning right away, but I feel awkward because I’m the only person using them. Sometimes I will miss something and feel awkward to ask hearing group mates to repeat themselves; it just feels weird.’
Sometimes, it’s only when they later read a meeting’s transcript that what was said becomes clear. Agatha: ‘I later understood, but I had to go back and read the transcript to fully understand.’ Eliah: ‘It’s nice that live captions’ transcriptions can be saved as a transcript so I can catch up to what was said.’
For some, this distance from peers has become naturalized. Erin: ‘I am curious that I don’t know what’s happening and I just have to wait there. I know that I am frustrated but at the same time I know that I have to collaborate. I can’t expect it to be easy to communicate all of the time.’ She later adds ‘I tend to accept it because of work. Every weekend, they talk about parties and I accept that I am not part of that conversation and just leave it.’
Some environments are more welcoming than others. Otto’s manager makes a point of checking how their captions are coming out, saying ‘ “lemme check the captions... Oh no, I didn’t mean that,” and then repeating themselves until the captions are accurate.’ Eliah’s boss writes them a summary of what is being discussed because even with an interpreter and captions ‘there’s a lot of overlap and I can’t really catch the specifics.’
The flip side is that dhh persons depend on the sometimes-lacking goodwill of their hearing colleagues to be included in the conversation. Ada: ‘Often, my coworkers forget that I need a good environment before I can understand, so they’ll be having a conversation with background noise, or not looking at me. I’ll still try, but I’ll feel alone and left out.’ Irene also faces issues with her coworkers’ carelessness about how their environment can impact accessibility. When she raises the issue to leadership, they might try to do something, ‘but the other members of the group are not as willing and, especially since covid, have reached their limits.’ Otto: ‘I try to be assertive, trying to talk to them, but even if I type in the chat some hearing people don’t know how to use it or just ignore it and keep talking anyway. That means I can’t do much about it.’ When she does intercede, Agatha feels that ‘with the captions, I’m delayed, so if I had questions I need to ask them to go back on the conversation. I feel like it’s annoying to my boss.’
The emotional ambiguity of captions heightens these feelings of isolation. Eliah: ‘With a large team, it’s hard to see their faces and I usually depend on captions. I don’t know their emotions, and I feel like I’m not there, not connected with them.’ Otto feels that the missing emotional representation always impacts him: ‘In general communication, I can’t really participate fully. The discussion can be work-related but there’s also another discussion that’s humorous, and I wouldn’t understand. Most of them are laughing and I’m left out, unsure whether they’re actually joking or not.’
3.4.3 Theme 3: Reliance on multimodal signals.
Interviewees related being very attuned to how people communicate with their facial expressions, body language, and general engagement. This, some said, is a way to tackle the shortcomings of captioning. Ira: ‘When people are talking I can look and figure out what their thoughts are based on their behavior. With masks, I sometimes miss out on information, so I’ll look at their eyebrows or eyes, but it’s hard.’ Alex says that to gauge mood or emotion they ‘have to look up from the captions at their expressions, body language, and how they react so I can tell what they mean,’ although ‘that doesn’t mean I capture all the information.’ In describing what makes a speaker’s emotions easy to identify, Erin says that it comes out to ‘body language; how they’re shifting in their seat, how they’re moving, their facial expressions, and mouth movement.’
Cultural differences come into play here. Erin: ‘Here in America you can definitely identify it easily, but in other countries, it’s challenging.’ Irene: ‘It is very hard to understand hearing people’s body language and tone, especially through the computer. They tend to sit very relaxed with their hands on their face, or look neutral, while Deaf people are extremely expressive and clear.’
Technology adds to the complexity of navigating this mosaic of affective signals, as reflected in Agatha’s comment that, ‘with captions, sometimes I miss the facial expressions or emotions behind the words.’ The delay in captions makes Ada struggle to listen and read at the same time: ‘it’s really hard: I have a choice of either listening to the person or reading the captions, but trying both simultaneously takes more work and won’t help me.’
3.4.4 Theme 4: Different contexts call for different solutions.
Having introduced to the interviewees the idea that a speaker’s tone of voice can serve both to convey different emotions and to emphasize certain words through contrastive focus, we asked them what they thought would be most important to represent in captions. Answers varied and were tied to what interviewees felt was appropriate for different types of meetings.
Some, such as Otto, claimed that while both dimensions are important, in work environments one should prioritize prosody, ‘because I need to understand information better, to pay attention to which word is important. Emotion is important, yes, but I’d rather hold off on that because it’s more suitable for general communication.’ Eliah echoed this: ‘We don’t need to depend on mood because we’re here for work. The working environment usually has a lot of discussions so it’s important to have emphasis so that we can be involved, discuss more, and ask more questions as deaf people.’
Others were undecided. Ada, for instance, said that while prosody would generally be more important for her, when her hearing is fatigued, ‘I no longer can figure out valence myself, so it would then become the more important one.’ Alex: ‘I think both should be included. Valence can show emotion, but not what’s important; prosody emphasizes what’s important, but not emotion, so how would I know?’
Others preferred the representation of emotions. In Erin’s case, the choice between prosody or valence was almost a tie, but ‘emphasis I can figure out, while emotion is really nice to have on the screen so that I know what is going on.’ Ira: ‘emotion is more important since it helps to visualize the full picture, which deaf people usually miss, while emphasis is just for a specific word.’ For Agatha, ‘emotions add more depth to words,’ and are thus more important to be visualized.
3.4.5 Design recommendations from the pilot study.
Reactions to the design prototypes shown after the interviews were mixed. There was an appreciation of the ideas explored, but not exactly their execution. This issue arose particularly when there was a perceived mismatch between what emotions/emphasis the captions were denoting and what participants were seeing from facial expressions in the video. Eliah: ‘The woman on the video was showing distinct facial expressions, there wasn’t much change in the border of the first design [Figure 2a], but then later on when she wasn’t showing much the border became pink or blue.’
The imprecise alignment between words and sounds did not go unnoticed. Irene: ‘I would see the speaker take a breath but there was no break in the captions.’ The display of loudness also seemed misaligned. Ira: ‘I liked the idea in the second design of bolding some words for emphasis, but it didn’t seem to match the sentences.’
Legibility was a major concern, with six of the eight interviewees mentioning it. Some of this could be related to the colors used. Erin: ‘you get tired of reading, and then the colors start to change, it is confusing to try to understand the tone.’ She also mentioned having some degree of colorblindness, which made matters worse. The fonts used were also a source of concern. Ira: ‘It was too busy. The font and color changes made it hard to read and look at the person’s emotion.’
Some participants did not notice the border changes in the first design, and some that did found it distracting. Erin: ‘The border was awful. Its constant motion would give me a headache.’ Alex: ‘Zoom or Microsoft Teams already have a border around whoever is talking, so if you add an additional one tied to the captions it’ll be extremely distracting.’ For Eliah, inversely, the border, which reminded him of a similar device used in the video-game ‘The Sims,’ was functional precisely because it didn’t get in the way: ‘I liked how the color change represented mood while staying out of the view of the captions.’
Reactions to the typographic designs (designs two and three) were mixed. Agatha: ‘I wish the third design [Figure 2c] had an easier font to read but I enjoyed the changing fonts because it helped to show the emotions.’ Ada: ‘The best thing about the second design [Figure 2b] was maybe the change in font thickness, whether it’s thin or thick to show the emphasis, I think that was helpful.’ Ira: ‘Seeing the caption change color was interesting because it helped me separate one sentence from another, while also helping me understand how the person is saying specific phrases.’
3.5 Discussion of Study 1
The first goal of the study was to find out in what ways dhh individuals experience the absence of prosodic and emotional depictions in computer-generated captions, as are used in meetings with hearing peers (rq1.a). Our interviewees discussed the many dimensions in which speech accessibility solutions can fail them. Captions, in particular, have many shortcomings. Some come from known limits of current automatic speech recognition systems, which negatively affect dhh individuals’ experience of captioning systems [42, 52], and include high latency and difficulty with non-‘standard’ accents [1, 27].
Beyond these failings, however, we found that captions’ depictions of words are felt as lacking something, leaving out meaningful dimensions of speech. These elements are present acoustically, so their absence creates barriers for dhh individuals. Missing a shift in tone from a serious to a humorous conversation, for instance, was a frequent complaint — and an expected one, given that humor has prosodic markings [5].
Our interviewees deal with these challenges in a myriad of ways, but the strategies employed are not perfect. Reading and interpreting text perceived as dull has an additional cognitive toll, and is commonly thought of as boring. This finding agrees with studies that show that emotional stimuli draw more attention and are better remembered than neutral counterparts [45], an effect that extends to written text [41].
All of these issues leave interviewees feeling as if not part of the group when participating in meetings with their hearing peers. This is such a common occurrence that some have naturalized it as being an inherent aspect of such meetings, rather than a consequence of how their underlying technologies have been designed.
Our second goal was to understand how these strategies and experiences could inform the design of new captioning systems that depict prosody and/or emotions (rq1.b). While participants agreed that including some non-textual dimensions of speech could help alleviate the ambiguities of asr-generated captions, they diverged as to which of these dimensions would be most helpful: emotional cues, prosodic cues, or both. A follow-up study investigating what these captions could look like could thus face a design space too vast to explore. A plausible alternative, then, was to first evaluate which non-textual dimensions are most effective at alleviating the communication issues that emerged from the interviews of Study 1, thus allowing future studies a narrower research scope while still measuring whether these expanded captions can help dhh individuals identify paralinguistic dimensions in speech. While a ‘good enough’ design style for the captions may be sufficient for the purposes of this ‘what dimensions’ study, its parameters must still be carefully considered. See subsection 4.1.2 for a detailed discussion of our approach to tackling this issue in Study 2.
In discussing the prototypes shown, responses reflected a diverse set of preferences, allowing us to draw some high-level recommendations: (a) Legibility is a notable concern, even when participants felt prosody and emotions were being well represented; (b) Even though participants generally complement their understanding of captions with a multimodal apprehension of other signals, such as facial expressions and body language, peripheral visual elements used to represent prosody or emotions run the risk of being ignored.