DOI: 10.1145/3517428.3544808

Beyond Subtitles: Captioning and Visualizing Non-speech Sounds to Improve Accessibility of User-Generated Videos

Published: 22 October 2022

Abstract

Captioning provides access to sounds in audio-visual content for people who are Deaf or Hard-of-hearing (DHH). As user-generated content in online videos grows in prevalence, researchers have explored using automatic speech recognition (ASR) to automate captioning. However, definitions of captions (as compared to subtitles) include non-speech sounds, which ASR typically does not capture as it focuses on speech. Thus, we explore DHH viewers’ and hearing video creators’ perspectives on captioning non-speech sounds in user-generated online videos using text or graphics. Formative interviews with 11 DHH participants informed the design and implementation of a prototype interface for authoring text-based and graphic captions using automatic sound event detection, which was then evaluated with 10 hearing video creators. Our findings include identifying DHH viewers’ interests in having important non-speech sounds included in captions, as well as various criteria for sound selection and the appropriateness of text-based versus graphic captions of non-speech sounds. Our findings also include hearing creators’ requirements for automatic tools to assist them in captioning non-speech sounds.
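The abstract refers to automatic sound event detection as the basis of the captioning prototype, and the supplementary material below includes an SRT subtitle file. The paper does not publish code, so the following is only an illustrative sketch under assumptions of our own: it takes non-speech sound events (a label plus start and end times, e.g. as produced by an off-the-shelf audio-tagging model, which is not shown) and serializes them as bracketed non-speech captions in SRT format. The SoundEvent record, the events_to_srt function, and the example detections are hypothetical and do not reproduce the authors' prototype.

# Illustrative sketch only: turn hypothetical non-speech sound events into SRT cues.
# The events below are hand-written examples; in practice they might come from an
# off-the-shelf audio-tagging / sound-event-detection model (not shown here).
from dataclasses import dataclass

@dataclass
class SoundEvent:
    label: str      # human-readable sound description, e.g. "dog barking"
    start: float    # event start time in seconds
    end: float      # event end time in seconds

def _srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    h, rem = divmod(millis, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def events_to_srt(events: list[SoundEvent]) -> str:
    """Render sound events as numbered SRT cues with bracketed non-speech captions."""
    cues = []
    for i, ev in enumerate(sorted(events, key=lambda e: e.start), start=1):
        cues.append(
            f"{i}\n{_srt_timestamp(ev.start)} --> {_srt_timestamp(ev.end)}\n[{ev.label}]\n"
        )
    return "\n".join(cues)

if __name__ == "__main__":
    # Hypothetical detections for a short user-generated clip.
    detected = [
        SoundEvent("applause", 12.0, 14.5),
        SoundEvent("dog barking", 3.2, 4.1),
    ]
    print(events_to_srt(detected))

The bracketed labels (e.g. "[dog barking]") follow the common closed-captioning convention for describing non-speech sounds in text; graphic captions of the kind studied in the paper would require a separate rendering layer and are not sketched here.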

Supplementary Material

This electronic appendix contains 6 files:

1. "Formative Interview Study Questionnaire.pdf": PDF file containing the questionnaire used in our formative interview study, as described in section 4.1 of the paper.
2. "Video Stimuli.pdf": PDF file containing details about our video stimuli, as described in sections 4.1 and 6.2 of the paper.
3. "Prototype Demo.mp4": MP4 file containing a video demonstration of our prototype, as described in section 6.1 of the paper.
4. "Prototype Demo.srt": subtitle file for the video demonstration of our prototype.
5. "Prototype Demo with Open Captions.mp4": MP4 file identical to "Prototype Demo.mp4", but with the captions burned into the video.
6. "Prototype Study Questionnaire.pdf": PDF file containing the questionnaire used in our prototype study, as described in section 6.2 of the paper.




      Information

      Published In

      ASSETS '22: Proceedings of the 24th International ACM SIGACCESS Conference on Computers and Accessibility
      October 2022
      902 pages
      ISBN:9781450392587
      DOI:10.1145/3517428
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 22 October 2022


      Author Tags

      1. accessibility
      2. audio tagging
      3. automatic captions
      4. non-speech sounds

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      ASSETS '22

      Acceptance Rates

ASSETS '22 Paper Acceptance Rate: 35 of 132 submissions, 27%
Overall Acceptance Rate: 436 of 1,556 submissions, 28%




      Article Metrics

• Downloads (Last 12 months): 218
• Downloads (Last 6 weeks): 20
      Reflects downloads up to 15 Feb 2025

      Cited By

• (2025) Visualizing speech styles in captions for deaf and hard-of-hearing viewers. International Journal of Human-Computer Studies, 194:C. https://doi.org/10.1016/j.ijhcs.2024.103386. Online publication date: 1-Feb-2025.
• (2024) Language Context in the Future of Television and Video Industry: Exploring Trends and Opportunities. The Future of Television and Video Industry. https://doi.org/10.5772/intechopen.113309. Online publication date: 12-Jun-2024.
• (2024) Digital accessibility in the era of artificial intelligence - Bibliometric analysis and systematic review. Frontiers in Artificial Intelligence, 7. https://doi.org/10.3389/frai.2024.1349668. Online publication date: 16-Feb-2024.
• (2024) Towards a Rich Format for Closed-Captioning. Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility, 1-5. https://doi.org/10.1145/3663548.3688504. Online publication date: 27-Oct-2024.
• (2024) Record, Transcribe, Share: An Accessible Open-Source Video Platform for Deaf and Hard of Hearing Viewers. Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility, 1-6. https://doi.org/10.1145/3663548.3688495. Online publication date: 27-Oct-2024.
• (2024) SoundModVR: Sound Modifications in Virtual Reality to Support People who are Deaf and Hard of Hearing. Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility, 1-15. https://doi.org/10.1145/3663548.3675653. Online publication date: 27-Oct-2024.
• (2024) Envisioning Collective Communication Access: A Theoretically-Grounded Review of Captioning Literature from 2013-2023. Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility, 1-18. https://doi.org/10.1145/3663548.3675649. Online publication date: 27-Oct-2024.
• (2024) “Caption It in an Accessible Way That Is Also Enjoyable”: Characterizing User-Driven Captioning Practices on TikTok. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1-16. https://doi.org/10.1145/3613904.3642177. Online publication date: 11-May-2024.
• (2024) Unspoken Sound: Identifying Trends in Non-Speech Audio Captioning on YouTube. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1-19. https://doi.org/10.1145/3613904.3642162. Online publication date: 11-May-2024.
• (2024) EmoWear: Exploring Emotional Teasers for Voice Message Interaction on Smartwatches. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1-16. https://doi.org/10.1145/3613904.3642101. Online publication date: 11-May-2024.
