DOI: 10.1145/3544549.3585609 | CHI Conference Proceedings
Work in Progress

Modeling and Improving Text Stability in Live Captions

Published: 19 April 2023

ABSTRACT

In recent years, live captions have gained significant popularity through their availability in remote video conferencing, mobile applications, and the web. Unlike preprocessed subtitles, live captions must respond in real time by showing interim speech-to-text results. As prediction confidence changes, the displayed captions may update, causing visual instability that interferes with the viewer's experience. In this paper, we characterize the stability of live captions by proposing a vision-based flickering metric based on luminance contrast and the Discrete Fourier Transform. Additionally, we assess the effect of unstable captions on viewers through task load index surveys. Our analysis reveals significant correlations between viewer experience and our proposed quantitative metric. To enhance the stability of live captions without compromising responsiveness, we propose tokenized alignment, word updates with semantic similarity, and smooth animation. Results from a crowdsourced study (N=123) comparing four strategies indicate that our stabilization algorithms significantly reduce viewer distraction and fatigue while increasing reading comfort.
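A vision-based flickering metric of the kind described in the abstract can be sketched in a few lines: treat the caption region as a stack of grayscale frames, measure frame-to-frame luminance change, and apply the Discrete Fourier Transform to that temporal signal. The sketch below is illustrative only, not the paper's exact formulation; the function name `flicker_score`, the mean-absolute-difference contrast measure, and the 1 Hz cutoff are all our assumptions.

```python
import numpy as np

def flicker_score(frames, fps=30.0, min_hz=1.0):
    """Estimate caption flicker from a stack of grayscale frames.

    frames: array of shape (T, H, W), luminance values in [0, 1].
    Returns the fraction of temporal-signal energy at or above
    `min_hz`, so a perfectly stable caption scores 0.0 and a
    rapidly updating one scores closer to 1.0.
    """
    frames = np.asarray(frames, dtype=float)
    # Per-frame luminance change: mean absolute difference against the
    # previous frame, a crude stand-in for perceived instability.
    diffs = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))
    if diffs.size == 0 or not np.any(diffs):
        return 0.0  # no change at all: perfectly stable
    # Discrete Fourier Transform of the mean-centered change signal.
    spectrum = np.abs(np.fft.rfft(diffs - diffs.mean()))
    freqs = np.fft.rfftfreq(diffs.size, d=1.0 / fps)
    total = spectrum.sum()
    if total == 0:
        return 0.0
    # Flicker = share of spectral energy in the "visible update" band.
    return float(spectrum[freqs >= min_hz].sum() / total)
```

In practice one would crop the rendered caption area from a screen recording and feed its luminance channel to a function like this; a static caption yields a score near 0, while a caption whose words are repeatedly rewritten concentrates energy at higher temporal frequencies and scores close to 1.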


Supplemental Material

• 3544549.3585609-talk-video.mp4 (mp4, 13 MB)
• 3544549.3585609-video-preview.mp4 (mp4, 53 MB)


Published in

CHI EA '23: Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems
April 2023, 3914 pages
ISBN: 9781450394222
DOI: 10.1145/3544549

Copyright © 2023 Owner/Author. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher: Association for Computing Machinery, New York, NY, United States


      Qualifiers

      • Work in Progress
      • Research
      • Refereed limited

      Acceptance Rates

Overall acceptance rate: 6,164 of 23,696 submissions, 26%

