DOI: 10.1145/3544549.3585609 | CHI Conference Proceedings
Work in Progress

Modeling and Improving Text Stability in Live Captions

Published: 19 April 2023

ABSTRACT

In recent years, live captions have gained significant popularity through their availability in remote video conferencing, mobile applications, and the web. Unlike preprocessed subtitles, live captions must respond in real time by showing interim speech-to-text results. As prediction confidence changes, the displayed captions may update, causing visual instability that interferes with the viewer's experience. In this paper, we characterize the stability of live captions by proposing a vision-based flickering metric based on luminance contrast and the Discrete Fourier Transform. Additionally, we assess the effect of unstable captions on viewers through task load index surveys. Our analysis reveals significant correlations between viewer experience and our proposed quantitative metric. To enhance the stability of live captions without compromising responsiveness, we propose tokenized alignment, word updates with semantic similarity, and smooth animation. Results from a crowdsourced study (N=123) comparing four strategies indicate that our stabilization algorithms significantly reduce viewer distraction and fatigue while increasing reading comfort.
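A vision-based flickering metric of the kind described in the abstract can be sketched in a few lines: treat the caption region as a stack of grayscale frames, measure frame-to-frame luminance change, and apply the Discrete Fourier Transform to that temporal signal. The sketch below is illustrative only, not the paper's exact formulation; the function name `flicker_score`, the mean-absolute-difference contrast measure, and the 1 Hz cutoff are all our assumptions.

```python
import numpy as np

def flicker_score(frames, fps=30.0, min_hz=1.0):
    """Estimate caption flicker from a stack of grayscale frames.

    frames: array of shape (T, H, W), luminance values in [0, 1].
    Returns the fraction of temporal-signal energy at or above
    `min_hz`, so a perfectly stable caption scores 0.0 and a
    rapidly updating one scores closer to 1.0.
    """
    frames = np.asarray(frames, dtype=float)
    # Per-frame luminance change: mean absolute difference against the
    # previous frame, a crude stand-in for perceived instability.
    diffs = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))
    if diffs.size == 0 or not np.any(diffs):
        return 0.0  # no change at all: perfectly stable
    # Discrete Fourier Transform of the mean-centered change signal.
    spectrum = np.abs(np.fft.rfft(diffs - diffs.mean()))
    freqs = np.fft.rfftfreq(diffs.size, d=1.0 / fps)
    total = spectrum.sum()
    if total == 0:
        return 0.0
    # Flicker = share of spectral energy in the "visible update" band.
    return float(spectrum[freqs >= min_hz].sum() / total)
```

In practice one would crop the rendered caption area from a screen recording and feed its luminance channel to a function like this; a static caption yields a score near 0, while a caption whose words are repeatedly rewritten concentrates energy at higher temporal frequencies and scores close to 1.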


Supplemental Material

• 3544549.3585609-talk-video.mp4 (mp4, 13 MB)
• 3544549.3585609-video-preview.mp4 (mp4, 53 MB)


Published in

CHI EA '23: Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems
April 2023, 3914 pages
ISBN: 9781450394222
DOI: 10.1145/3544549

Copyright © 2023 Owner/Author. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher: Association for Computing Machinery, New York, NY, United States


      Qualifiers

      • Work in Progress
      • Research
      • Refereed limited

      Acceptance Rates

Overall acceptance rate: 6,164 of 23,696 submissions, 26%

