ABSTRACT
In recent years, live captions have gained significant popularity through their availability in remote video conferencing, mobile applications, and the web. Unlike preprocessed subtitles, live captions must respond in real time by showing interim speech-to-text results. As prediction confidence changes, the captions may update, causing visual instability that interferes with the viewer's experience. In this paper, we characterize the stability of live captions by proposing a vision-based flicker metric based on luminance contrast and the Discrete Fourier Transform. Additionally, we assess the effect of unstable captions on viewers through task load index surveys. Our analysis reveals significant correlations between viewer experience and our proposed quantitative metric. To enhance the stability of live captions without compromising responsiveness, we propose tokenized alignment, word updates gated by semantic similarity, and smooth animation. Results from a crowdsourced study (N=123) comparing four strategies indicate that our stabilization algorithms significantly reduce viewer distraction and fatigue while increasing reading comfort.
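The flicker metric described above can be approximated in a few lines: compute per-frame luminance of the caption region, take the frame-to-frame contrast signal, and measure its energy in a mid-frequency band of the DFT. This is a minimal illustrative sketch, not the paper's implementation; the luma weights, frame rate, and frequency band chosen here are assumptions.

```python
import numpy as np

def flicker_score(frames, fps=30.0, band=(3.0, 15.0)):
    """Illustrative flicker metric for a caption region.

    frames: array of shape (T, H, W, 3), RGB values in [0, 1].
    Returns the fraction of spectral energy of the frame-to-frame
    luminance-contrast signal that falls in the given band (Hz).
    """
    frames = np.asarray(frames, dtype=np.float64)
    # Rec. 709 luma approximation for per-pixel luminance (assumed weighting).
    luma = frames @ np.array([0.2126, 0.7152, 0.0722])
    # Mean absolute luminance change between consecutive frames.
    contrast = np.abs(np.diff(luma, axis=0)).mean(axis=(1, 2))
    # DFT of the temporal contrast signal.
    spectrum = np.abs(np.fft.rfft(contrast))
    freqs = np.fft.rfftfreq(contrast.shape[0], d=1.0 / fps)
    # Energy in the "flicker" band relative to total energy.
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    total = spectrum.sum()
    return float(spectrum[in_band].sum() / total) if total > 0 else 0.0
```

A perfectly stable caption yields a score of 0, while captions whose text region rewrites every few frames concentrate energy in the flicker band and score higher.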
Footnotes
1 WordNetLemmatizer: https://www.nltk.org/_modules/nltk/stem/wordnet.html
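The tokenized-alignment strategy from the abstract can be sketched as aligning each interim hypothesis against the previous one and keeping the prior word when the replacement is only a minor revision. This illustrative version substitutes a character-level ratio for the paper's semantic similarity and for the lemma normalization of footnote 1; the function name and threshold are assumptions.

```python
import difflib

def stable_update(prev_tokens, new_tokens, sim_threshold=0.8):
    """Align an interim caption hypothesis to the previous one and
    suppress word updates that are only minor revisions."""
    out = []
    sm = difflib.SequenceMatcher(a=prev_tokens, b=new_tokens)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            # Unchanged words: keep the displayed tokens as-is.
            out.extend(prev_tokens[i1:i2])
        elif op == "replace":
            # Pair replaced tokens positionally; keep the old word if the
            # new one is a near-duplicate (crude stand-in for semantic
            # similarity, which would use embeddings in practice).
            olds, news = prev_tokens[i1:i2], new_tokens[j1:j2]
            for k, new in enumerate(news):
                old = olds[k] if k < len(olds) else None
                if old and difflib.SequenceMatcher(
                    a=old.lower(), b=new.lower()
                ).ratio() >= sim_threshold:
                    out.append(old)
                else:
                    out.append(new)
        else:  # "insert" / "delete": accept the new hypothesis
            out.extend(new_tokens[j1:j2])
    return out
```

For example, if the recognizer revises "wold" to "world", the near-duplicate is held back to avoid a visible flicker, while genuinely new words (e.g. an appended "today") are still shown immediately.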