DOI: 10.1145/3544549.3585724
Work in Progress

Accuracy of AI-generated Captions With Collaborative Manual Corrections in Real-Time

Published: 19 April 2023

Abstract

Automatic Speech Recognition (ASR) is a cost-efficient and scalable tool for automating real-time captioning. Although its overall quality has improved rapidly, generated transcripts can still be inaccurate. Manual correction increases transcription accuracy, but it introduces new real-time challenges, especially for live streaming. Crowd-sourcing can make this high workload more manageable by distributing the work across multiple individuals. In this paper, we present a prototype that enables humans to collaboratively correct AI-generated captions in real time. We conducted an experiment with 40 participants to measure the accuracy of the automatically generated and manually corrected captions. The results show that manual corrections improved overall text accuracy across multiple metrics, a finding supported by our qualitative analysis.
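As an illustrative sketch (not code from the paper, whose exact metric implementation is not specified here), caption accuracy is commonly scored with word error rate (WER): the word-level edit distance between a reference transcript and an ASR hypothesis, divided by the reference length.

```python
# Hypothetical example: word error rate (WER) via word-level
# Levenshtein distance, WER = (subs + dels + ins) / reference length.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit-distance table over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# Invented transcripts for illustration: two substitutions and one
# deletion over 8 reference words give WER = 3/8 = 0.375.
reference = "the speaker described the results of the experiment"
raw_asr   = "the speaker describe the result of experiment"
print(wer(reference, raw_asr))  # 0.375
```

Related measures such as match error rate (MER) and word information lost (WIL) are built from the same alignment counts; libraries like jiwer package all three behind a single call.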

Supplementary Material

MP4 File (3544549.3585724-video-preview.mp4)
Video Preview
MP4 File (3544549.3585724-talk-video.mp4)
Pre-recorded Video Presentation


Cited By

  • (2024) Record, Transcribe, Share: An Accessible Open-Source Video Platform for Deaf and Hard of Hearing Viewers. Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility, 1–6. https://doi.org/10.1145/3663548.3688495
  • (2024) Conversational AI for Students with Hearing Disabilities: Approach to the Text Quality Evaluation. 2024 International Conference Automatics and Informatics (ICAI), 130–135. https://doi.org/10.1109/ICAI63388.2024.10851526
  • (2023) The influence of sociodemographic factors on students' attitudes toward AI-generated video content creation. Smart Learning Environments 10:1. https://doi.org/10.1186/s40561-023-00276-4

Published In

CHI EA '23: Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems
April 2023, 3914 pages
ISBN: 9781450394222
DOI: 10.1145/3544549
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery, New York, NY, United States


        Author Tags

        1. automatic speech recognition
        2. captioning
        3. crowd-sourcing
        4. real-time
5. subtitles

        Qualifiers

        • Work in progress
        • Research
        • Refereed limited

Conference

CHI '23

        Acceptance Rates

        Overall Acceptance Rate 6,164 of 23,696 submissions, 26%

