DOI: 10.1145/3452918.3458792

Evaluating AI assisted subtitling

Published: 23 June 2021

Abstract

Recent advances in artificial intelligence (AI) have led to an increased focus on automating media production. One relevant application area for AI is using speech recognition to create subtitles and closed captions for videos. Machine-learning-based AI methods are, however, still not reliable enough to produce perfect, or even acceptable, subtitles on their own. To compensate for this unreliability, AI can be used to build tools that support, rather than replace, human effort, creating semi-automated workflows. In this paper, we present a prototype that integrates automated speech recognition for subtitling into an existing production-grade video editing tool. In an experiment with 25 participants, we tested the efficiency and effectiveness of this tool compared to a fully manual process. The results show a significant increase in both effectiveness and efficiency for subtitling novices. At the same time, participants found the augmented process more demanding. We identify usability issues and design choices that pertain to making augmented subtitling easier.
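
As a concrete illustration of the semi-automated workflow the abstract describes: an AI-assisted subtitling pipeline typically takes time-stamped words from a speech recognizer, groups them into subtitle cues, and hands the result to a human editor for correction. The Python sketch below is a minimal, hypothetical version of the grouping step; the Word type, the greedy chunking rule, and the limits (37 characters and 6 seconds per cue) are illustrative assumptions, not the prototype evaluated in the paper.

```python
from dataclasses import dataclass

@dataclass
class Word:
    """One recognized word with start/end times in seconds (hypothetical ASR output)."""
    text: str
    start: float
    end: float

def format_ts(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def words_to_srt(words: list[Word], max_chars: int = 37, max_secs: float = 6.0) -> str:
    """Greedily group time-stamped words into subtitle cues and render them as SRT."""
    cues: list[list[Word]] = []
    current: list[Word] = []
    for w in words:
        candidate = " ".join(x.text for x in current + [w])
        # Start a new cue when the text gets too long or the cue stays on screen too long.
        if current and (len(candidate) > max_chars or w.end - current[0].start > max_secs):
            cues.append(current)
            current = []
        current.append(w)
    if current:
        cues.append(current)
    blocks = []
    for i, cue in enumerate(cues, start=1):
        text = " ".join(w.text for w in cue)
        blocks.append(f"{i}\n{format_ts(cue[0].start)} --> {format_ts(cue[-1].end)}\n{text}\n")
    return "\n".join(blocks)

# Example: a few words from a made-up recognizer become SRT cues.
if __name__ == "__main__":
    demo = [Word("Recent", 0.0, 0.4), Word("advances", 0.5, 1.0),
            Word("in", 1.1, 1.2), Word("AI", 1.3, 1.8)]
    print(words_to_srt(demo))
```

A human editor would then load the generated SRT file into the editing tool and correct recognition and segmentation errors; reducing that manual effort is the goal of the augmented workflow the paper evaluates.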

Published In

IMX '21: Proceedings of the 2021 ACM International Conference on Interactive Media Experiences
June 2021
331 pages
ISBN: 9781450383899
DOI: 10.1145/3452918

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Assisted Subtitling
  2. Augmented Intelligence
  3. Machine Learning
  4. Subtitling
  5. Subtitling Tool

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

IMX '21

Acceptance Rates

Overall Acceptance Rate 69 of 245 submissions, 28%

Bibliometrics & Citations

Article Metrics

  • Downloads (last 12 months): 135
  • Downloads (last 6 weeks): 22
Reflects downloads up to 17 Feb 2025

Cited By

  • (2025) From the Rise of AI to Future Shock – State of the Art. 2025 IEEE 23rd World Symposium on Applied Machine Intelligence and Informatics (SAMI), 000107-000114. https://doi.org/10.1109/SAMI63904.2025.10883625. Online publication date: 23-Jan-2025.
  • (2024) Barriers to Industry Adoption of AI Video Generation Tools: A Study Based on the Perspectives of Video Production Professionals in China. Applied Sciences 14(13), 5770. https://doi.org/10.3390/app14135770. Online publication date: 1-Jul-2024.
  • (2024) Fandom meets artificial intelligence: Rethinking participatory culture as human–community–machine interactions. European Journal of Cultural Studies 27(4), 778-787. https://doi.org/10.1177/13675494241236146. Online publication date: 10-Mar-2024.
  • (2024) Record, Transcribe, Share: An Accessible Open-Source Video Platform for Deaf and Hard of Hearing Viewers. Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility, 1-6. https://doi.org/10.1145/3663548.3688495. Online publication date: 27-Oct-2024.
  • (2024) Human Interest or Conflict? Leveraging LLMs for Automated Framing Analysis in TV Shows. Proceedings of the 2024 ACM International Conference on Interactive Media Experiences, 157-167. https://doi.org/10.1145/3639701.3656308. Online publication date: 7-Jun-2024.
  • (2024) Confides: A Visual Analytics Solution for Automated Speech Recognition Analysis and Exploration. 2024 IEEE Visualization and Visual Analytics (VIS), 271-275. https://doi.org/10.1109/VIS55277.2024.00062. Online publication date: 13-Oct-2024.
  • (2024) Customization of Closed Captions via Large Language Models. Computers Helping People with Special Needs, 50-58. https://doi.org/10.1007/978-3-031-62849-8_7. Online publication date: 5-Jul-2024.
