DOI: 10.1145/3452918.3458792

Evaluating AI assisted subtitling

Published: 23 June 2021

Abstract

Recent advances in artificial intelligence (AI) have led to an increased focus on automating media production. One relevant application area for AI is using speech recognition to create subtitles and closed captions for videos. Machine-learning-based AI methods are, however, still not reliable enough to produce perfect, or even acceptable, subtitles on their own. To compensate for this unreliability, AI can be used to build tools that support, rather than replace, human effort, creating semi-automated workflows. In this paper, we present a prototype that integrates automated speech recognition for subtitling into an existing production-grade video editing tool. In an experiment with 25 participants, we tested the efficiency and effectiveness of this tool compared to a fully manual process. The results show a significant increase in both effectiveness and efficiency for subtitling novices. At the same time, participants found the augmented process more demanding. We identify usability issues and design choices that pertain to making augmented subtitling easier.
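
As a concrete illustration of the semi-automated workflow the abstract describes: an AI-assisted subtitling pipeline typically takes time-stamped words from a speech recognizer, groups them into subtitle cues, and hands the result to a human editor for correction. The Python sketch below is a minimal, hypothetical version of the grouping step; the Word type, the greedy chunking rule, and the limits (37 characters and 6 seconds per cue) are illustrative assumptions, not the prototype evaluated in the paper.

```python
from dataclasses import dataclass

@dataclass
class Word:
    """One recognized word with start/end times in seconds (hypothetical ASR output)."""
    text: str
    start: float
    end: float

def format_ts(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def words_to_srt(words: list[Word], max_chars: int = 37, max_secs: float = 6.0) -> str:
    """Greedily group time-stamped words into subtitle cues and render them as SRT."""
    cues: list[list[Word]] = []
    current: list[Word] = []
    for w in words:
        candidate = " ".join(x.text for x in current + [w])
        # Start a new cue when the text gets too long or the cue stays on screen too long.
        if current and (len(candidate) > max_chars or w.end - current[0].start > max_secs):
            cues.append(current)
            current = []
        current.append(w)
    if current:
        cues.append(current)
    blocks = []
    for i, cue in enumerate(cues, start=1):
        text = " ".join(w.text for w in cue)
        blocks.append(f"{i}\n{format_ts(cue[0].start)} --> {format_ts(cue[-1].end)}\n{text}\n")
    return "\n".join(blocks)

# Example: a few words from a made-up recognizer become SRT cues.
if __name__ == "__main__":
    demo = [Word("Recent", 0.0, 0.4), Word("advances", 0.5, 1.0),
            Word("in", 1.1, 1.2), Word("AI", 1.3, 1.8)]
    print(words_to_srt(demo))
```

A human editor would then load the generated SRT file into the editing tool and correct recognition and segmentation errors; reducing that manual effort is the goal of the augmented workflow the paper evaluates.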

Published In

IMX '21: Proceedings of the 2021 ACM International Conference on Interactive Media Experiences
June 2021
331 pages
ISBN: 9781450383899
DOI: 10.1145/3452918

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Assisted Subtitling
  2. Augmented Intelligence
  3. Machine Learning
  4. Subtitling
  5. Subtitling Tool

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

IMX '21

Acceptance Rates

Overall Acceptance Rate 69 of 245 submissions, 28%

Bibliometrics & Citations

Article Metrics

  • Downloads (last 12 months): 135
  • Downloads (last 6 weeks): 22
Reflects downloads up to 17 Feb 2025

Cited By

  • (2025) From the Rise of AI to Future Shock – State of the Art. 2025 IEEE 23rd World Symposium on Applied Machine Intelligence and Informatics (SAMI), 000107-000114. https://doi.org/10.1109/SAMI63904.2025.10883625. Online publication date: 23-Jan-2025.
  • (2024) Barriers to Industry Adoption of AI Video Generation Tools: A Study Based on the Perspectives of Video Production Professionals in China. Applied Sciences 14(13), 5770. https://doi.org/10.3390/app14135770. Online publication date: 1-Jul-2024.
  • (2024) Fandom meets artificial intelligence: Rethinking participatory culture as human–community–machine interactions. European Journal of Cultural Studies 27(4), 778-787. https://doi.org/10.1177/13675494241236146. Online publication date: 10-Mar-2024.
  • (2024) Record, Transcribe, Share: An Accessible Open-Source Video Platform for Deaf and Hard of Hearing Viewers. Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility, 1-6. https://doi.org/10.1145/3663548.3688495. Online publication date: 27-Oct-2024.
  • (2024) Human Interest or Conflict? Leveraging LLMs for Automated Framing Analysis in TV Shows. Proceedings of the 2024 ACM International Conference on Interactive Media Experiences, 157-167. https://doi.org/10.1145/3639701.3656308. Online publication date: 7-Jun-2024.
  • (2024) Confides: A Visual Analytics Solution for Automated Speech Recognition Analysis and Exploration. 2024 IEEE Visualization and Visual Analytics (VIS), 271-275. https://doi.org/10.1109/VIS55277.2024.00062. Online publication date: 13-Oct-2024.
  • (2024) Customization of Closed Captions via Large Language Models. Computers Helping People with Special Needs, 50-58. https://doi.org/10.1007/978-3-031-62849-8_7. Online publication date: 5-Jul-2024.
