DOI: 10.1145/3663548.3688495
Poster

Record, Transcribe, Share: An Accessible Open-Source Video Platform for Deaf and Hard of Hearing Viewers

Published: 27 October 2024

Abstract

Providing accessible videos is crucial for reaching a diverse audience. However, creating and distributing such videos demands significant effort and technical expertise. While several commercial platforms offer all-in-one solutions with a strong user experience, their use can be hindered by privacy concerns and budget constraints, particularly in higher education settings. To address this issue, we present an open-source platform that integrates several open-source developments in automatic speech recognition and real-time collaboration. The platform serves both as a production-ready system and as a testbed for exploring new technologies and ideas through user evaluations. It supports a seamless workflow from video capture to transcription and delivery in both offline and real-time scenarios. We describe the design of the system, the design decisions informed by previous studies, its implementation, and preliminary evaluation results. The platform can be used by educational institutions to provide accessible video content and by researchers for further development and experimentation.
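
The abstract does not spell out the transcription pipeline, but the offline path it describes (recorded video → ASR → caption track) can be illustrated with a short sketch. This is a minimal example only, assuming the open-source Whisper model via the openai-whisper package and illustrative file names; it is not the platform's actual implementation.

```python
# Minimal sketch: offline transcription of a recorded video to WebVTT captions.
# Assumes the open-source "openai-whisper" package and ffmpeg on the PATH;
# the model choice and file names are illustrative, not the platform's actual setup.
import whisper


def to_timestamp(t: float) -> str:
    """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    hours, rest = divmod(t, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{int(hours):02d}:{int(minutes):02d}:{seconds:06.3f}"


def transcribe_to_vtt(video_path: str, vtt_path: str, model_name: str = "base") -> None:
    """Transcribe a video file and write its segments as a WebVTT caption track."""
    model = whisper.load_model(model_name)
    result = model.transcribe(video_path)
    with open(vtt_path, "w", encoding="utf-8") as vtt:
        vtt.write("WEBVTT\n\n")
        for segment in result["segments"]:
            start, end = to_timestamp(segment["start"]), to_timestamp(segment["end"])
            vtt.write(f"{start} --> {end}\n{segment['text'].strip()}\n\n")


if __name__ == "__main__":
    transcribe_to_vtt("lecture.mp4", "lecture.vtt")
```

For the real-time scenario mentioned in the abstract, the same segment structure would be produced incrementally from a live audio stream and pushed to viewers (and to collaborative caption editors) as it becomes available, rather than written to a file once the recording ends.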



Information

Published In

ASSETS '24: Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility
October 2024
1475 pages
ISBN: 9798400706776
DOI: 10.1145/3663548
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 October 2024


Author Tags

  1. automatic speech recognition
  2. collaboration
  3. crowdsourcing
  4. real-time
  5. streaming

Qualifiers

  • Poster
  • Research
  • Refereed limited

Conference

ASSETS '24

Acceptance Rates

Overall acceptance rate: 436 of 1,556 submissions (28%)


