Abstract
With the growing number of online videos, many producers want to use captions to expand content accessibility, but they face two main issues: producing the textual transcript and aligning it with the audio. Both activities are expensive, requiring either considerable human labor or dedicated software. In this paper, we focus on caption alignment and propose a novel, automatic, simple, and low-cost mechanism that requires neither human transcription nor special dedicated software. Our mechanism intelligently inserts copies of a unique audio markup into the audio stream before feeding it to an off-the-shelf automatic speech recognition (ASR) application; it then transforms the plain transcript produced by the ASR application into a timecoded transcript, which tells video players when to display each caption during playback. Our experimental evaluation shows that the proposal is effective in producing timecoded transcripts and can therefore help expand video content accessibility.
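The alignment step described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: it assumes the marker word (here `MARKZZ`) was inserted at known times, splits the ASR transcript at each recognized marker, and pairs the resulting segments with those times to emit SRT-style captions. All names (`build_timecoded_transcript`, `to_srt_time`, the marker string) are illustrative.

```python
def to_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def build_timecoded_transcript(asr_text, marker, marker_times):
    """Split the plain ASR transcript at each marker occurrence and pair
    each caption segment with the known insertion times of the markers.

    marker_times must hold one more entry than there are caption segments,
    so that segment i spans marker_times[i] .. marker_times[i + 1]."""
    segments = [seg.strip() for seg in asr_text.split(marker)]
    segments = [seg for seg in segments if seg]  # drop empty edge segments
    captions = []
    for i, seg in enumerate(segments):
        start, end = marker_times[i], marker_times[i + 1]
        captions.append(f"{i + 1}\n{to_srt_time(start)} --> {to_srt_time(end)}\n{seg}\n")
    return "\n".join(captions)

# Example: two captions delimited by three markers at 0.0 s, 2.5 s and 5.0 s.
srt = build_timecoded_transcript(
    "MARKZZ hello world MARKZZ this is a test MARKZZ",
    "MARKZZ",
    [0.0, 2.5, 5.0],
)
```

A real pipeline would also have to tolerate markers the ASR engine misrecognizes or drops, which is part of what the paper's mechanism addresses.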
Cite this article
Federico, M., Furini, M. An automatic caption alignment mechanism for off-the-shelf speech recognition technologies. Multimed Tools Appl 72, 21–40 (2014). https://doi.org/10.1007/s11042-012-1318-3