Correcting automatic speech recognition captioning errors in real time

Abstract

Lectures can be digitally recorded and replayed to provide multimedia revision material for students who attended the class and a substitute learning experience for students unable to attend. Deaf and hard of hearing people can find it difficult to follow speech through hearing alone or to take notes while lip-reading or watching a sign-language interpreter. Synchronising the speech with text captions can ensure deaf students are not disadvantaged and help all learners search for specific, relevant parts of the multimedia recording by means of the synchronised text. Automatic speech recognition has been used to provide real-time captioning directly from lecturers’ speech in classrooms, but it has proved difficult to obtain accuracy comparable to stenography. This paper describes the development, testing and evaluation of a system that enables editors to correct errors in the captions as they are created by automatic speech recognition, and suggests possible future improvements.
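
The core mechanism the abstract describes, a human editor correcting recogniser output while the captions are still being produced, can be pictured as a delayed-commit buffer. The sketch below is a minimal, hypothetical illustration rather than the authors' system: the fixed-delay policy and all names (CaptionBuffer, add_segment, correct, flush) are assumptions introduced for exposition.

```python
import time
from collections import deque


class CaptionBuffer:
    """Hold recognised caption segments for a short editing window
    before committing them to the viewers' display."""

    def __init__(self, delay_seconds=5.0):
        self.delay = delay_seconds
        self.pending = deque()  # each entry: [arrival_time, segment_id, text]
        self.next_id = 0

    def add_segment(self, text):
        """Called as the recogniser emits each hypothesised segment;
        returns the identifier an editor can use to correct it."""
        self.pending.append([time.time(), self.next_id, text])
        self.next_id += 1
        return self.next_id - 1

    def correct(self, segment_id, new_text):
        """Replace a pending segment's text; returns False if the
        segment has already been committed (too late to correct)."""
        for entry in self.pending:
            if entry[1] == segment_id:
                entry[2] = new_text
                return True
        return False

    def flush(self):
        """Commit and return segments whose editing window has expired."""
        committed, now = [], time.time()
        while self.pending and now - self.pending[0][0] >= self.delay:
            committed.append(self.pending.popleft()[2])
        return committed
```

In use, a recogniser thread would call add_segment as hypotheses arrive, an editor interface would call correct with the identifier shown alongside each pending segment, and a display loop would periodically flush expired segments to the caption feed. The fixed delay embodies the trade-off any real-time correction scheme must make between caption latency and the time an editor has to intervene.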

Author information

Corresponding author

Correspondence to Mike Wald.

About this article

Cite this article

Wald, M., Bell, J. M., Boulain, P., et al. Correcting automatic speech recognition captioning errors in real time. Int J Speech Technol 10, 1–15 (2007). https://doi.org/10.1007/s10772-008-9014-4
