Abstract
Speech transcription is important for video/audio retrieval and many other applications. In automatic speech transcription, recognition errors are inevitable, which makes user feedback such as manual error correction necessary. In this paper, an approach is proposed to improve the accuracy of speech transcription by exploiting user feedback and word repetition. The method learns from user feedback and the recognition results of preceding utterances, and then corrects errors when repeated words are falsely recognized in following utterances. An interaction scheme for user feedback is proposed, which facilitates error correction through candidate lists and provides a new kind of feedback, referred to as word indication, to extend error correction from repeated words to repeated phrases. For template extraction and matching, a representation of word templates and recognition results based on the syllable confusion network (SCN) is proposed. During transcription, SCN-based templates of multi-syllable words/phrases are extracted from user feedback and the N-best lattice, and are then matched against the SCN of each subsequent utterance's recognition result to yield a new candidate list whenever repeated words are detected. Experimental results show that considerable error reduction is achieved in the newly generated candidate lists.
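The template-matching step described above can be illustrated with a minimal sketch. This is not the authors' implementation; it assumes a simplified SCN representation in which each slot maps candidate syllables to posterior probabilities, and a word template is the syllable sequence of a user-corrected word. The function name `match_template` and the `threshold` parameter are illustrative choices.

```python
from typing import Dict, List, Tuple

# One dict of {syllable: posterior probability} per SCN slot.
SCN = List[Dict[str, float]]

def match_template(scn: SCN, template: List[str],
                   threshold: float = 0.01) -> List[Tuple[int, float]]:
    """Find positions where a multi-syllable word template matches the SCN.

    Returns (start_slot, score) pairs: the template matches at start_slot
    if every syllable of the template appears (with posterior above
    `threshold`) in the corresponding consecutive SCN slots. The score is
    the product of the matched syllables' posteriors.
    """
    hits = []
    n = len(template)
    for start in range(len(scn) - n + 1):
        score = 1.0
        for offset, syllable in enumerate(template):
            p = scn[start + offset].get(syllable, 0.0)
            if p < threshold:
                score = 0.0
                break
            score *= p
        if score > 0.0:
            hits.append((start, score))
    return hits
```

A detected match would then promote the corrected word into a new candidate list for that span, ranked by the match score.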
About this article
Cite this article
Wang, X., Yang, Y., Liu, H. et al. Improving speech transcription by exploiting user feedback and word repetition. Multimed Tools Appl 76, 20359–20376 (2017). https://doi.org/10.1007/s11042-017-4714-x