
Improving speech transcription by exploiting user feedback and word repetition

Published in Multimedia Tools and Applications

Abstract

Speech transcription is important for video/audio retrieval and many other applications. In automatic speech transcription, recognition errors are inevitable, which makes user feedback such as manual error correction necessary. In this paper, an approach is proposed to improve the accuracy of speech transcription by exploiting user feedback and word repetition. The method aims at learning from user feedback and the recognition results of preceding utterances, and then correcting errors when repeated words are falsely recognized in subsequent utterances. An interaction scheme for user feedback is proposed, which facilitates error correction through candidate lists and provides a new kind of feedback, referred to as word indication, to extend error correction from repeated words to repeated phrases. For template extraction and matching, a representation of word templates and recognition results based on the syllable confusion network (SCN) is proposed. During transcription, SCN-based templates of multi-syllable words/phrases are extracted from user feedback and the N-best lattice, and then matched against the SCNs of subsequent utterances to yield a new candidate list when repeated words are detected. Experimental results show that considerable error reduction is achieved in the newly generated candidate lists.
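As a rough illustration of the template matching described in the abstract, the sketch below matches the syllable template of a user-corrected word against a toy syllable confusion network to flag likely repetitions. This is a minimal sketch, not the authors' implementation: the SCN data structure, the averaged-posterior score, and the threshold are assumptions made for illustration only.

```python
# Minimal sketch (illustrative assumptions, not the paper's actual algorithm):
# detect a repeated word by matching its syllable template against a
# syllable confusion network (SCN) of a later utterance.

from typing import Dict, List, Tuple

SCNSlot = Dict[str, float]   # candidate syllable -> posterior probability
SCN = List[SCNSlot]          # one slot per syllable position in the utterance
Template = List[str]         # syllable sequence of a user-corrected word/phrase


def match_template(template: Template, scn: SCN,
                   threshold: float = 0.3) -> List[Tuple[int, float]]:
    """Slide the template over the SCN; return (start_slot, score) for spans whose
    average posterior of the template syllables exceeds the (assumed) threshold."""
    matches = []
    span = len(template)
    for start in range(len(scn) - span + 1):
        # Posterior of each template syllable in its aligned slot (0.0 if absent).
        slot_scores = [scn[start + i].get(syl, 0.0) for i, syl in enumerate(template)]
        score = sum(slot_scores) / span
        if score >= threshold:
            matches.append((start, score))
    return matches


if __name__ == "__main__":
    # Toy SCN for a later utterance: each slot holds competing syllables.
    scn = [
        {"yu": 0.55, "wu": 0.45},
        {"yin": 0.50, "ying": 0.50},
        {"shi": 0.70, "si": 0.30},
    ]
    # Template of a two-syllable word corrected earlier via user feedback.
    template = ["yu", "yin"]
    for start, score in match_template(template, scn):
        # A match marks a likely repetition; the corrected word would then be
        # promoted into the candidate list covering slots [start, start + len(template)).
        print(f"repeated word detected at slot {start} (score {score:.2f})")
```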


Author information


Corresponding author

Correspondence to Xiangdong Wang.


About this article


Cite this article

Wang, X., Yang, Y., Liu, H. et al. Improving speech transcription by exploiting user feedback and word repetition. Multimed Tools Appl 76, 20359–20376 (2017). https://doi.org/10.1007/s11042-017-4714-x


