Combining Continuous Word Representation and Prosodic Features for ASR Error Prediction

Conference paper in Statistical Language and Speech Processing (SLSP 2015)

Abstract

Recent advances in continuous word representation have been applied successfully to several natural language processing tasks. This paper focuses on error prediction in Automatic Speech Recognition (ASR) outputs and investigates the use of continuous word representations (word embeddings) within a neural network architecture.

The main contribution of this paper concerns the combination of word embeddings: several combination approaches are proposed in order to take advantage of the complementarity of different embeddings. The use of prosodic features, in addition to classical syntactic ones, is also evaluated.

Experiments are conducted on automatic transcriptions generated by the LIUM ASR system applied to the ETAPE corpus. They show that the proposed neural architecture, using an effective combination of continuous word representations and prosodic features as additional features, significantly outperforms a state-of-the-art approach based on Conditional Random Fields. Finally, the proposed system produces a well-calibrated confidence measure, evaluated in terms of Normalized Cross Entropy.
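The abstract does not spell out how Normalized Cross Entropy (NCE) is computed. As a rough illustration only, the sketch below follows the commonly used definition NCE = (H_max - H_conf) / H_max, where H_max is the entropy of a baseline that assigns every word the empirical probability of being correct, and H_conf is the cross entropy of the system's per-word confidences. The function name, argument names, and the clamping constant are assumptions made for this example and are not taken from the paper.

```python
import math


def normalized_cross_entropy(confidences, is_correct, eps=1e-12):
    """Illustrative NCE for per-word confidence scores (hypothetical helper).

    confidences: estimated probabilities that each hypothesis word is correct.
    is_correct:  booleans, True where the hypothesis word is actually correct.
    eps:         clamping constant (an assumption) to avoid log(0).
    """
    if not confidences or len(confidences) != len(is_correct):
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(confidences)
    n_correct = sum(1 for ok in is_correct if ok)
    if n_correct in (0, n):
        # The baseline entropy is zero here, so NCE is undefined.
        raise ValueError("NCE is undefined when all words share one label")

    # Baseline: every word receives the empirical accuracy as its confidence.
    p_c = n_correct / n
    h_max = -(n_correct * math.log2(p_c)
              + (n - n_correct) * math.log2(1.0 - p_c))

    # Cross entropy of the actual confidence scores against the labels.
    h_conf = 0.0
    for c, ok in zip(confidences, is_correct):
        c = min(max(c, eps), 1.0 - eps)
        h_conf -= math.log2(c) if ok else math.log2(1.0 - c)

    return (h_max - h_conf) / h_max
```

Under this definition, a score close to 1 indicates confidences that are nearly perfect, a score of 0 means the confidences carry no more information than the overall word accuracy, and a negative score means they are worse than that baseline.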


Notes

  1. http://macaon.lif.univ-mrs.fr

  2. http://www.icsi.berkeley.edu/Speech/docs/sctk-1.2/sclite.htm

  3. http://wapiti.limsi.fr

Acknowledgments

This work was partially funded by the European Commission through the EUMSSI project, under the contract number 611057, in the framework of the FP7-ICT-2013-10 call, by the French National Research Agency (ANR) through the VERA project, under the contract number ANR-12-BS02-006-01, and by the Région Pays de la Loire.

Author information

Correspondence to Sahar Ghannay.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Ghannay, S., Estève, Y., Camelin, N., Dutrey, C., Santiago, F., Adda-Decker, M. (2015). Combining Continuous Word Representation and Prosodic Features for ASR Error Prediction. In: Dediu, AH., Martín-Vide, C., Vicsi, K. (eds) Statistical Language and Speech Processing. SLSP 2015. Lecture Notes in Computer Science, vol 9449. Springer, Cham. https://doi.org/10.1007/978-3-319-25789-1_9

  • DOI: https://doi.org/10.1007/978-3-319-25789-1_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-25788-4

  • Online ISBN: 978-3-319-25789-1

  • eBook Packages: Computer Science, Computer Science (R0)
