Combining Continuous Word Representation and Prosodic Features for ASR Error Prediction

Ghannay, Sahar; Estève, Yannick; Camelin, Nathalie; Dutrey, Camille; Santiago, Fabian; Adda-Decker, Martine

doi:10.1007/978-3-319-25789-1_9

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9449)

Included in the following conference series:

International Conference on Statistical Language and Speech Processing

Abstract

Recent advances in continuous word representations have been applied successfully to several natural language processing tasks. This paper focuses on error prediction in Automatic Speech Recognition (ASR) outputs and investigates the use of continuous word representations (word embeddings) within a neural network architecture.

The main contribution of this paper concerns the combination of word embeddings: several combination approaches are proposed to take advantage of their complementarity. The use of prosodic features, in addition to classical syntactic ones, is also evaluated.
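The abstract does not say which combination approaches are used, so the following is only a minimal sketch of one simple possibility, plain concatenation of the vectors that several embedding tables assign to the same word. The function name `combine_by_concatenation`, the toy tables, and the zero-vector fallback for unknown words are all assumptions made for illustration, not the authors' method.

```python
def combine_by_concatenation(word, embedding_tables, dims):
    """Concatenate the vectors that several embedding tables give for `word`.

    embedding_tables: list of dicts mapping a word to a list of floats
                      (hypothetical toy format, one dict per embedding type).
    dims:             dimension of each table, used to build a zero vector
                      when a table does not contain the word (an assumption).
    """
    combined = []
    for table, dim in zip(embedding_tables, dims):
        combined.extend(table.get(word, [0.0] * dim))
    return combined


# Toy example with two hypothetical embedding tables of sizes 2 and 3.
table_a = {"bonjour": [0.1, 0.2]}
table_b = {"bonjour": [0.3, 0.4, 0.5]}
vector = combine_by_concatenation("bonjour", [table_a, table_b], [2, 3])
```

Concatenation keeps every dimension of every embedding, so the combined vector grows with each table added; approaches that compress the combined vector would trade some of that information for a smaller input layer.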

Experiments are carried out on automatic transcriptions generated by the LIUM ASR system applied to the ETAPE corpus. They show that the proposed neural architecture, using an effective combination of continuous word representations together with prosodic features, significantly outperforms a state-of-the-art approach based on Conditional Random Fields. Finally, the proposed system produces a well-calibrated confidence measure, evaluated in terms of Normalized Cross Entropy.
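For readers unfamiliar with the calibration metric, the sketch below computes Normalized Cross Entropy (NCE) in its commonly used form, which compares the entropy of the confidence scores with that of a baseline assigning every word the average correctness rate. The function name, the clamping constant `eps`, and the toy inputs are assumptions for illustration; the abstract does not give the exact evaluation setup.

```python
import math


def normalized_cross_entropy(confidences, is_correct, eps=1e-12):
    """NCE = (H_max - H_conf) / H_max, with entropies in bits.

    confidences: per-word probability that the recognized word is correct.
    is_correct:  per-word booleans from the reference alignment.
    """
    n_total = len(confidences)
    n_correct = sum(1 for ok in is_correct if ok)
    if n_correct == 0 or n_correct == n_total:
        raise ValueError("NCE is undefined when all words share one label")

    # Baseline: every word gets the empirical probability of being correct.
    p_c = n_correct / n_total
    h_max = -(n_correct * math.log2(p_c)
              + (n_total - n_correct) * math.log2(1.0 - p_c))

    # Entropy of the actual confidence scores against the true labels.
    h_conf = 0.0
    for p, ok in zip(confidences, is_correct):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        h_conf -= math.log2(p) if ok else math.log2(1.0 - p)

    return (h_max - h_conf) / h_max


labels = [True, True, True, False]
nce_baseline = normalized_cross_entropy([0.75, 0.75, 0.75, 0.75], labels)
nce_perfect = normalized_cross_entropy([1.0, 1.0, 1.0, 0.0], labels)
```

Under this definition a score near 1 indicates confidences that match the labels almost exactly, a score of 0 means the confidences carry no more information than the average correctness rate, and negative values indicate confidences that are worse than that baseline.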

Acknowledgments

This work was partially funded by the European Commission through the EUMSSI project, under the contract number 611057, in the framework of the FP7-ICT-2013-10 call, by the French National Research Agency (ANR) through the VERA project, under the contract number ANR-12-BS02-006-01, and by the Région Pays de la Loire.

Author information

Authors and Affiliations

LIUM - University of Le Mans, Le Mans, France
Sahar Ghannay, Yannick Estève & Nathalie Camelin
LPP - Université Sorbonne Nouvelle, Paris, France
Camille Dutrey, Fabian Santiago & Martine Adda-Decker

Corresponding author

Correspondence to Sahar Ghannay.

Editor information

Editors and Affiliations

Research Group on Mathematical Linguistic, Rovira i Virgili University, Tarragona, Spain
Adrian-Horia Dediu
Research Group on Mathematical Linguistic, Rovira i Virgili University, Tarragona, Spain
Carlos Martín-Vide
Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Budapest, Hungary
Klára Vicsi

About this paper

Cite this paper

Ghannay, S., Estève, Y., Camelin, N., Dutrey, C., Santiago, F., Adda-Decker, M. (2015). Combining Continuous Word Representation and Prosodic Features for ASR Error Prediction. In: Dediu, AH., Martín-Vide, C., Vicsi, K. (eds) Statistical Language and Speech Processing. SLSP 2015. Lecture Notes in Computer Science, vol 9449. Springer, Cham. https://doi.org/10.1007/978-3-319-25789-1_9

DOI: https://doi.org/10.1007/978-3-319-25789-1_9
Published: 17 November 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25788-4
Online ISBN: 978-3-319-25789-1
eBook Packages: Computer Science, Computer Science (R0)