Abstract
Recent advances in continuous word representations have been applied successfully to several natural language processing tasks. This paper focuses on error prediction in Automatic Speech Recognition (ASR) outputs and investigates the use of continuous word representations (word embeddings) within a neural network architecture.
The main contribution of this paper concerns the combination of word embeddings: several combination approaches are proposed in order to take advantage of their complementarity. The use of prosodic features, in addition to classical syntactic ones, is also evaluated.
Experiments are carried out on automatic transcriptions generated by the LIUM ASR system applied to the ETAPE corpus. They show that the proposed neural architecture, using an effective combination of continuous word representations together with prosodic features, significantly outperforms a state-of-the-art approach based on Conditional Random Fields. Finally, the proposed system produces a well-calibrated confidence measure, evaluated in terms of Normalized Cross Entropy.
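For readers unfamiliar with the calibration metric named above, the following is a minimal sketch of Normalized Cross Entropy as it is commonly defined for ASR confidence scores. It is not the authors' implementation. The function name, the clipping constant `eps`, and the toy inputs are assumptions introduced here for illustration only.

```python
import math


def normalized_cross_entropy(confidences, is_correct):
    """NCE of per-word confidence scores against correct/error labels.

    confidences: estimated probabilities that each word is correct.
    is_correct:  booleans, True when the word was correctly recognized.
    """
    n = len(confidences)
    n_correct = sum(is_correct)
    p_c = n_correct / n  # empirical probability that a word is correct
    if p_c in (0.0, 1.0):
        raise ValueError("NCE is undefined when all words share one label")

    eps = 1e-12  # assumption: clip scores so that log2(0) never occurs

    # Entropy of the baseline that assigns p_c to every word.
    h_max = -(n_correct * math.log2(p_c)
              + (n - n_correct) * math.log2(1.0 - p_c))

    # Cross entropy of the actual confidence scores.
    h_conf = 0.0
    for c, ok in zip(confidences, is_correct):
        c = min(max(c, eps), 1.0 - eps)
        h_conf -= math.log2(c) if ok else math.log2(1.0 - c)

    return (h_max - h_conf) / h_max


# Hypothetical toy data: three correct words and one error.
labels = [True, True, True, False]
baseline = normalized_cross_entropy([0.75] * 4, labels)
perfect = normalized_cross_entropy([1.0, 1.0, 1.0, 0.0], labels)
inverted = normalized_cross_entropy([0.1, 0.1, 0.1, 0.9], labels)
```

Under this definition, scores equal to the empirical accuracy give a value of 0, perfectly informative scores approach 1, and misleading scores yield a negative value.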
Acknowledgments
This work was partially funded by the European Commission through the EUMSSI project, under the contract number 611057, in the framework of the FP7-ICT-2013-10 call, by the French National Research Agency (ANR) through the VERA project, under the contract number ANR-12-BS02-006-01, and by the Région Pays de la Loire.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Ghannay, S., Estève, Y., Camelin, N., Dutrey, C., Santiago, F., Adda-Decker, M. (2015). Combining Continuous Word Representation and Prosodic Features for ASR Error Prediction. In: Dediu, AH., Martín-Vide, C., Vicsi, K. (eds) Statistical Language and Speech Processing. SLSP 2015. Lecture Notes in Computer Science, vol 9449. Springer, Cham. https://doi.org/10.1007/978-3-319-25789-1_9
DOI: https://doi.org/10.1007/978-3-319-25789-1_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25788-4
Online ISBN: 978-3-319-25789-1
eBook Packages: Computer Science, Computer Science (R0)