Abstract
This document describes the automatic speech-to-text transcription systems used by Vocapia Research in the open unconstrained automatic speech recognition (ASR) task of the Evalita 2011 evaluation. The aim of this evaluation was to transcribe audio recordings of parliament sessions in Italian. About 30 h of untranscribed audio data and one year of minutes from parliament sessions were provided as a training corpus. This corpus was used to carry out an unsupervised adaptation of Vocapia’s Italian broadcast speech transcription system. Transcriptions produced by two systems were submitted. The primary system has a single decoding pass and was optimized to run in real time. The contrastive system, developed in collaboration with Limsi-CNRS, has two decoding passes and runs in about 5×RT (five times real time). The case-insensitive word error rates (WER) of the primary and contrastive systems are 10.2% and 9.3%, respectively, on the Evalita development data, and 6.4% and 5.4% on the evaluation data.
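The two figures of merit quoted in the abstract, case-insensitive WER and the real-time factor, can be illustrated with a minimal sketch. This is not the Evalita scoring tool: the function names `word_error_rate` and `real_time_factor` are hypothetical, tokenization is plain whitespace splitting, and no text normalization other than lowercasing is applied, all of which are assumptions made for illustration only. WER is computed here as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Case-insensitive WER for one reference/hypothesis pair.

    Assumption: words are separated by whitespace and lowercasing is the
    only normalization; official scoring tools may normalize differently.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words (Levenshtein, row by row).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration (1.0 means real time)."""
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    # Made-up sentences, not taken from the Evalita data.
    ref = "Il Presidente apre la seduta"
    hyp = "il presidente apre seduta"
    print(f"WER: {word_error_rate(ref, hyp):.1%}")
    print(f"RTF: {real_time_factor(150.0, 30.0):.1f}xRT")
```

In this sketch a system that needs 150 s to decode 30 s of audio has a real-time factor of 5, which is how a figure such as "about 5×RT" is usually read. A corpus-level WER would normally sum the edit counts and the reference word counts over all utterances before dividing, rather than averaging per-utterance rates.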
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Despres, J. et al. (2013). The Vocapia Research ASR Systems for Evalita 2011. In: Magnini, B., Cutugno, F., Falcone, M., Pianta, E. (eds) Evaluation of Natural Language and Speech Tools for Italian. EVALITA 2012. Lecture Notes in Computer Science, vol 7689. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35828-9_31
DOI: https://doi.org/10.1007/978-3-642-35828-9_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35827-2
Online ISBN: 978-3-642-35828-9
eBook Packages: Computer Science, Computer Science (R0)