Abstract
Luxembourgish is embedded in a multilingual context on the divide between Romance and Germanic cultures and remains one of Europe’s low-resourced languages. We describe our efforts in building a large vocabulary ASR system for such a “minority” language without resorting to any prior transcribed audio training data. Instead, acoustic models are derived from major European languages. Furthermore, most Luxembourgish written sources include significant parts in other languages. This poses specific challenges to Language Model estimation. Some scientific and technological issues addressed include: (i) how to build acoustic models if no labeled acoustic training data are available for the under-resourced target language? (ii) how to make use of the new system to accelerate resource production for the target language? (iii) how to build a vocabulary and a language model with multilingual written texts? (iv) how to determine the “best” phonemic inventory for ASR? First ASR results illustrate the accuracy of the various sets of monolingual and multilingual acoustic models and what these suggest concerning language typology issues.
Notes
1. Some residual non-Luxembourgish languages, such as Ancient Greek, were rejected because of their special coding alphabets.
Acknowledgments
This work has been partially financed by Oseo under the Quaero program, and supported by LABEX EFL (ANR/CGI).
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Adda-Decker, M., Lamel, L., Adda, G., Lavergne, T. (2014). A First LVCSR System for Luxembourgish, a Low-Resourced European Language. In: Vetulani, Z., Mariani, J. (eds) Human Language Technology Challenges for Computer Science and Linguistics. LTC 2011. Lecture Notes in Computer Science, vol 8387. Springer, Cham. https://doi.org/10.1007/978-3-319-08958-4_39
DOI: https://doi.org/10.1007/978-3-319-08958-4_39
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-08957-7
Online ISBN: 978-3-319-08958-4
eBook Packages: Computer Science, Computer Science (R0)