Multilingual acoustic model training combines data from multiple languages to train an automatic speech recognition (ASR) system. Such a system is beneficial when training data for a target language is limited. Lattice-Free Maximum Mutual Information (LF-MMI) training performs sequence discrimination by introducing competing hypotheses through a denominator graph in the cost function. The standard approach to training a multilingual model with LF-MMI is to pool the acoustic units from all languages and use a common denominator graph. The resulting model is either used as a feature extractor to train an acoustic model for the target language or fine-tuned directly. In this work, we propose a scalable approach to training the multilingual acoustic model using a typical multitask network in the LF-MMI framework: a set of language-dependent denominator graphs is used to compute the cost function. The proposed approach is evaluated on typical multilingual ASR tasks using the GlobalPhone and BABEL datasets. Relative improvements of up to 13.2% in WER are obtained compared to the corresponding monolingual LF-MMI baselines. The implementation is made available as part of the Kaldi speech recognition toolkit.
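The core idea, a shared network whose per-language LF-MMI costs are computed against language-dependent denominator graphs, can be sketched numerically. The sketch below is a toy illustration, not the Kaldi "chain" implementation described in the paper: the denominator graphs are small dense matrices rather than phone-LM FSTs, the language codes and helper names are hypothetical, and the forward-backward over a numerator lattice (here collapsed to a single fixed alignment) as well as gradient computation are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def logsumexp(x, axis=None):
    """Numerically stable log-sum-exp."""
    m = np.max(x, axis=axis, keepdims=True)
    s = m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))
    return np.squeeze(s, axis=axis) if axis is not None else s.item()

def make_denominator_graph(num_states, rng):
    """Toy dense 'denominator graph': log initial and transition probabilities.
    (The paper's graphs are sparse FSTs compiled from per-language phone LMs.)"""
    init = rng.random(num_states); init /= init.sum()
    trans = rng.random((num_states, num_states))
    trans /= trans.sum(axis=1, keepdims=True)
    return np.log(init), np.log(trans)

def denominator_logprob(log_probs, graph):
    """Forward algorithm: log of the total likelihood over all competing paths."""
    log_init, log_trans = graph
    alpha = log_init + log_probs[0]
    for frame in log_probs[1:]:
        alpha = frame + logsumexp(alpha[:, None] + log_trans, axis=0)
    return logsumexp(alpha)

def numerator_logprob(log_probs, alignment, graph):
    """Log-likelihood of the single supervision path (a fixed alignment)."""
    log_init, log_trans = graph
    score = log_init[alignment[0]] + log_probs[0, alignment[0]]
    for t in range(1, len(alignment)):
        score += log_trans[alignment[t - 1], alignment[t]] + log_probs[t, alignment[t]]
    return score

def lfmmi_objective(log_probs, alignment, lang, den_graphs):
    """Per-utterance MMI objective (to be maximized): numerator minus the
    language-dependent denominator, selected by the utterance's language."""
    graph = den_graphs[lang]
    return (numerator_logprob(log_probs, alignment, graph)
            - denominator_logprob(log_probs, graph))

# One denominator graph per language (hypothetical language codes).
S, T = 4, 10
den_graphs = {"fr": make_denominator_graph(S, rng),
              "de": make_denominator_graph(S, rng)}

log_probs = np.log(rng.random((T, S)))   # stand-in for shared-network outputs
alignment = rng.integers(0, S, size=T)   # stand-in supervision path

obj_fr = lfmmi_objective(log_probs, alignment, "fr", den_graphs)
obj_de = lfmmi_objective(log_probs, alignment, "de", den_graphs)
```

Because the supervision path is one of the paths summed in the denominator, the objective is always at most zero; the two languages yield different values for the same utterance, which is precisely what dispatching on language-dependent denominator graphs buys over a single shared graph.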
Cite as: Madikeri, S., Khonglah, B.K., Tong, S., Motlicek, P., Bourlard, H., Povey, D. (2020) Lattice-Free Maximum Mutual Information Training of Multilingual Speech Recognition Systems. Proc. Interspeech 2020, 4746-4750, doi: 10.21437/Interspeech.2020-2919
@inproceedings{madikeri20_interspeech,
  author={Srikanth Madikeri and Banriskhem K. Khonglah and Sibo Tong and Petr Motlicek and Hervé Bourlard and Daniel Povey},
  title={{Lattice-Free Maximum Mutual Information Training of Multilingual Speech Recognition Systems}},
  year={2020},
  booktitle={Proc. Interspeech 2020},
  pages={4746--4750},
  doi={10.21437/Interspeech.2020-2919}
}