Abstract
Representation learning, or pre-training, has shown promising performance for low-resource speech recognition, which suffers from a shortage of data. Recently, self-supervised methods have achieved surprising performance for speech pre-training by effectively exploiting large amounts of unannotated data. In this paper, we propose a new pre-training framework, Cross-Lingual Self-Training (XLST), to further improve the effectiveness of multilingual representation learning. Specifically, XLST first trains a phoneme classification model on a small amount of annotated data from a non-target language and then uses it to produce initial targets for training another model on multilingual unannotated data, i.e., by maximizing the frame-level similarity between the output embeddings of the two models. Furthermore, we employ moving average and multi-view data augmentation mechanisms to better generalize the learned representations. Experimental results on downstream speech recognition tasks for five low-resource languages demonstrate the effectiveness of XLST. Specifically, by leveraging an additional 100 h of annotated English data for pre-training, the proposed XLST achieves a relative 24.8% PER reduction over state-of-the-art self-supervised methods.
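To make the pre-training objective concrete, here is a minimal sketch of a frame-level similarity loss of the kind described above, written in PyTorch; the function name, tensor shapes, choice of cosine similarity, and stop-gradient on the teacher targets are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def frame_similarity_loss(student_emb: torch.Tensor,
                          teacher_emb: torch.Tensor) -> torch.Tensor:
    """Negative mean cosine similarity between per-frame embeddings.

    student_emb, teacher_emb: (batch, frames, dim) outputs of the model
    being trained and of the network that supplies the targets.
    """
    student = F.normalize(student_emb, dim=-1)
    # Targets are treated as fixed: no gradient flows into the teacher.
    teacher = F.normalize(teacher_emb.detach(), dim=-1)
    return -(student * teacher).sum(dim=-1).mean()
```

Maximizing frame-level similarity is equivalent to minimizing this loss, so the sketch drops into a standard training loop unchanged.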
Data Availability
The training datasets used in the current study are available from the CommonVoice corpus (https://commonvoice.mozilla.org) and the LibriSpeech corpus (http://www.openslr.org/12). The evaluation datasets generated during this study are included in the published article [40].
Notes
As the older version is no longer available, we use the December 2019 release, keeping the same number of hours of data as [10] for the 11 languages. The CommonVoice corpus is publicly available at https://commonvoice.mozilla.org.
The LibriSpeech corpus is publicly available at http://www.openslr.org/12. We follow [34] to obtain the aligned frame-level phoneme labels from http://www.kaldi-asr.org/downloads/build/6/trunk/egs/librispeech.
Reproduced from the CommonVoice test data, which can be accessed via [40].
As suggested in [16], for the random-initialization experiment we add an extra MLP predictor (with the same architecture as the projector) on top of the Main Network, and the moving average mechanism (\(\lambda =0.99,\Lambda =32\)) is applied.
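As an illustration of that mechanism, the following hedged sketch updates the target network's parameters as an exponential moving average of the Main Network's, assuming \(\lambda\) is the decay coefficient and \(\Lambda\) the interval (in training steps) between updates; this reading of the two symbols, and the helper itself, are our assumptions rather than the paper's code.

```python
import torch

@torch.no_grad()
def moving_average_update(target: torch.nn.Module, main: torch.nn.Module,
                          step: int, lam: float = 0.99,
                          interval: int = 32) -> None:
    """Every `interval` steps, blend Main Network weights into the target."""
    if step % interval != 0:
        return
    for p_t, p_m in zip(target.parameters(), main.parameters()):
        # p_t <- lam * p_t + (1 - lam) * p_m
        p_t.mul_(lam).add_(p_m, alpha=1.0 - lam)
```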
References
R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, G. Weber, Common voice: a massively-multilingual speech corpus, in Proceedings of the 12th Language Resources and Evaluation Conference, European Language Resources Association (2020), pp. 4218–4222
A. Baevski, S. Schneider, M. Auli, vq-wav2vec: Self-supervised learning of discrete speech representations, in International Conference on Learning Representations (2020)
A. Baevski, Y. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: a framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 33, 12449–12460 (2020)
L. Besacier, E. Barnard, A. Karpov, T. Schultz, Automatic speech recognition for under-resourced languages: a survey. Speech Commun. 56, 85–100 (2014). https://doi.org/10.1016/j.specom.2013.07.008
T. Chen, S. Kornblith, M. Norouzi, G. Hinton, A simple framework for contrastive learning of visual representations, in Proceedings of the 37th International Conference on Machine Learning, ed. by H. Daumé III, A. Singh, vol. 119 (PMLR, 2020), pp. 1597–1607
X. Chen, K. He, Exploring simple siamese representation learning. arXiv preprint arXiv:2011.10566 (2020)
J. Cho, M.K. Baskar, R. Li, M. Wiesner, S.H. Mallidi, N. Yalta, M. Karafiát, S. Watanabe, T. Hori, Multilingual sequence-to-sequence speech recognition: architecture, transfer learning, and language modeling, in SLT (2018), pp. 521–527. https://doi.org/10.1109/SLT.2018.8639655
J. Chorowski, R.J. Weiss, S. Bengio, A. van den Oord, Unsupervised speech representation learning using wavenet autoencoders. IEEE/ACM Trans. Audio Speech Lang. Process. 27(12), 2041–2053 (2019). https://doi.org/10.1109/TASLP.2019.2938863
Y.A. Chung, W.N. Hsu, H. Tang, J. Glass, An unsupervised autoregressive model for speech representation learning, in Proceedings of Interspeech 2019 (2019), pp. 146–150
A. Conneau, A. Baevski, R. Collobert, A. Mohamed, M. Auli, Unsupervised cross-lingual representation learning for speech recognition. arXiv preprint arXiv:2006.13979 (2020)
S. Dalmia, R. Sanabria, F. Metze, A.W. Black, Sequence-based multi-lingual low resource speech recognition, in ICASSP (2018), pp. 4909–4913. https://doi.org/10.1109/ICASSP.2018.8461802
J. Devlin, M.W. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1 (Long and Short Papers) (2019), pp. 4171–4186. https://doi.org/10.18653/v1/N19-1423
S. Feng, T. Lee, Exploiting cross-lingual speaker and phonetic diversity for unsupervised subword modeling. IEEE/ACM Trans. Audio Speech Lang. Process. 27(12), 2000–2011 (2019). https://doi.org/10.1109/TASLP.2019.2937953
A. Ghoshal, P. Swietojanski, S. Renals, Multilingual training of deep neural networks, in ICASSP (2013), pp. 7319–7323. https://doi.org/10.1109/ICASSP.2013.6639084
A. Graves, S. Fernández, F. Gomez, J. Schmidhuber, Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, in Proceedings of the 23rd International Conference on Machine Learning (2006), pp. 369–376
J.B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar, B. Piot, K. Kavukcuoglu, R. Munos, M. Valko, Bootstrap your own latent—a new approach to self-supervised learning. Adv. Neural Inf. Process. Syst. 33, 21271–21284 (2020)
J. Huang, J. Li, D. Yu, L. Deng, Y. Gong, Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers, in ICASSP (2013), pp. 7304–7308. https://doi.org/10.1109/ICASSP.2013.6639081
J. Kahn, A. Lee, A. Hannun, Self-training for end-to-end speech recognition, in ICASSP 2020 (2020), pp. 7084–7088. https://doi.org/10.1109/ICASSP40776.2020.9054295
M. Karafiát, M.K. Baskar, S. Watanabe, T. Hori, M. Wiesner, J. Cernocký, Analysis of multilingual sequence-to-sequence speech recognition systems, in Proceedings of Interspeech 2019 (2019), pp. 2220–2224. https://doi.org/10.21437/Interspeech.2019-2355
S. Khurana, A. Laurent, W.N. Hsu, J. Chorowski, A. Lancucki, R. Marxer, J. Glass, A convolutional deep Markov model for unsupervised speech representation learning, in Proceedings of Interspeech 2020 (2020), pp. 3790–3794. https://doi.org/10.21437/Interspeech.2020-3084
S. Khurana, N. Moritz, T. Hori, J.L. Roux, Unsupervised domain adaptation for speech recognition via uncertainty driven self-training, in ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021), pp. 6553–6557. https://doi.org/10.1109/ICASSP39728.2021.9414299
P. Lal, S. King, Cross-lingual automatic speech recognition using tandem features. IEEE Trans. Audio Speech Lang. Process. 21(12), 2506–2515 (2013). https://doi.org/10.1109/TASL.2013.2277932
J. Li, M.L. Seltzer, X. Wang, R. Zhao, Y. Gong, Large-scale domain adaptation via teacher-student learning, in Proceedings of Interspeech 2017 (2017), pp. 2386–2390. https://doi.org/10.21437/Interspeech.2017-519
J. Li, Y. Wu, Y. Gaur, C. Wang, R. Zhao, S. Liu, On the comparison of popular end-to-end models for large scale speech recognition, in Proceedings of Interspeech 2020 (2020), pp. 1–5. https://doi.org/10.21437/Interspeech.2020-2846
J. Li, R. Zhao, J.T. Huang, Y. Gong, Learning small-size DNN with output-distribution-based criteria, in Interspeech (2014), pp. 1910–1914
S. Ling, Y. Liu, DeCoAR 2.0: deep contextualized acoustic representations with vector quantization. arXiv preprint arXiv:2012.06659 (2020)
S. Ling, Y. Liu, J. Salazar, K. Kirchhoff, Deep contextualized acoustic representations for semi-supervised speech recognition, in ICASSP 2020 (2020), pp. 6429–6433. https://doi.org/10.1109/ICASSP40776.2020.9053176
A.H. Liu, Y.A. Chung, J. Glass, Non-autoregressive predictive coding for learning speech representations from local dependencies. arXiv preprint arXiv:2011.00406 (2020)
A.T. Liu, S.W. Li, H.Y. Lee, TERA: self-supervised learning of transformer encoder representation for speech. arXiv preprint arXiv:2007.06028 (2020)
A.T. Liu, S.W. Yang, P.H. Chi, P.C. Hsu, H.Y. Lee, Mockingjay: unsupervised speech representation learning with deep bidirectional transformer encoders, in ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2020), pp. 6419–6423. https://doi.org/10.1109/ICASSP40776.2020.9054458
Z. Meng, J. Li, Y. Gaur, Y. Gong, Domain adaptation via teacher-student learning for end-to-end speech recognition, in ASRU (2019), pp. 268–275. https://doi.org/10.1109/ASRU46091.2019.9003776
A. Mohamed, D. Okhonko, L. Zettlemoyer, Transformers with convolutional context for ASR. arXiv preprint arXiv:1904.11660 (2019)
D. Niizumi, D. Takeuchi, Y. Ohishi, N. Harada, K. Kashino, BYOL for audio: self-supervised learning for general-purpose audio representation. arXiv preprint arXiv:2103.06695 (2021)
A. van den Oord, Y. Li, O. Vinyals, Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, M. Auli, fairseq: a fast, extensible toolkit for sequence modeling, in Proceedings of NAACL-HLT 2019: Demonstrations (2019)
V. Panayotov, G. Chen, D. Povey, S. Khudanpur, Librispeech: an ASR corpus based on public domain audio books, in ICASSP (2015), pp. 5206–5210
D.S. Park, W. Chan, Y. Zhang, C.C. Chiu, B. Zoph, E.D. Cubuk, Q.V. Le, SpecAugment: a simple data augmentation method for automatic speech recognition, in Proceedings of Interspeech 2019 (2019), pp. 2613–2617. https://doi.org/10.21437/Interspeech.2019-2680
D.S. Park, Y. Zhang, Y. Jia, W. Han, C.C. Chiu, B. Li, Y. Wu, Q.V. Le, Improved noisy student training for automatic speech recognition, in Proceedings of Interspeech 2020 (2020), pp. 2817–2821. https://doi.org/10.21437/Interspeech.2020-1470
S. Pascual, M. Ravanelli, J. Serrà, A. Bonafonte, Y. Bengio, Learning problem-agnostic speech representations from multiple self-supervised tasks, in Proceedings of Interspeech 2019 (2019), pp. 161–165. https://doi.org/10.21437/Interspeech.2019-2605
M. Rivière, A. Joulin, P. Mazaré, E. Dupoux, Unsupervised pretraining transfers well across languages, in ICASSP 2020 (2020), pp. 7414–7418. https://doi.org/10.1109/ICASSP40776.2020.9054548
S. Schneider, A. Baevski, R. Collobert, M. Auli, wav2vec: unsupervised pre-training for speech recognition, in Proceedings of Interspeech 2019 (2019), pp. 3465–3469. https://doi.org/10.21437/Interspeech.2019-1873
T. Sercu, C. Puhrsch, B. Kingsbury, Y. LeCun, Very deep multilingual convolutional neural networks for LVCSR, in ICASSP (2016), pp. 4955–4959. https://doi.org/10.1109/ICASSP.2016.7472620
H. Shibata, T. Kato, T. Shinozaki, S. Watanabe, Composite embedding systems for ZeroSpeech2017 Track1, in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (2017), pp. 747–753. https://doi.org/10.1109/ASRU.2017.8269012
A. Stolcke, F. Grézl, M.Y. Hwang, X. Lei, N. Morgan, D. Vergyri, Cross-domain and cross-language portability of acoustic features estimated by multilayer perceptrons, in ICASSP, vol. 1 (2006). https://doi.org/10.1109/ICASSP.2006.1660022
S. Thomas, S. Ganapathy, H. Hermansky, Cross-lingual and multi-stream posterior features for low resource LVCSR systems, in Interspeech (2010)
S. Thomas, S. Ganapathy, H. Hermansky, Multilingual MLP features for low-resource LVCSR systems, in ICASSP (2012), pp. 4269–4272. https://doi.org/10.1109/ICASSP.2012.6288862
S. Thomas, M.L. Seltzer, K. Church, H. Hermansky, Deep neural network features and semi-supervised training for low resource speech recognition, in ICASSP (2013), pp. 6704–6708. https://doi.org/10.1109/ICASSP.2013.6638959
S. Tong, P.N. Garner, H. Bourlard, Cross-lingual adaptation of a CTC-based multilingual acoustic model. Speech Commun. 104, 39–46 (2018). https://doi.org/10.1016/j.specom.2018.09.001
V. Verma, A. Lamb, J. Kannala, Y. Bengio, D. Lopez-Paz, Interpolation consistency training for semi-supervised learning. arXiv preprint arXiv:1903.03825 (2019)
K. Veselý, M. Karafiát, F. Grézl, M. Janda, E. Egorova, The language-independent bottleneck features, in SLT (2012), pp. 336–341. https://doi.org/10.1109/SLT.2012.6424246
N.T. Vu, D. Imseng, D. Povey, P. Motlicek, T. Schultz, H. Bourlard, Multilingual deep neural network based acoustic modeling for rapid language adaptation, in ICASSP (2014), pp. 7639–7643. https://doi.org/10.1109/ICASSP.2014.6855086
C. Wang, Y. Wu, Y. Qian, K. Kumatani, S. Liu, F. Wei, M. Zeng, X. Huang, UniSpeech: unified speech representation learning with labeled and unlabeled data, in Proceedings of the 38th International Conference on Machine Learning, vol. 139 (PMLR, 2021), pp. 10937–10947
Y. Wang, A. Mohamed, D. Le, C. Liu, A. Xiao, J. Mahadeokar, H. Huang, A. Tjandra, X. Zhang, F. Zhang, C. Fuegen, G. Zweig, M.L. Seltzer, Transformer-based acoustic modeling for hybrid speech recognition, in ICASSP 2020 (2020), pp. 6874–6878. https://doi.org/10.1109/ICASSP40776.2020.9054345
H. Zhang, M. Cisse, Y.N. Dauphin, D. Lopez-Paz, mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017)
Acknowledgements
This work was supported by the Leading Plan of CAS (Grant No. XDC08030200).
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Zhang, ZQ., Song, Y., Wu, MH. et al. Cross-Lingual Self-training to Learn Multilingual Representation for Low-Resource Speech Recognition. Circuits Syst Signal Process 41, 6827–6843 (2022). https://doi.org/10.1007/s00034-022-02075-7