
Cross-Lingual Self-training to Learn Multilingual Representation for Low-Resource Speech Recognition


Abstract

Representation learning or pre-training has shown promising performance for low-resource speech recognition, which suffers from data shortage. Recently, self-supervised methods have achieved surprising performance for speech pre-training by effectively utilizing large amounts of un-annotated data. In this paper, we propose a new pre-training framework, Cross-Lingual Self-Training (XLST), to further improve the effectiveness of multilingual representation learning. Specifically, XLST first trains a phoneme classification model with a small amount of annotated data of a non-target language and then uses it to produce initial targets for training another model on multilingual un-annotated data, i.e., by maximizing the frame-level similarity between the output embeddings of the two models. Furthermore, we employ the moving average and multi-view data augmentation mechanisms to better generalize the learned representations. Experimental results on downstream speech recognition tasks for 5 low-resource languages demonstrate the effectiveness of XLST. Specifically, leveraging an additional 100 h of annotated English data for pre-training, the proposed XLST achieves a relative 24.8% PER reduction over state-of-the-art self-supervised methods.
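The two-stage pipeline described in the abstract can be summarized in a short, hedged sketch: a Target Network (the phoneme classifier trained on annotated non-target-language data) produces frame-level embeddings that serve as targets, and a Main Network is trained on multilingual un-annotated speech to match them frame by frame. The module names, dimensions, and the cosine-similarity form of the loss below are illustrative assumptions, not the authors' released implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FrameEncoder(nn.Module):
        # Toy frame-level encoder standing in for both the Main and Target Networks.
        def __init__(self, feat_dim=80, embed_dim=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(feat_dim, embed_dim), nn.ReLU(),
                nn.Linear(embed_dim, embed_dim),
            )

        def forward(self, x):            # x: (batch, frames, feat_dim)
            return self.net(x)           # (batch, frames, embed_dim)

    def frame_similarity_loss(main_emb, target_emb):
        # Maximize frame-level cosine similarity between the two sets of embeddings.
        main_emb = F.normalize(main_emb, dim=-1)
        target_emb = F.normalize(target_emb, dim=-1)
        return -(main_emb * target_emb).sum(dim=-1).mean()

    # Step 1 (assumed done beforehand): target_net is trained as a phoneme
    # classifier on a small amount of annotated non-target-language data.
    main_net, target_net = FrameEncoder(), FrameEncoder()

    # Step 2: the Main Network is trained on multilingual un-annotated speech
    # to match the frame-level targets produced by the Target Network.
    feats = torch.randn(4, 100, 80)      # a batch of un-annotated feature frames
    with torch.no_grad():
        targets = target_net(feats)      # initial frame-level targets
    loss = frame_similarity_loss(main_net(feats), targets)
    loss.backward()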


Data Availability

The training datasets generated during the current study are available in the CommonVoice corpus (https://commonvoice.mozilla.org) and the LibriSpeech corpus (http://www.openslr.org/12). The evaluation datasets generated during this study are included in this published article [40].

Notes

  1. As the older version is no longer available, we use the December 2019 release, maintaining the same number of hours of data as [10] for 11 languages. The CommonVoice corpus is publicly available at https://commonvoice.mozilla.org.

  2. The LibriSpeech corpus is publicly available at http://www.openslr.org/12. We follow [34] to obtain the aligned frame-level phoneme labels from http://www.kaldi-asr.org/downloads/build/6/trunk/egs/librispeech.

  3. Reproduced from the CommonVoice test data, which can be accessed in [40].

  4. As suggested in [16], for the random-initialization experiment we add an extra MLP predictor (same architecture as the projector) on top of the Main Network, and the moving average mechanism (\(\lambda =0.99,\Lambda =32\)) is applied; a minimal sketch of this setup follows these notes.
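As referenced in Note 4, the following is a minimal, hedged sketch of the moving-average mechanism and the extra predictor. Reading \(\Lambda =32\) as an update interval in training steps, and the predictor/projector shapes, are assumptions made only for illustration.

    import copy
    import torch
    import torch.nn as nn

    lam, Lam = 0.99, 32                    # lambda: EMA decay; Lambda: assumed update interval (steps)

    main_net = nn.Linear(256, 256)         # stand-in for the Main Network trunk
    target_net = copy.deepcopy(main_net)   # Target Network tracks it via moving average
    predictor = nn.Sequential(             # extra MLP predictor for the random-init case,
        nn.Linear(256, 256), nn.ReLU(),    # assumed to mirror the projector architecture
        nn.Linear(256, 256),
    )

    @torch.no_grad()
    def moving_average_update(step):
        # Every Lambda training steps, blend Main Network weights into the Target Network.
        if step % Lam == 0:
            for p_t, p_m in zip(target_net.parameters(), main_net.parameters()):
                p_t.mul_(lam).add_(p_m, alpha=1.0 - lam)

    # Usage: Main Network outputs pass through the predictor before the loss;
    # the Target Network is refreshed periodically during training.
    out = predictor(main_net(torch.randn(4, 256)))
    moving_average_update(step=32)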

References

  1. R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, G. Weber, Common voice: a massively-multilingual speech corpus, in Proceedings of the 12th Language Resources and Evaluation Conference, European Language Resources Association (2020), pp. 4218–4222

  2. A. Baevski, S. Schneider, M. Auli, vq-wav2vec: self-supervised learning of discrete speech representations, in International Conference on Learning Representations (2020)

  3. A. Baevski, Y. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: a framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 33, 12449–12460 (2020)


  4. L. Besacier, E. Barnard, A. Karpov, T. Schultz, Automatic speech recognition for under-resourced languages: a survey. Speech Commun. 56, 85–100 (2014). https://doi.org/10.1016/j.specom.2013.07.008


  5. T. Chen, S. Kornblith, M. Norouzi, G. Hinton, A simple framework for contrastive learning of visual representations, in Proceedings of the 37th International Conference on Machine Learning, ed. by H. Daumé III, A. Singh, vol. 119 (PMLR, 2020), pp. 1597–1607

  6. X. Chen, K. He, Exploring simple siamese representation learning. arXiv preprint arXiv:2011.10566 (2020)

  7. J. Cho, M.K. Baskar, R. Li, M. Wiesner, S.H. Mallidi, N. Yalta, M. Karafiát, S. Watanabe, T. Hori, Multilingual sequence-to-sequence speech recognition: architecture, transfer learning, and language modeling, in SLT (2018), pp. 521–527. https://doi.org/10.1109/SLT.2018.8639655

  8. J. Chorowski, R.J. Weiss, S. Bengio, A. van den Oord, Unsupervised speech representation learning using wavenet autoencoders. IEEE/ACM Trans. Audio Speech Lang. Process. 27(12), 2041–2053 (2019). https://doi.org/10.1109/TASLP.2019.2938863


  9. Y.A. Chung, W.N. Hsu, H. Tang, J. Glass, An unsupervised autoregressive model for speech representation learning, in Proceedings of Interspeech 2019 (2019), pp. 146–150

  10. A. Conneau, A. Baevski, R. Collobert, A. Mohamed, M. Auli, Unsupervised cross-lingual representation learning for speech recognition. arXiv preprint arXiv:2006.13979 (2020)

  11. S. Dalmia, R. Sanabria, F. Metze, A.W. Black, Sequence-based multi-lingual low resource speech recognition, in ICASSP (2018), pp. 4909–4913. https://doi.org/10.1109/ICASSP.2018.8461802

  12. J. Devlin, M.W. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1 (Long and Short Papers) (2019), pp. 4171–4186. https://doi.org/10.18653/v1/N19-1423

  13. S. Feng, T. Lee, Exploiting cross-lingual speaker and phonetic diversity for unsupervised subword modeling. IEEE/ACM Trans. Audio Speech Lang. Process. 27(12), 2000–2011 (2019). https://doi.org/10.1109/TASLP.2019.2937953


  14. A. Ghoshal, P. Swietojanski, S. Renals, Multilingual training of deep neural networks, in ICASSP (2013), pp. 7319–7323. https://doi.org/10.1109/ICASSP.2013.6639084

  15. A. Graves, S. Fernández, F. Gomez, J. Schmidhuber, Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, in Proceedings of the 23rd International Conference on Machine Learning (2006), pp. 369–376

  16. J.B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar, B. Piot, K. Kavukcuoglu, R. Munos, M. Valko, Bootstrap your own latent—a new approach to self-supervised learning. Adv. Neural Inf. Process. Syst. 33, 21271–21284 (2020)


  17. J. Huang, J. Li, D. Yu, L. Deng, Y. Gong, Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers, in ICASSP (2013), pp. 7304–7308. https://doi.org/10.1109/ICASSP.2013.6639081

  18. J. Kahn, A. Lee, A. Hannun, Self-training for end-to-end speech recognition, in ICASSP 2020 (2020), pp. 7084–7088. https://doi.org/10.1109/ICASSP40776.2020.9054295

  19. M. Karafiát, M.K. Baskar, S. Watanabe, T. Hori, M. Wiesner, J. Cernocký, Analysis of multilingual sequence-to-sequence speech recognition systems, in Proceedings of Interspeech 2019 (2019), pp. 2220–2224. https://doi.org/10.21437/Interspeech.2019-2355

  20. S. Khurana, A. Laurent, W.N. Hsu, J. Chorowski, A. Lancucki, R. Marxer, J. Glass, A convolutional deep Markov model for unsupervised speech representation learning, in Proceedings of Interspeech 2020 (2020), pp. 3790–3794. https://doi.org/10.21437/Interspeech.2020-3084

  21. S. Khurana, N. Moritz, T. Hori, J.L. Roux, Unsupervised domain adaptation for speech recognition via uncertainty driven self-training, in ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021), pp. 6553–6557. https://doi.org/10.1109/ICASSP39728.2021.9414299

  22. P. Lal, S. King, Cross-lingual automatic speech recognition using tandem features. IEEE Trans. Audio Speech Lang. Process. 21(12), 2506–2515 (2013). https://doi.org/10.1109/TASL.2013.2277932


  23. J. Li, M.L. Seltzer, X. Wang, R. Zhao, Y. Gong, Large-scale domain adaptation via teacher-student learning, in Proceedings of Interspeech 2017 (2017), pp. 2386–2390. https://doi.org/10.21437/Interspeech.2017-519

  24. J. Li, Y. Wu, Y. Gaur, C. Wang, R. Zhao, S. Liu, On the comparison of popular end-to-end models for large scale speech recognition, in Proceedings of Interspeech 2020 (2020), pp. 1–5. https://doi.org/10.21437/Interspeech.2020-2846

  25. J. Li, R. Zhao, J.T. Huang, Y. Gong, Learning small-size DNN with output-distribution-based criteria, in Interspeech (2014), pp. 1910–1914

  26. S. Ling, Y. Liu, Decoar 2.0: deep contextualized acoustic representations with vector quantization. arXiv preprint arXiv:2012.06659 (2020)

  27. S. Ling, Y. Liu, J. Salazar, K. Kirchhoff, Deep contextualized acoustic representations for semi-supervised speech recognition, in ICASSP 2020 (2020), pp. 6429–6433. https://doi.org/10.1109/ICASSP40776.2020.9053176

  28. A.H. Liu, Y.A. Chung, J. Glass, Non-autoregressive predictive coding for learning speech representations from local dependencies. arXiv preprint arXiv:2011.00406 (2020)

  29. A.T. Liu, S.W. Li, H.Y. Lee, Tera: self-supervised learning of transformer encoder representation for speech. arXiv preprint arXiv:2007.06028 (2020)

  30. A.T. Liu, S.W. Yang, P.H. Chi, P.C. Hsu, H.Y. Lee, Mockingjay: unsupervised speech representation learning with deep bidirectional transformer encoders, in ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2020), pp. 6419–6423. https://doi.org/10.1109/ICASSP40776.2020.9054458

  31. Z. Meng, J. Li, Y. Gaur, Y. Gong, Domain adaptation via teacher-student learning for end-to-end speech recognition, in ASRU (2019), pp. 268–275. https://doi.org/10.1109/ASRU46091.2019.9003776

  32. A. Mohamed, D. Okhonko, L. Zettlemoyer, Transformers with convolutional context for ASR. arXiv preprint arXiv:1904.11660 (2019)

  33. D. Niizumi, D. Takeuchi, Y. Ohishi, N. Harada, K. Kashino, Byol for audio: self-supervised learning for general-purpose audio representation. arXiv preprint arXiv:2103.06695 (2021)

  34. A. van den Oord, Y. Li, O. Vinyals, Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)

  35. M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, M. Auli, fairseq: a fast, extensible toolkit for sequence modeling, in Proceedings of NAACL-HLT 2019: Demonstrations (2019)

  36. V. Panayotov, G. Chen, D. Povey, S. Khudanpur, Librispeech: an ASR corpus based on public domain audio books, in ICASSP (2015), pp. 5206–5210

  37. D.S. Park, W. Chan, Y. Zhang, C.C. Chiu, B. Zoph, E.D. Cubuk, Q.V. Le, SpecAugment: a simple data augmentation method for automatic speech recognition, in Proceedings of Interspeech 2019 (2019), pp. 2613–2617. https://doi.org/10.21437/Interspeech.2019-2680

  38. D.S. Park, Y. Zhang, Y. Jia, W. Han, C.C. Chiu, B. Li, Y. Wu, Q.V. Le, Improved noisy student training for automatic speech recognition, in Proceedings of Interspeech 2020 (2020), pp. 2817–2821. https://doi.org/10.21437/Interspeech.2020-1470

  39. S. Pascual, M. Ravanelli, J. Serrà, A. Bonafonte, Y. Bengio, Learning problem-agnostic speech representations from multiple self-supervised tasks, in Proceedings of Interspeech 2019 (2019), pp. 161–165. https://doi.org/10.21437/Interspeech.2019-2605

  40. M. Rivière, A. Joulin, P. Mazaré, E. Dupoux, Unsupervised pretraining transfers well across languages, in ICASSP 2020 (2020), pp. 7414–7418. https://doi.org/10.1109/ICASSP40776.2020.9054548

  41. S. Schneider, A. Baevski, R. Collobert, M. Auli, wav2vec: unsupervised pre-training for speech recognition, in Proceedings of Interspeech 2019 (2019), pp. 3465–3469. https://doi.org/10.21437/Interspeech.2019-1873

  42. T. Sercu, C. Puhrsch, B. Kingsbury, Y. LeCun, Very deep multilingual convolutional neural networks for LVCSR, in ICASSP (2016), pp. 4955–4959. https://doi.org/10.1109/ICASSP.2016.7472620

  43. H. Shibata, T. Kato, T. Shinozaki, S. Watanabet, Composite embedding systems for zerospeech2017 track1, in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (2017), pp. 747–753. https://doi.org/10.1109/ASRU.2017.8269012

  44. A. Stolcke, F. Grézl, M.Y. Hwang, X. Lei, N. Morgan, D. Vergyri, Cross-domain and cross-language portability of acoustic features estimated by multilayer perceptrons, in ICASSP, vol. 1 (2006). https://doi.org/10.1109/ICASSP.2006.1660022

  45. S. Thomas, S. Ganapathy, H. Hermansky, Cross-lingual and multi-stream posterior features for low resource LVCSR systems, in Interspeech (2010)

  46. S. Thomas, S. Ganapathy, H. Hermansky, Multilingual MLP features for low-resource LVCSR systems, in ICASSP (2012), pp. 4269–4272. https://doi.org/10.1109/ICASSP.2012.6288862

  47. S. Thomas, M.L. Seltzer, K. Church, H. Hermansky, Deep neural network features and semi-supervised training for low resource speech recognition, in ICASSP (2013), pp. 6704–6708. https://doi.org/10.1109/ICASSP.2013.6638959

  48. S. Tong, P.N. Garner, H. Bourlard, Cross-lingual adaptation of a CTC-based multilingual acoustic model. Speech Commun. 104, 39–46 (2018). https://doi.org/10.1016/j.specom.2018.09.001


  49. V. Verma, A. Lamb, J. Kannala, Y. Bengio, D. Lopez-Paz, Interpolation consistency training for semi-supervised learning. arXiv preprint arXiv:1903.03825 (2019)

  50. K. Veselý, M. Karafiát, F. Grézl, M. Janda, E. Egorova, The language-independent bottleneck features, in SLT (2012), pp. 336–341. https://doi.org/10.1109/SLT.2012.6424246

  51. N.T. Vu, D. Imseng, D. Povey, P. Motlicek, T. Schultz, H. Bourlard, Multilingual deep neural network based acoustic modeling for rapid language adaptation, in ICASSP (2014), pp. 7639–7643. https://doi.org/10.1109/ICASSP.2014.6855086

  52. C. Wang, Y. Wu, Y. Qian, K. Kumatani, S. Liu, F. Wei, M. Zeng, X. Huang, Unispeech: unified speech representation learning with labeled and unlabeled data, in Proceedings of the 38th International Conference on Machine Learning, vol. 139. PMLR (2021), pp. 10937–10947

  53. Y. Wang, A. Mohamed, D. Le, C. Liu, A. Xiao, J. Mahadeokar, H. Huang, A. Tjandra, X. Zhang, F. Zhang, C. Fuegen, G. Zweig, M.L. Seltzer, Transformer-based acoustic modeling for hybrid speech recognition, in ICASSP 2020 (2020), pp. 6874–6878. https://doi.org/10.1109/ICASSP40776.2020.9054345

  54. H. Zhang, M. Cisse, Y.N. Dauphin, D. Lopez-Paz, mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017)


Acknowledgements

This work was supported by the Leading Plan of CAS (Grant No. XDC08030200).

Author information


Corresponding author

Correspondence to Yan Song.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Cite this article

Zhang, ZQ., Song, Y., Wu, MH. et al. Cross-Lingual Self-training to Learn Multilingual Representation for Low-Resource Speech Recognition. Circuits Syst Signal Process 41, 6827–6843 (2022). https://doi.org/10.1007/s00034-022-02075-7

