Abstract
Representation learning, or pre-training, has shown promising performance for low-resource speech recognition, which suffers from a shortage of data. Recently, self-supervised methods have achieved surprising performance for speech pre-training by effectively exploiting large amounts of unannotated data. In this paper, we propose a new pre-training framework, Cross-Lingual Self-Training (XLST), to further improve the effectiveness of multilingual representation learning. Specifically, XLST first trains a phoneme classification model on a small amount of annotated data from a non-target language and then uses it to produce initial targets for training another model on multilingual unannotated data, i.e., by maximizing the frame-level similarity between the output embeddings of the two models. Furthermore, we employ moving average and multi-view data augmentation mechanisms to better generalize the learned representations. Experimental results on downstream speech recognition tasks for five low-resource languages demonstrate the effectiveness of XLST. Specifically, by leveraging an additional 100 h of annotated English data for pre-training, the proposed XLST achieves a relative 24.8% PER reduction over state-of-the-art self-supervised methods.
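To make the pre-training objective concrete, here is a minimal sketch of a frame-level similarity loss of the kind described above, written in PyTorch; the function name, tensor shapes, choice of cosine similarity, and stop-gradient on the teacher targets are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def frame_similarity_loss(student_emb: torch.Tensor,
                          teacher_emb: torch.Tensor) -> torch.Tensor:
    """Negative mean cosine similarity between per-frame embeddings.

    student_emb, teacher_emb: (batch, frames, dim) outputs of the model
    being trained and of the network that supplies the targets.
    """
    student = F.normalize(student_emb, dim=-1)
    # Targets are treated as fixed: no gradient flows into the teacher.
    teacher = F.normalize(teacher_emb.detach(), dim=-1)
    return -(student * teacher).sum(dim=-1).mean()
```

Maximizing frame-level similarity is equivalent to minimizing this loss, so the sketch drops into a standard training loop unchanged.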
Data Availability
The training datasets used in the current study are available from the CommonVoice corpus (https://commonvoice.mozilla.org) and the LibriSpeech corpus (http://www.openslr.org/12). The evaluation datasets generated during this study are included in the published article [40].
Notes
As the older version is no longer available, we use the December 2019 release, keeping the same number of hours of data as [10] for the 11 languages. The CommonVoice corpus is publicly available at https://commonvoice.mozilla.org.
The LibriSpeech corpus is publicly available at http://www.openslr.org/12. We follow [34] to obtain the aligned frame-level phoneme labels from http://www.kaldi-asr.org/downloads/build/6/trunk/egs/librispeech.
Reproduced from the CommonVoice test data, which can be accessed via [40].
As suggested in [16], for the random-initialization experiment we add an extra MLP predictor (with the same architecture as the projector) on top of the Main Network, and the moving average mechanism (\(\lambda =0.99,\Lambda =32\)) is applied.
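As an illustration of that mechanism, the following hedged sketch updates the target network's parameters as an exponential moving average of the Main Network's, assuming \(\lambda\) is the decay coefficient and \(\Lambda\) the interval (in training steps) between updates; this reading of the two symbols, and the helper itself, are our assumptions rather than the paper's code.

```python
import torch

@torch.no_grad()
def moving_average_update(target: torch.nn.Module, main: torch.nn.Module,
                          step: int, lam: float = 0.99,
                          interval: int = 32) -> None:
    """Every `interval` steps, blend Main Network weights into the target."""
    if step % interval != 0:
        return
    for p_t, p_m in zip(target.parameters(), main.parameters()):
        # p_t <- lam * p_t + (1 - lam) * p_m
        p_t.mul_(lam).add_(p_m, alpha=1.0 - lam)
```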
References
R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, G. Weber, Common voice: a massively-multilingual speech corpus, in Proceedings of the 12th Language Resources and Evaluation Conference, European Language Resources Association (2020), pp. 4218–4222
A. Baevski, S. Schneider, M. Auli, vq-wav2vec: Self-supervised learning of discrete speech representations, in International Conference on Learning Representations (2020)
A. Baevski, Y. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: a framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 33, 12449–12460 (2020)
L. Besacier, E. Barnard, A. Karpov, T. Schultz, Automatic speech recognition for under-resourced languages: a survey. Speech Commun. 56, 85–100 (2014). https://doi.org/10.1016/j.specom.2013.07.008
T. Chen, S. Kornblith, M. Norouzi, G. Hinton, A simple framework for contrastive learning of visual representations, in Proceedings of the 37th International Conference on Machine Learning, ed. by H. Daumé III, A. Singh, vol. 119 (PMLR, 2020), pp. 1597–1607
X. Chen, K. He, Exploring simple siamese representation learning. arXiv preprint arXiv:2011.10566 (2020)
J. Cho, M.K. Baskar, R. Li, M. Wiesner, S.H. Mallidi, N. Yalta, M. Karafiát, S. Watanabe, T. Hori, Multilingual sequence-to-sequence speech recognition: architecture, transfer learning, and language modeling, in SLT (2018), pp. 521–527. https://doi.org/10.1109/SLT.2018.8639655
J. Chorowski, R.J. Weiss, S. Bengio, A. van den Oord, Unsupervised speech representation learning using wavenet autoencoders. IEEE/ACM Trans. Audio Speech Lang. Process. 27(12), 2041–2053 (2019). https://doi.org/10.1109/TASLP.2019.2938863
Y.A. Chung, W.N. Hsu, H. Tang, J. Glass, An unsupervised autoregressive model for speech representation learning, in Proceedings of Interspeech 2019 (2019), pp. 146–150
A. Conneau, A. Baevski, R. Collobert, A. Mohamed, M. Auli, Unsupervised cross-lingual representation learning for speech recognition. arXiv preprint arXiv:2006.13979 (2020)
S. Dalmia, R. Sanabria, F. Metze, A.W. Black, Sequence-based multi-lingual low resource speech recognition, in ICASSP (2018), pp. 4909–4913. https://doi.org/10.1109/ICASSP.2018.8461802
J. Devlin, M.W. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1 (Long and Short Papers) (2019), pp. 4171–4186. https://doi.org/10.18653/v1/N19-1423
S. Feng, T. Lee, Exploiting cross-lingual speaker and phonetic diversity for unsupervised subword modeling. IEEE/ACM Trans. Audio Speech Lang. Process. 27(12), 2000–2011 (2019). https://doi.org/10.1109/TASLP.2019.2937953
A. Ghoshal, P. Swietojanski, S. Renals, Multilingual training of deep neural networks, in ICASSP (2013), pp. 7319–7323. https://doi.org/10.1109/ICASSP.2013.6639084
A. Graves, S. Fernández, F. Gomez, J. Schmidhuber, Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, in Proceedings of the 23rd International Conference on Machine Learning (2006), pp. 369–376
J.B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar, B. Piot, K. Kavukcuoglu, R. Munos, M. Valko, Bootstrap your own latent—a new approach to self-supervised learning. Adv. Neural Inf. Process. Syst. 33, 21271–21284 (2020)
J. Huang, J. Li, D. Yu, L. Deng, Y. Gong, Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers, in ICASSP (2013), pp. 7304–7308. https://doi.org/10.1109/ICASSP.2013.6639081
J. Kahn, A. Lee, A. Hannun, Self-training for end-to-end speech recognition, in ICASSP 2020 (2020), pp. 7084–7088. https://doi.org/10.1109/ICASSP40776.2020.9054295
M. Karafiát, M.K. Baskar, S. Watanabe, T. Hori, M. Wiesner, J. Cernocký, Analysis of multilingual sequence-to-sequence speech recognition systems, in Proceedings of Interspeech 2019 (2019), pp. 2220–2224. https://doi.org/10.21437/Interspeech.2019-2355
S. Khurana, A. Laurent, W.N. Hsu, J. Chorowski, A. Lancucki, R. Marxer, J. Glass, A convolutional deep Markov model for unsupervised speech representation learning, in Proceedings of Interspeech 2020 (2020), pp. 3790–3794. https://doi.org/10.21437/Interspeech.2020-3084
S. Khurana, N. Moritz, T. Hori, J.L. Roux, Unsupervised domain adaptation for speech recognition via uncertainty driven self-training, in ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021), pp. 6553–6557. https://doi.org/10.1109/ICASSP39728.2021.9414299
P. Lal, S. King, Cross-lingual automatic speech recognition using tandem features. IEEE Trans. Audio Speech Lang. Process. 21(12), 2506–2515 (2013). https://doi.org/10.1109/TASL.2013.2277932
J. Li, M.L. Seltzer, X. Wang, R. Zhao, Y. Gong, Large-scale domain adaptation via teacher-student learning, in Proceedings of Interspeech 2017 (2017), pp. 2386–2390. https://doi.org/10.21437/Interspeech.2017-519
J. Li, Y. Wu, Y. Gaur, C. Wang, R. Zhao, S. Liu, On the comparison of popular end-to-end models for large scale speech recognition, in Proceedings of Interspeech 2020 (2020), pp. 1–5. https://doi.org/10.21437/Interspeech.2020-2846
J. Li, R. Zhao, J.T. Huang, Y. Gong, Learning small-size DNN with output-distribution-based criteria, in Interspeech (2014), pp. 1910–1914
S. Ling, Y. Liu, DeCoAR 2.0: deep contextualized acoustic representations with vector quantization. arXiv preprint arXiv:2012.06659 (2020)
S. Ling, Y. Liu, J. Salazar, K. Kirchhoff, Deep contextualized acoustic representations for semi-supervised speech recognition, in ICASSP 2020 (2020), pp. 6429–6433. https://doi.org/10.1109/ICASSP40776.2020.9053176
A.H. Liu, Y.A. Chung, J. Glass, Non-autoregressive predictive coding for learning speech representations from local dependencies. arXiv preprint arXiv:2011.00406 (2020)
A.T. Liu, S.W. Li, H.Y. Lee, TERA: self-supervised learning of transformer encoder representation for speech. arXiv preprint arXiv:2007.06028 (2020)
A.T. Liu, S.W. Yang, P.H. Chi, P.C. Hsu, H.Y. Lee, Mockingjay: unsupervised speech representation learning with deep bidirectional transformer encoders, in ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2020), pp. 6419–6423. https://doi.org/10.1109/ICASSP40776.2020.9054458
Z. Meng, J. Li, Y. Gaur, Y. Gong, Domain adaptation via teacher-student learning for end-to-end speech recognition, in ASRU (2019), pp. 268–275. https://doi.org/10.1109/ASRU46091.2019.9003776
A. Mohamed, D. Okhonko, L. Zettlemoyer, Transformers with convolutional context for ASR. arXiv preprint arXiv:1904.11660 (2019)
D. Niizumi, D. Takeuchi, Y. Ohishi, N. Harada, K. Kashino, BYOL for audio: self-supervised learning for general-purpose audio representation. arXiv preprint arXiv:2103.06695 (2021)
A. van den Oord, Y. Li, O. Vinyals, Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, M. Auli, fairseq: a fast, extensible toolkit for sequence modeling, in Proceedings of NAACL-HLT 2019: Demonstrations (2019)
V. Panayotov, G. Chen, D. Povey, S. Khudanpur, Librispeech: an ASR corpus based on public domain audio books, in ICASSP (2015), pp. 5206–5210
D.S. Park, W. Chan, Y. Zhang, C.C. Chiu, B. Zoph, E.D. Cubuk, Q.V. Le, SpecAugment: a simple data augmentation method for automatic speech recognition, in Proceedings of Interspeech 2019 (2019), pp. 2613–2617. https://doi.org/10.21437/Interspeech.2019-2680
D.S. Park, Y. Zhang, Y. Jia, W. Han, C.C. Chiu, B. Li, Y. Wu, Q.V. Le, Improved noisy student training for automatic speech recognition, in Proceedings of Interspeech 2020 (2020), pp. 2817–2821. https://doi.org/10.21437/Interspeech.2020-1470
S. Pascual, M. Ravanelli, J. Serrà, A. Bonafonte, Y. Bengio, Learning problem-agnostic speech representations from multiple self-supervised tasks, in Proceedings of Interspeech 2019 (2019), pp. 161–165. https://doi.org/10.21437/Interspeech.2019-2605
M. Rivière, A. Joulin, P. Mazaré, E. Dupoux, Unsupervised pretraining transfers well across languages, in ICASSP 2020 (2020), pp. 7414–7418. https://doi.org/10.1109/ICASSP40776.2020.9054548
S. Schneider, A. Baevski, R. Collobert, M. Auli, wav2vec: unsupervised pre-training for speech recognition, in Proceedings of Interspeech 2019 (2019), pp. 3465–3469. https://doi.org/10.21437/Interspeech.2019-1873
T. Sercu, C. Puhrsch, B. Kingsbury, Y. LeCun, Very deep multilingual convolutional neural networks for LVCSR, in ICASSP (2016), pp. 4955–4959. https://doi.org/10.1109/ICASSP.2016.7472620
H. Shibata, T. Kato, T. Shinozaki, S. Watanabe, Composite embedding systems for ZeroSpeech2017 Track1, in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (2017), pp. 747–753. https://doi.org/10.1109/ASRU.2017.8269012
A. Stolcke, F. Grézl, M.Y. Hwang, X. Lei, N. Morgan, D. Vergyri, Cross-domain and cross-language portability of acoustic features estimated by multilayer perceptrons, in ICASSP, vol. 1 (2006). https://doi.org/10.1109/ICASSP.2006.1660022
S. Thomas, S. Ganapathy, H. Hermansky, Cross-lingual and multi-stream posterior features for low resource LVCSR systems, in Interspeech (2010)
S. Thomas, S. Ganapathy, H. Hermansky, Multilingual MLP features for low-resource LVCSR systems, in ICASSP (2012), pp. 4269–4272. https://doi.org/10.1109/ICASSP.2012.6288862
S. Thomas, M.L. Seltzer, K. Church, H. Hermansky, Deep neural network features and semi-supervised training for low resource speech recognition, in ICASSP (2013), pp. 6704–6708. https://doi.org/10.1109/ICASSP.2013.6638959
S. Tong, P.N. Garner, H. Bourlard, Cross-lingual adaptation of a CTC-based multilingual acoustic model. Speech Commun. 104, 39–46 (2018). https://doi.org/10.1016/j.specom.2018.09.001
V. Verma, A. Lamb, J. Kannala, Y. Bengio, D. Lopez-Paz, Interpolation consistency training for semi-supervised learning. arXiv preprint arXiv:1903.03825 (2019)
K. Veselý, M. Karafiát, F. Grézl, M. Janda, E. Egorova, The language-independent bottleneck features, in SLT (2012), pp. 336–341. https://doi.org/10.1109/SLT.2012.6424246
N.T. Vu, D. Imseng, D. Povey, P. Motlicek, T. Schultz, H. Bourlard, Multilingual deep neural network based acoustic modeling for rapid language adaptation, in ICASSP (2014), pp. 7639–7643. https://doi.org/10.1109/ICASSP.2014.6855086
C. Wang, Y. Wu, Y. Qian, K. Kumatani, S. Liu, F. Wei, M. Zeng, X. Huang, UniSpeech: unified speech representation learning with labeled and unlabeled data, in Proceedings of the 38th International Conference on Machine Learning, vol. 139 (PMLR, 2021), pp. 10937–10947
Y. Wang, A. Mohamed, D. Le, C. Liu, A. Xiao, J. Mahadeokar, H. Huang, A. Tjandra, X. Zhang, F. Zhang, C. Fuegen, G. Zweig, M.L. Seltzer, Transformer-based acoustic modeling for hybrid speech recognition, in ICASSP 2020 (2020), pp. 6874–6878. https://doi.org/10.1109/ICASSP40776.2020.9054345
H. Zhang, M. Cisse, Y.N. Dauphin, D. Lopez-Paz, mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017)
Acknowledgements
This work was supported by the Leading Plan of CAS (Grant No. XDC08030200).
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Zhang, ZQ., Song, Y., Wu, MH. et al. Cross-Lingual Self-training to Learn Multilingual Representation for Low-Resource Speech Recognition. Circuits Syst Signal Process 41, 6827–6843 (2022). https://doi.org/10.1007/s00034-022-02075-7