
Multi-task Learning with Auxiliary Cross-attention Transformer for Low-Resource Multi-dialect Speech Recognition

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13551)

Abstract

In this paper, we apply multi-task learning to low-resource multi-dialect speech recognition and propose a method that combines the Transformer with soft-parameter-sharing multi-task learning. Our model has two task streams: a primary task stream that recognizes speech and an auxiliary task stream that identifies the dialect. The auxiliary task stream feeds dialect identification information into the auxiliary cross-attention of the primary task stream, giving the primary task stream the ability to discriminate between dialects. Experimental results on Tibetan multi-dialect speech recognition show that our model outperforms both the single-dialect model and the hard-parameter-sharing multi-dialect model, reducing the average syllable error rate (ASER) by 30.22% and 3.89%, respectively.
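The abstract describes the architecture only at a high level. As a rough illustration of the idea, the following is a minimal PyTorch-style sketch of one primary-stream Transformer block with an auxiliary cross-attention sub-layer, plus the joint loss implied by multi-task training; all module names, dimensions, and the loss weight are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class PrimaryStreamBlock(nn.Module):
    """Hypothetical primary-stream (ASR) block: standard self-attention,
    followed by an auxiliary cross-attention sub-layer that queries the
    auxiliary (dialect-ID) stream, injecting dialect information into
    the recognition stream."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.aux_cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, asr_h: torch.Tensor, dialect_h: torch.Tensor) -> torch.Tensor:
        # asr_h:     (batch, T, d_model) hidden states of the primary (ASR) stream
        # dialect_h: (batch, S, d_model) hidden states of the auxiliary (dialect) stream
        x = self.norm1(asr_h + self.self_attn(asr_h, asr_h, asr_h)[0])
        # Queries come from the ASR stream; keys/values carry dialect identity,
        # so the recognizer can condition on which dialect is being spoken.
        x = self.norm2(x + self.aux_cross_attn(x, dialect_h, dialect_h)[0])
        return self.norm3(x + self.ffn(x))


def multitask_loss(asr_loss: torch.Tensor, dialect_loss: torch.Tensor,
                   lam: float = 0.3) -> torch.Tensor:
    # Soft parameter sharing: each stream keeps its own parameters and the
    # streams interact only through the cross-attention above; training
    # jointly minimizes a weighted sum of both task losses. The weight
    # lam is an arbitrary placeholder, not a value from the paper.
    return asr_loss + lam * dialect_loss
```

In this sketch both streams use the same model width, so the cross-attention needs no extra projection; that choice, like everything above, is an assumption made to keep the example self-contained.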



Author information

Correspondence to Yue Zhao.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Dan, Z., Zhao, Y., Bi, X., Wu, L., Ji, Q. (2022). Multi-task Learning with Auxiliary Cross-attention Transformer for Low-Resource Multi-dialect Speech Recognition. In: Lu, W., Huang, S., Hong, Y., Zhou, X. (eds) Natural Language Processing and Chinese Computing. NLPCC 2022. Lecture Notes in Computer Science (LNAI), vol. 13551. Springer, Cham. https://doi.org/10.1007/978-3-031-17120-8_9


  • DOI: https://doi.org/10.1007/978-3-031-17120-8_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-17119-2

  • Online ISBN: 978-3-031-17120-8

  • eBook Packages: Computer Science (R0)
