Abstract
In this paper, we apply multi-task learning to low-resource multi-dialect speech recognition and propose a method that combines the Transformer with soft-parameter-sharing multi-task learning. Our model has two task streams: a primary task stream that recognizes speech and an auxiliary task stream that identifies the dialect. The auxiliary task stream supplies dialect identification information to the auxiliary cross-attention of the primary task stream, giving the primary task stream the ability to discriminate between dialects. Experimental results on Tibetan multi-dialect speech recognition show that our model outperforms both a single-dialect model and a hard-parameter-sharing multi-dialect model, reducing the average syllable error rate (ASER) by 30.22% and 3.89%, respectively.
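The following is a minimal PyTorch sketch of the two-stream design the abstract describes: a decoder layer in the primary (speech recognition) stream that, in addition to the usual cross-attention over the acoustic encoder, carries an auxiliary cross-attention over features produced by the dialect-identification stream. All module names, dimensions, and wiring details here are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

class AuxCrossAttentionDecoderLayer(nn.Module):
    # Hypothetical primary-stream decoder layer: self-attention, encoder
    # cross-attention, then an extra cross-attention over the auxiliary
    # (dialect-ID) stream's hidden states. Dimensions are placeholders.
    def __init__(self, d_model=256, nhead=4, dim_ff=1024, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=True)
        # Standard cross-attention over the acoustic encoder output.
        self.enc_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=True)
        # Auxiliary cross-attention over dialect-ID features (the paper's key addition).
        self.aux_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, dim_ff), nn.ReLU(), nn.Linear(dim_ff, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])
        self.drop = nn.Dropout(dropout)

    def forward(self, tgt, enc_out, aux_out, tgt_mask=None):
        # Masked self-attention over previously emitted output tokens.
        x = self.norms[0](tgt + self.drop(self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask)[0]))
        # Attend to the shared acoustic representation (primary stream).
        x = self.norms[1](x + self.drop(self.enc_attn(x, enc_out, enc_out)[0]))
        # Inject dialect discrimination from the auxiliary stream.
        x = self.norms[2](x + self.drop(self.aux_attn(x, aux_out, aux_out)[0]))
        return self.norms[3](x + self.drop(self.ff(x)))

# Usage: enc_out comes from the primary acoustic encoder, aux_out from the
# dialect-identification stream; both are (batch, time, d_model) tensors.
layer = AuxCrossAttentionDecoderLayer()
tgt = torch.randn(2, 10, 256)       # embedded target sequence
enc_out = torch.randn(2, 50, 256)   # acoustic encoder states
aux_out = torch.randn(2, 50, 256)   # auxiliary (dialect) stream states
out = layer(tgt, enc_out, aux_out)  # -> (2, 10, 256)

In soft parameter sharing of this kind, each task keeps its own parameters and the streams interact only through the auxiliary attention, in contrast to hard parameter sharing, where the tasks share lower layers outright.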
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Dan, Z., Zhao, Y., Bi, X., Wu, L., Ji, Q. (2022). Multi-task Learning with Auxiliary Cross-attention Transformer for Low-Resource Multi-dialect Speech Recognition. In: Lu, W., Huang, S., Hong, Y., Zhou, X. (eds.) Natural Language Processing and Chinese Computing. NLPCC 2022. Lecture Notes in Computer Science, vol. 13551. Springer, Cham. https://doi.org/10.1007/978-3-031-17120-8_9
Print ISBN: 978-3-031-17119-2
Online ISBN: 978-3-031-17120-8