Abstract
In this paper, we apply multi-task learning to low-resource multi-dialect speech recognition and propose a method that combines the Transformer with soft-parameter-sharing multi-task learning. Our model has two task streams: a primary task stream that recognizes speech and an auxiliary task stream that identifies the dialect. The auxiliary task stream supplies dialect identification information to the auxiliary cross-attention of the primary task stream, giving the primary task stream the ability to discriminate between dialects. Experimental results on Tibetan multi-dialect speech recognition show that our model outperforms both a single-dialect model and a hard-parameter-sharing multi-dialect model, reducing the average syllable error rate (ASER) by 30.22% and 3.89%, respectively.
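The following is a minimal PyTorch sketch of the two-stream design the abstract describes: a decoder layer in the primary (speech recognition) stream that, in addition to the usual cross-attention over the acoustic encoder, carries an auxiliary cross-attention over features produced by the dialect-identification stream. All module names, dimensions, and wiring details here are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

class AuxCrossAttentionDecoderLayer(nn.Module):
    # Hypothetical primary-stream decoder layer: self-attention, encoder
    # cross-attention, then an extra cross-attention over the auxiliary
    # (dialect-ID) stream's hidden states. Dimensions are placeholders.
    def __init__(self, d_model=256, nhead=4, dim_ff=1024, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=True)
        # Standard cross-attention over the acoustic encoder output.
        self.enc_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=True)
        # Auxiliary cross-attention over dialect-ID features (the paper's key addition).
        self.aux_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, dim_ff), nn.ReLU(), nn.Linear(dim_ff, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])
        self.drop = nn.Dropout(dropout)

    def forward(self, tgt, enc_out, aux_out, tgt_mask=None):
        # Masked self-attention over previously emitted output tokens.
        x = self.norms[0](tgt + self.drop(self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask)[0]))
        # Attend to the shared acoustic representation (primary stream).
        x = self.norms[1](x + self.drop(self.enc_attn(x, enc_out, enc_out)[0]))
        # Inject dialect discrimination from the auxiliary stream.
        x = self.norms[2](x + self.drop(self.aux_attn(x, aux_out, aux_out)[0]))
        return self.norms[3](x + self.drop(self.ff(x)))

# Usage: enc_out comes from the primary acoustic encoder, aux_out from the
# dialect-identification stream; both are (batch, time, d_model) tensors.
layer = AuxCrossAttentionDecoderLayer()
tgt = torch.randn(2, 10, 256)       # embedded target sequence
enc_out = torch.randn(2, 50, 256)   # acoustic encoder states
aux_out = torch.randn(2, 50, 256)   # auxiliary (dialect) stream states
out = layer(tgt, enc_out, aux_out)  # -> (2, 10, 256)

In soft parameter sharing of this kind, each task keeps its own parameters and the streams interact only through the auxiliary attention, in contrast to hard parameter sharing, where the tasks share lower layers outright.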
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Dan, Z., Zhao, Y., Bi, X., Wu, L., Ji, Q. (2022). Multi-task Learning with Auxiliary Cross-attention Transformer for Low-Resource Multi-dialect Speech Recognition. In: Lu, W., Huang, S., Hong, Y., Zhou, X. (eds.) Natural Language Processing and Chinese Computing. NLPCC 2022. Lecture Notes in Computer Science, vol. 13551. Springer, Cham. https://doi.org/10.1007/978-3-031-17120-8_9
Print ISBN: 978-3-031-17119-2
Online ISBN: 978-3-031-17120-8