Multi-domain Attention Fusion Network For Language Recognition

  • Original Research
  • Published in: SN Computer Science

Abstract

Attention-based convolutional neural network models are increasingly adopted for language recognition tasks. In this paper, we address language recognition by using the self-attention mechanism to capture rich contextual dependencies. To do so, we propose a new multi-domain feature fusion network that adaptively integrates local features and their global dependencies. Specifically, we attach three attention modules to each stage of ResNet, which model semantic dependencies in the time, frequency, and channel domains, respectively. The time attention module aggregates the features at all time locations through a weighted sum of the features from the time feature map and the original features. Correspondingly, the frequency/channel attention module aggregates the features at all frequency/channel locations through a weighted sum of the features from the frequency/channel feature map and the original features. We then aggregate the outputs of the three attention modules in three ways (addition, average, and maximum) to further improve the feature representation. Experiments are conducted on the APSIPA 2017 Oriental Language Recognition (AP17-OLR) dataset and the AP20-OLR-noisy-task dataset, and our proposed method achieves state-of-the-art results on both.
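
To make the mechanism above concrete, the following is a minimal PyTorch sketch (PyTorch being the framework used in the paper's experiments [36]) of one self-attention module per domain and the three fusion modes. It is a sketch under stated assumptions, not the authors' exact architecture: the class names (`AxisAttention`, `MDAF`), the `(batch, channel, frequency, time)` feature layout, the zero-initialised residual scale `gamma`, and the omission of any query/key/value projections before the attention product are all illustrative choices.

```python
import torch
import torch.nn as nn


class AxisAttention(nn.Module):
    """Self-attention over one axis of a (B, C, F, T) feature map.

    axis=3 attends over time, axis=2 over frequency, axis=1 over channels.
    The output at each position along the chosen axis is a weighted sum of
    the features at all positions along that axis, scaled by a learnable
    coefficient and added back onto the original features.
    """

    def __init__(self, axis: int):
        super().__init__()
        assert axis in (1, 2, 3)
        self.axis = axis
        # Zero-initialised scale: each module starts as an identity mapping.
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n = x.size(0), x.size(self.axis)
        # Move the attended axis to the end; flatten the other two axes
        # into one descriptor per position along the attended axis.
        perm = [0] + [d for d in (1, 2, 3) if d != self.axis] + [self.axis]
        xp = x.permute(*perm).reshape(b, -1, n)                # (B, D, N)
        # N x N affinity between positions, row-normalised by softmax.
        attn = torch.softmax(xp.transpose(1, 2) @ xp, dim=-1)  # (B, N, N)
        out = xp @ attn.transpose(1, 2)                        # (B, D, N)
        out = out.reshape(b, x.size(perm[1]), x.size(perm[2]), n)
        out = out.permute(*[perm.index(d) for d in range(4)])  # undo permute
        # Weighted sum of the attended features and the original features.
        return self.gamma * out + x


class MDAF(nn.Module):
    """Fuse time, frequency, and channel attention by add, average, or max."""

    def __init__(self, mode: str = "avg"):
        super().__init__()
        self.time_att = AxisAttention(axis=3)
        self.freq_att = AxisAttention(axis=2)
        self.chan_att = AxisAttention(axis=1)
        self.mode = mode

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        t, f, c = self.time_att(x), self.freq_att(x), self.chan_att(x)
        if self.mode == "add":
            return t + f + c
        if self.mode == "avg":
            return (t + f + c) / 3.0
        return torch.maximum(torch.maximum(t, f), c)  # element-wise maximum


# Usage: attach after a ResNet stage that outputs a (B, C, F, T) map,
# e.g. 64 channels, 40 mel-filterbank bins, 100 frames.
x = torch.randn(2, 64, 40, 100)
y = MDAF(mode="avg")(x)
assert y.shape == x.shape
```

Starting `gamma` at zero (as in dual-attention networks [16]) means each branch initially passes its input through unchanged, which tends to stabilise training when the modules are attached to every ResNet stage.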


Availability of Data and Materials

All data used in this study are included in the APSIPA 2017 Oriental Language Recognition (AP17-OLR) dataset [27, 28] and the AP20-OLR-noisy-task dataset [29].

Abbreviations

LR:

Language recognition

OLR:

Oriental language recognition

DET:

Detection error trade-off

IDR:

Identification rate

SDA:

Single domain attention

SE-Net:

Squeeze-and-excitation network

MDAF:

Multi-domain attention fusion

MDAF-Net:

Multi-domain attention fusion network

SE:

Squeeze-and-excitation

FC:

Fully connected

ASP:

Attentive statistics pooling

AAM-Softmax:

Additive angular margin softmax

References

  1. Li H, Ma B, Lee KA. Spoken language recognition: from fundamentals to practice. Proc IEEE. 2013;101(5):1136–59.

  2. Waibel A, Geutner P, Tomokiyo LM, Schultz T, Woszczyna M. Multilinguality in speech and spoken language systems. Proc IEEE. 2000;88(8):1297–313.

  3. Miao X, McLoughlin I, Wang W, Zhang P. D-mona: a dilated mixed-order non-local attention network for speaker and language recognition. Neural Netw. 2021;139:201–11.

  4. Dehak N, Torres-Carrasquillo PA, Reynolds D, Dehak R. Language recognition via i-vectors and dimensionality reduction. In: Twelfth annual conference of the International Speech Communication Association (2011). Citeseer

  5. Dehak N, Kenny PJ, Dehak R, Dumouchel P, Ouellet P. Front-end factor analysis for speaker verification. IEEE Trans Audio Speech Lang Process. 2010;19(4):788–98.

  6. Huang J-T, Li J, Yu D, Deng L, Gong Y. Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers. In: 2013 IEEE International conference on acoustics, speech and signal processing, pp. 7304–7308 (2013). IEEE

  7. Heigold G, Moreno I, Bengio S, Shazeer N. End-to-end text-dependent speaker verification. In: 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 5115–5119 (2016). IEEE

  8. Cai W, Cai Z, Liu W, Wang X, Li M. Insights into end-to-end learning scheme for language identification. In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 5209–5213 (2018). IEEE

  9. Wu H, Cai W, Li M, Gao J, Zhang S, Lyu Z, Huang S. DKU-Tencent submission to oriental language recognition AP18-OLR challenge. In: 2019 Asia-Pacific signal and information processing association annual summit and conference (APSIPA ASC), pp. 1646–1651 (2019). IEEE

  10. Miao X, McLoughlin I, Yan Y. A new time-frequency attention mechanism for TDNN and CNN-LSTM-TDNN, with application to language identification. In: Interspeech, pp. 4080–4084 (2019)

  11. Zhou J, Jiang T, Li Z, Li L, Hong Q. Deep speaker embedding extraction with channel-wise feature responses and additive supervision softmax loss function. In: Interspeech, pp. 2883–2887 (2019)

  12. Hu J, Shen L, Sun G. Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141 (2018)

  13. Yadav S, Rai A. Frequency and temporal convolutional attention for text-independent speaker recognition. In: ICASSP 2020–2020 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 6794–6798 (2020). IEEE

  14. Miao X, McLoughlin IV, Yan Y. A new time-frequency attention mechanism for TDNN and CNN-LSTM-TDNN, with application to language identification. In: Interspeech, pp. 4080–4084 (2019)

  15. Woo S, Park J, Lee J-Y, Kweon IS. CBAM: convolutional block attention module. In: Proceedings of the European conference on computer vision (ECCV), pp. 3–19 (2018)

  16. Fu J, Liu J, Tian H, Li Y, Bao Y, Fang Z, Lu H. Dual attention network for scene segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 3146–3154 (2019)

  17. Qin Z, Zhang P, Wu F, Li X. FcaNet: frequency channel attention networks. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 783–792 (2021)

  18. Shi Y, Huang Q, Hain T. Robust speaker recognition using speech enhancement and attention model. arXiv preprint arXiv:2001.05031 (2020)

  19. Gao Z, Xie J, Wang Q, Li P. Global second-order pooling convolutional networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 3024–3033 (2019)

  20. Zhang D, Shao J, Li X, Shen HT. Remote sensing image super-resolution via mixed high-order attention network. IEEE Trans Geosci Remote Sens. 2020;59(6):5183–96.

  21. Cai W, Cai Z, Zhang X, Wang X, Li M. A novel learnable dictionary encoding layer for end-to-end language identification. In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 5189–5193 (2018). IEEE

  22. Padi B, Mohan A, Ganapathy S. Attention-based hybrid i-vector BLSTM model for language recognition. In: Interspeech, pp. 1263–1267 (2019)

  23. Cai W, Cai D, Huang S, Li M. Utterance-level end-to-end language identification using attention-based CNN-BLSTM. In: ICASSP 2019–2019 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 5991–5995 (2019). IEEE

  24. Padi B, Mohan A, Ganapathy S. Towards relevance and sequence modeling in language recognition. IEEE/ACM Trans Audio Speech Lang Process. 2020;28:1223–32.

  25. Deng J, Guo J, Xue N, Zafeiriou S. ArcFace: additive angular margin loss for deep face recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4690–4699 (2019)

  26. Cai W, Chen J, Li M. Exploring the encoding layer and loss function in end-to-end speaker and language recognition system. arXiv preprint arXiv:1804.05160 (2018)

  27. Tang Z, Wang D, Chen Y, Chen Q. AP17-OLR challenge: data, plan, and baseline. In: 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 749–753 (2017). IEEE

  28. Wang D, Li L, Tang D, Chen Q. AP16-OL7: a multilingual database for oriental languages and a language recognition baseline. In: 2016 Asia-Pacific signal and information processing association annual summit and conference (APSIPA), pp. 1–5 (2016). IEEE

  29. Li Z, Zhao M, Hong Q, Li L, Tang Z, Wang, D, Song L, Yang C. AP20-OLR challenge: three tasks and their baselines. In: 2020 Asia-Pacific signal and information processing association annual summit and conference (APSIPA ASC), pp. 550–555 (2020). IEEE

  30. Ma Z, Yu H. Language identification with deep bottleneck features. arXiv preprint arXiv:1809.08909 (2018)

  31. Ko T, Peddinti V, Povey D, Seltzer ML, Khudanpur S. A study on data augmentation of reverberant speech for robust speech recognition. In: 2017 IEEE International conference on acoustics, speech and signal processing (ICASSP), pp. 5220–5224 (2017). IEEE

  32. Snyder D, Chen G, Povey D. MUSAN: a music, speech, and noise corpus. arXiv preprint arXiv:1510.08484 (2015)

  33. Cai W, Chen J, Zhang J, Li M. On-the-fly data loader and utterance-level aggregation for speaker and language recognition. IEEE/ACM Trans Audio Speech Lang Process. 2020;28:1038–51.

  34. Snyder D, Garcia-Romero D, Sell G, Povey D, Khudanpur S. X-vectors: robust DNN embeddings for speaker recognition. In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 5329–5333 (2018). IEEE

  35. Qi Z, Ma Y, Gu M. A study on low-resource language identification. In: 2019 Asia-Pacific signal and information processing association annual summit and conference (APSIPA ASC), pp. 1897–1902 (2019). IEEE

  36. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, et al. PyTorch: an imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703 (2019)

  37. Padi B, Mohan A, Ganapathy S. Towards relevance and sequence modeling in language recognition. IEEE/ACM Trans Audio Speech Lang Process. 2020;28:1223–32.

  38. Tang Z, Wang D, Chen Q. AP18-OLR challenge: three tasks and their baselines. arXiv preprint arXiv:1806.00616 (2018)

  39. Fernando S, Sethu V, Ambikairajah E. Factorized hidden variability learning for adaptation of short duration language identification models. In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 5204–5208 (2018). IEEE

  40. Yu J, Guo M, Xie Y, Zhang J. Articulatory features based TDNN model for spoken language recognition. In: 2019 International conference on Asian language processing (IALP), pp. 308–312 (2019). IEEE

  41. Vuddagiri RK, Mandava T, Vydana HK, Vuppala AK. Multi-head self-attention networks for language identification. In: 2019 Twelfth international conference on contemporary computing (IC3), pp. 1–5 (2019). IEEE

  42. Fan Z, Li M, Zhou S, Xu B. Exploring wav2vec 2.0 on speaker verification and language identification. arXiv preprint arXiv:2012.06185 (2020)

  43. Li J, Wang B, Zhi Y, Li Z, Li L, Hong Q, Wang D. Oriental language recognition (OLR) 2020: summary and analysis. arXiv preprint arXiv:2107.05365 (2021)

  44. Li L, Li Z, Liu Y, Hong Q. Deep joint learning for language recognition. Neural Netw. 2021;141:72–86.

Acknowledgements

Not applicable.

Funding

This work was supported by the Fundamental Research Funds for the Central Universities (grant number 2021ZY87).

Author information

Authors and Affiliations

Authors

Contributions

The first author mainly performed the experiments and wrote the paper, and the other authors reviewed and edited the manuscript. All of the authors discussed the final results. All of the authors read and approved the final manuscript.

Corresponding authors

Correspondence to Yanyan Xu or Dengfeng Ke.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ju, M., Xu, Y., Ke, D. et al. Multi-domain Attention Fusion Network For Language Recognition. SN COMPUT. SCI. 4, 39 (2023). https://doi.org/10.1007/s42979-022-01447-9
