
Speaker Verification with Disentangled Self-attention

  • Conference paper
Neural Information Processing (ICONIP 2021)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 13108)


Abstract

In prior work on speaker verification, self-attention networks have attracted considerable interest among end-to-end models and achieved strong results. In this paper, we propose the disentangled self-attention network (DSAN), an integrated framework for speaker verification (SV) that examines self-attention in depth. Building on the Transformer, the attention computation in DSAN is divided into two parts: a pairwise term that learns the relationship between two frames and a unary term that learns the importance of each frame. The original self-attention mechanism trains these two terms jointly, which hinders the learning of each. We demonstrate the effectiveness of this modification on the speaker verification task: the proposed model, trained on TIMIT, AISHELL-1, and VoxCeleb, shows significant improvement over LSTM and conventional self-attention networks, and it also improves the interpretability of the model. Our best results yield an equal error rate (EER) of 0.91% on TIMIT and 2.11% on VoxCeleb-E.
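The abstract describes splitting the attention computation into a pairwise term (frame-to-frame relations) and a unary term (per-frame importance) that are learned with separate parameters. The following is a minimal, hypothetical sketch of such a decomposition for frame-level features; the module name, projection layers, shapes, and the way the two terms are combined are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a disentangled self-attention layer for
# frame-level speaker features. All layer names, dimensions, and the
# exact combination of the two terms are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DisentangledSelfAttention(nn.Module):
    """Attention split into a pairwise term (relations between frames)
    and a unary term (importance of each frame), each with its own
    parameters so the two terms can be learned independently."""

    def __init__(self, dim: int, attn_dim: int = 64):
        super().__init__()
        self.query = nn.Linear(dim, attn_dim)   # pairwise: query projection
        self.key = nn.Linear(dim, attn_dim)     # pairwise: key projection
        self.unary = nn.Linear(dim, 1)          # unary: per-frame importance
        self.value = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) frame-level acoustic features
        q, k = self.query(x), self.key(x)
        # Center q and k over time so the pairwise term captures only
        # relative frame-to-frame relations, not per-frame salience.
        q = q - q.mean(dim=1, keepdim=True)
        k = k - k.mean(dim=1, keepdim=True)
        pairwise = F.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
        # Unary term: one importance weight per frame, shared by all queries.
        unary = F.softmax(self.unary(x).transpose(1, 2), dim=-1)  # (B, 1, T)
        attn = pairwise + unary                 # disentangled combination
        return attn @ self.value(x)             # (batch, frames, dim)


# Usage example: 200 frames of 80-dim features in, same shape out.
feats = torch.randn(4, 200, 80)
out = DisentangledSelfAttention(dim=80)(feats)
print(out.shape)  # torch.Size([4, 200, 80])
```

In this sketch the separation is enforced simply by giving each term its own parameters and adding the two attention maps, whereas a standard self-attention layer would entangle both effects in a single query-key product.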



Acknowledgement

This research work has been funded by the National Natural Science Foundation of China (Grant No. 61772337, U1736207).

Author information


Corresponding author

Correspondence to Gongshen Liu.



Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Guo, J., Ma, Z., Zhao, H., Liu, G., Li, X. (2021). Speaker Verification with Disentangled Self-attention. In: Mantoro, T., Lee, M., Ayu, M.A., Wong, K.W., Hidayanto, A.N. (eds) Neural Information Processing. ICONIP 2021. Lecture Notes in Computer Science, vol. 13108. Springer, Cham. https://doi.org/10.1007/978-3-030-92185-9_3


  • DOI: https://doi.org/10.1007/978-3-030-92185-9_3


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-92184-2

  • Online ISBN: 978-3-030-92185-9

  • eBook Packages: Computer Science; Computer Science (R0)
