Abstract
In prior work on speaker verification, self-attention networks have attracted remarkable interest among end-to-end models and achieved strong results. In this paper, we propose an integrated framework for speaker verification (SV), the disentangled self-attention network (DSAN), which examines self-attention in depth. Building on the Transformer, the DSAN divides attention computation into two parts: a pairwise term that learns the relationship between two frames, and a unary term that learns the importance of each frame. The original self-attention mechanism trains these two terms jointly, which hinders the learning of each. We demonstrate the effectiveness of this modification on the speaker verification task: models trained on TIMIT, AISHELL-1, and VoxCeleb show significant improvement over LSTM and conventional self-attention baselines, and the disentanglement also improves the interpretability of the model. Our best model yields an equal error rate (EER) of 0.91% on TIMIT and 2.11% on VoxCeleb-E.
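The pairwise/unary decomposition described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the projection names (`Wq`, `Wk`, `Wv`), the mean-subtraction ("whitening") used to isolate the pairwise term, and the scaling are assumptions modeled on the disentangled non-local formulation the paper builds on, where the whitened query-key product and a separate per-frame salience term are normalized independently and then combined.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def disentangled_attention(X, Wq, Wk, Wv):
    """Sketch of disentangled self-attention over T frames.

    X: (T, d) frame-level features; Wq, Wk, Wv: (d, d) projections.
    The attention map is split into a pairwise term (whitened
    query-key dot product, frame-to-frame relations) and a unary
    term (per-frame importance), each normalized separately.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    mu_q = Q.mean(axis=0, keepdims=True)  # mean query, shape (1, d)
    mu_k = K.mean(axis=0, keepdims=True)  # mean key, shape (1, d)

    # Pairwise term: relations between whitened (mean-subtracted) frames
    pairwise = softmax((Q - mu_q) @ (K - mu_k).T / np.sqrt(d))  # (T, T)
    # Unary term: importance of each frame, shared across all queries
    unary = softmax(mu_q @ K.T / np.sqrt(d))                    # (1, T)

    attn = pairwise + unary  # unary broadcasts over query rows
    return attn @ V          # (T, d) attended features
```

Because the two terms are normalized before being summed, the gradients of the pairwise relation and the per-frame salience no longer interfere through a shared softmax, which is the joint-training problem the abstract points to.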
Acknowledgement
This work was funded by the National Natural Science Foundation of China (Grant Nos. 61772337 and U1736207).
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Guo, J., Ma, Z., Zhao, H., Liu, G., Li, X. (2021). Speaker Verification with Disentangled Self-attention. In: Mantoro, T., Lee, M., Ayu, M.A., Wong, K.W., Hidayanto, A.N. (eds.) Neural Information Processing. ICONIP 2021. Lecture Notes in Computer Science, vol. 13108. Springer, Cham. https://doi.org/10.1007/978-3-030-92185-9_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-92184-2
Online ISBN: 978-3-030-92185-9