Abstract
In prior work on speaker verification, self-attention networks have attracted remarkable interest among end-to-end models and achieved strong results. In this paper, we propose an integrated framework for speaker verification (SV), the disentangled self-attention network (DSAN), which examines self-attention in depth. Building on the Transformer, the DSAN divides attention computation into two parts: a pairwise term that learns the relationship between two frames, and a unary term that learns the importance of each frame. The original self-attention mechanism trains these two terms jointly, which hinders the learning of each. We demonstrate the effectiveness of this modification on the speaker verification task: models trained on TIMIT, AISHELL-1, and VoxCeleb show significant improvement over LSTM and conventional self-attention baselines, and the disentanglement also improves the interpretability of the model. Our best model yields an equal error rate (EER) of 0.91% on TIMIT and 2.11% on VoxCeleb-E.
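The pairwise/unary decomposition described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the projection names (`Wq`, `Wk`, `Wv`), the mean-subtraction ("whitening") used to isolate the pairwise term, and the scaling are assumptions modeled on the disentangled non-local formulation the paper builds on, where the whitened query-key product and a separate per-frame salience term are normalized independently and then combined.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def disentangled_attention(X, Wq, Wk, Wv):
    """Sketch of disentangled self-attention over T frames.

    X: (T, d) frame-level features; Wq, Wk, Wv: (d, d) projections.
    The attention map is split into a pairwise term (whitened
    query-key dot product, frame-to-frame relations) and a unary
    term (per-frame importance), each normalized separately.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    mu_q = Q.mean(axis=0, keepdims=True)  # mean query, shape (1, d)
    mu_k = K.mean(axis=0, keepdims=True)  # mean key, shape (1, d)

    # Pairwise term: relations between whitened (mean-subtracted) frames
    pairwise = softmax((Q - mu_q) @ (K - mu_k).T / np.sqrt(d))  # (T, T)
    # Unary term: importance of each frame, shared across all queries
    unary = softmax(mu_q @ K.T / np.sqrt(d))                    # (1, T)

    attn = pairwise + unary  # unary broadcasts over query rows
    return attn @ V          # (T, d) attended features
```

Because the two terms are normalized before being summed, the gradients of the pairwise relation and the per-frame salience no longer interfere through a shared softmax, which is the joint-training problem the abstract points to.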
Acknowledgement
This work was funded by the National Natural Science Foundation of China (Grant Nos. 61772337 and U1736207).
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Guo, J., Ma, Z., Zhao, H., Liu, G., Li, X. (2021). Speaker Verification with Disentangled Self-attention. In: Mantoro, T., Lee, M., Ayu, M.A., Wong, K.W., Hidayanto, A.N. (eds.) Neural Information Processing. ICONIP 2021. Lecture Notes in Computer Science, vol. 13108. Springer, Cham. https://doi.org/10.1007/978-3-030-92185-9_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-92184-2
Online ISBN: 978-3-030-92185-9