Virtual Fully-Connected Layer for a Large-Scale Speaker Verification Dataset

Song, Zhida; He, Liang; Fang, Zhihua; Hu, Ying; Huang, Hao

doi:10.1007/978-3-031-20233-9_39

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13628)

Included in the following conference series:

Chinese Conference on Biometric Recognition

Abstract

Recently, convolutional neural networks (CNNs) have been widely used in speaker verification and have achieved state-of-the-art performance on most major datasets, such as NIST SREs, VoxCeleb, and CNCeleb. However, when speaker classification uses one-hot coding, the weight shape of the last fully-connected layer is \( B \times N \), where B is the mini-batch size and N is the number of speakers, so the layer requires more GPU memory as the number of speakers increases. To address this problem, we bring the virtual fully-connected (Virtual FC) layer from face recognition to large-scale speaker verification. Its re-grouping strategy maps N to M (M is a hyperparameter less than N), so that the number of weight parameters in this layer becomes M/N times the original.

We also explored the effect of the number of utterances per speaker in each mini-batch on performance.
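As a rough illustration of the M/N reduction stated in the abstract, the sketch below counts the weights of a layer with N output columns and of one with M output columns. The sizes are hypothetical and not taken from the paper, and the modulo rule in `regroup_label` is only an assumed stand-in, since the abstract does not spell out the actual re-grouping strategy.

```python
def fc_weight_count(rows: int, num_classes: int) -> int:
    """Number of weights in a rows x num_classes fully-connected layer."""
    return rows * num_classes


def regroup_label(speaker_id: int, num_groups: int) -> int:
    """Map a speaker label in [0, N) to a group index in [0, M).

    Assumption: plain modulo assignment, used here only for illustration.
    """
    return speaker_id % num_groups


# Hypothetical sizes, not taken from the paper.
ROWS = 512            # first dimension of the layer (B in the abstract)
N_SPEAKERS = 100_000  # N, the number of speakers
N_GROUPS = 1_000      # M, a hyperparameter with M < N

full = fc_weight_count(ROWS, N_SPEAKERS)   # original layer
virtual = fc_weight_count(ROWS, N_GROUPS)  # layer after mapping N to M
ratio = virtual / full                     # equals M / N

print(f"original weights:   {full}")
print(f"virtual FC weights: {virtual}")
print(f"ratio (M/N):        {ratio}")
```

With these assumed sizes the second layer holds M/N = 1/100 of the original weights, whatever the value of the first dimension, because that dimension cancels in the ratio.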

Author information

Authors and Affiliations

School of Information Science and Engineering, Xinjiang University, Urumqi, China
Zhida Song, Liang He, Zhihua Fang, Ying Hu & Hao Huang
Department of Electronic Engineering, Tsinghua University, Beijing, China
Liang He

Corresponding author

Correspondence to Liang He.

Editor information

Editors and Affiliations

Beijing University of Posts and Telecommunications, Beijing, China
Weihong Deng
Tsinghua University, Beijing, China
Jianjiang Feng
Beihang University, Beijing, China
Di Huang
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Meina Kan
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Zhenan Sun
Tsinghua University, Beijing, China
Fang Zheng
China Electronics Standardization Institute, Beijing, China
Wenfeng Wang
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Zhaofeng He

About this paper

Cite this paper

Song, Z., He, L., Fang, Z., Hu, Y., Huang, H. (2022). Virtual Fully-Connected Layer for a Large-Scale Speaker Verification Dataset. In: Deng, W., et al. Biometric Recognition. CCBR 2022. Lecture Notes in Computer Science, vol 13628. Springer, Cham. https://doi.org/10.1007/978-3-031-20233-9_39

DOI: https://doi.org/10.1007/978-3-031-20233-9_39
Published: 03 November 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20232-2
Online ISBN: 978-3-031-20233-9
eBook Packages: Computer Science, Computer Science (R0)
