Abstract
Recently, convolutional neural networks (CNNs) have been widely used in speaker verification and have achieved state-of-the-art performance on most mainstream datasets, such as NIST SREs, VoxCeleb, and CNCeleb. However, when speaker classification is performed with one-hot coding, the weight shape of the last fully-connected layer is \( B \times N \), where \( B \) is the mini-batch size and \( N \) is the number of speakers, so the layer requires more and more GPU memory as the number of speakers increases. To address this problem, we bring the virtual fully-connected (Virtual FC) layer from the field of face recognition to large-scale speaker verification. It uses a re-grouping strategy to map \( N \) to \( M \) (\( M \) is a hyperparameter less than \( N \)), so that the number of weight parameters in this layer is reduced to \( M/N \) of the original.
We also explore the effect of the number of utterances per speaker in each mini-batch on performance.
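The parameter saving described above can be illustrated with a small arithmetic sketch. It is not the authors' implementation: the modulo rule in `group_of`, the helper names, and the values chosen for `ROWS`, `N`, and `M` are all assumptions made for illustration. The only property taken from the abstract is that the class dimension shrinks from \( N \) to \( M \) while the other dimension of the weight matrix stays fixed.

```python
def group_of(speaker_id: int, num_groups: int) -> int:
    """Map a speaker label in [0, N) to a group label in [0, M).

    Assumption: a plain modulo assignment stands in for the
    re-grouping strategy; the paper's actual rule may differ.
    """
    return speaker_id % num_groups


def fc_weight_count(rows: int, num_classes: int) -> int:
    """Number of weights in a layer whose weight shape is rows x num_classes."""
    return rows * num_classes


# Hypothetical sizes, chosen only to make the ratio easy to read.
ROWS = 256       # fixed first dimension of the weight matrix
N = 100_000      # number of speakers
M = 1_000        # number of groups, a hyperparameter with M < N

full = fc_weight_count(ROWS, N)      # ordinary FC layer over N speakers
virtual = fc_weight_count(ROWS, M)   # layer over M groups
ratio = virtual / full               # equals M / N

print(f"full FC weights:    {full}")
print(f"virtual FC weights: {virtual}")
print(f"ratio (M/N):        {ratio}")
print(f"speaker 12345 -> group {group_of(12345, M)}")
```

With these illustrative sizes, the layer over groups holds one hundredth of the weights of the ordinary layer, because the ratio depends only on \( M/N \) and not on the fixed dimension.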
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Song, Z., He, L., Fang, Z., Hu, Y., Huang, H. (2022). Virtual Fully-Connected Layer for a Large-Scale Speaker Verification Dataset. In: Deng, W., et al. Biometric Recognition. CCBR 2022. Lecture Notes in Computer Science, vol 13628. Springer, Cham. https://doi.org/10.1007/978-3-031-20233-9_39
DOI: https://doi.org/10.1007/978-3-031-20233-9_39
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20232-2
Online ISBN: 978-3-031-20233-9
eBook Packages: Computer Science, Computer Science (R0)