Abstract
The speaker encoder is an important front-end module that extracts discriminative speaker features for many speech applications requiring speaker information. Current speaker encoders aggregate multi-scale features from utterances using multi-branch network architectures. However, naively adding branches through fully convolutional operations cannot efficiently improve the ability to capture multi-scale features, because the model parameters and computational complexity grow rapidly. Consequently, current network architectures include only a few branches, covering a limited number of temporal scales for capturing speaker features. To address this problem, this paper proposes an effective temporal multi-scale (TMS) model in which multi-scale branches can be designed efficiently in a speaker encoder with only a negligible increase in computational cost. The TMS model is based on a time-delay neural network (TDNN), whose architecture is separated into a channel-modeling operator and a temporal multi-branch modeling operator. In the TMS model, adding temporal multi-scale branches to the temporal multi-branch operator only slightly increases the number of model parameters, which leaves more of the computational budget for branches with large temporal scales. After training, we further develop a systematic re-parameterization method that converts the multi-branch network topology into a single-path topology to increase inference speed. We conducted automatic speaker verification (ASV) experiments under in-domain (VoxCeleb) and out-of-domain (CNCeleb) conditions to evaluate the proposed TMS model. Experimental results show that the TMS-based model outperformed state-of-the-art ASV models (e.g., ECAPA-TDNN) and improved robustness. Moreover, the proposed model achieved a 29%–46% increase in inference speed compared with ECAPA-TDNN.
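The re-parameterization step described in the abstract relies on the linearity of convolution: after training, parallel convolutional branches can be summed into a single equivalent kernel, so inference runs on a single path. The following is a minimal NumPy sketch of that principle only; it is a toy illustration, not the paper's exact procedure, and the `conv1d` helper and kernel values are hypothetical.

```python
import numpy as np

def conv1d(x, w):
    """'Same'-padded 1D convolution (cross-correlation) of signal x with kernel w."""
    k = len(w)
    pad = k // 2
    xp = np.pad(x, pad)
    return np.array([np.dot(xp[i:i + k], w) for i in range(len(x))])

# Two parallel trained branches: a 3-tap kernel and a 1-tap (pointwise) kernel.
w3 = np.array([0.2, 0.5, 0.3])
w1 = np.array([0.8])

# Re-parameterization: zero-pad the 1-tap kernel to 3 taps and sum the kernels,
# collapsing the two branches into one equivalent convolution.
w_merged = w3 + np.pad(w1, 1)   # -> [0.2, 1.3, 0.3]

x = np.random.default_rng(0).standard_normal(16)
y_multi = conv1d(x, w3) + conv1d(x, w1)   # multi-branch inference
y_single = conv1d(x, w_merged)            # single-path inference

assert np.allclose(y_multi, y_single)
```

Because the merged kernel reproduces the summed branch outputs exactly, the single-path model is mathematically equivalent to the multi-branch one while avoiding the per-branch memory traffic at inference time.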
Data Availability
The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.
Notes
The code of ECAPA-TDNN is available at https://github.com/speechbrain/speechbrain/lobes/models/ECAPA_TDNN.py.
References
Mittal A, Dua M (2022) Automatic speaker verification systems and spoof detection techniques: review and analysis. Int J Speech Technol 1–30
Xu J, Wang X, Feng B, Liu W (2020) Deep multi-metric learning for text-independent speaker verification. Neurocomputing 410:394–400
Tranter SE, Reynolds DA (2006) An overview of automatic speaker diarization systems. IEEE Trans Audio Speech Lang Process 14(5):1557–1565
Wang W, Lin Q, Cai D, Li M (2022) Similarity measurement of segment-level speaker embeddings in speaker diarization. IEEE/ACM Trans Audio Speech Lang Process 30:2645–2658
Snyder D, Garcia-Romero D, Povey D, Khudanpur S (2017) Deep neural network embeddings for text-independent speaker verification. In: Proceedings interspeech, pp 999–1003
Chen X, Bao C (2021) Phoneme-unit-specific time-delay neural network for speaker verification. IEEE/ACM Trans Audio Speech Lang Process 29:1243–1255. https://doi.org/10.1109/TASLP.2021.3065202
Waibel A, Hanazawa T, Hinton G, Shikano K, Lang KJ (1989) Phoneme recognition using time-delay neural networks. IEEE Trans Acoustics Speech Signal Process 37(3):328–339
Snyder D, Garcia-Romero D, Sell G, McCree A, Povey D, Khudanpur S (2019) Speaker recognition for multi-speaker conversations using x-vectors. In: Proceedings ICASSP, pp 5796–5800
Povey D, Cheng G, Wang Y, Li K, Xu H, Yarmohammadi M, Khudanpur S (2018) Semi-orthogonal low-rank matrix factorization for deep neural networks. In: Interspeech, pp 3743–3747
Zhu Y, Mak B (2023) Bayesian self-attentive speaker embeddings for text-independent speaker verification. IEEE/ACM Trans Audio Speech Lang Process 31:1000–1012
Zhu H, Lee KA, Li H (2022) Discriminative speaker embedding with serialized multi-layer multi-head attention. Speech Commun 144:89–100
Wu Y, Guo C, Gao H, Xu J, Bai G (2020) Dilated residual networks with multi-level attention for speaker verification. Neurocomputing 412:177–186
Gu B, Guo W, Zhang J (2023) Memory storable network based feature aggregation for speaker representation learning. IEEE/ACM Trans Audio Speech Lang Process 31:643–655
Zhang R, Wei J, Lu W, Wang L, Liu M, Zhang L, Jin J, Xu J (2020) Aret: Aggregated residual extended time-delay neural networks for speaker verification. In: Proceedings interspeech, pp 946–950
Shen H, Y Y, Sun G, Langman R, Han E, Droppo J, Stolcke A (2022) Improving fairness in speaker verification via Group-adapted Fusion Network. In: Proceedings ICASSP, pp 7077–7081. IEEE
Liu W, Wen Y, Yu Z, Li M, Raj B, Song L (2017) Sphereface: Deep hypersphere embedding for face recognition. In: Proceedings CVPR, pp 212–220
Wang F, Cheng J, Liu W, Liu H (2018) Additive margin softmax for face verification. IEEE Signal Process Lett 25(7):926–930
Deng J, Guo J, Xue N, Zafeiriou S (2019) Arcface: Additive angular margin loss for deep face recognition. In: Proceedings CVPR, pp 4690–4699
Gao S, Cheng M-M, Zhao K, Zhang X-Y, Yang M-H, Torr PH (2019) Res2net: A new multi-scale backbone architecture. IEEE Trans Pattern Anal Mach Intell
Laver J (1994) Principles of Phonetics. Cambridge University Press
Kitamura T, Honda K, Takemoto H (2005) Individual variation of the hypopharyngeal cavities and its acoustic effects. Acoust Sci Technol 26(1):16–26
Takemoto H, Adachi S, Kitamura T, Mokhtari P, Honda K (2006) Acoustic roles of the laryngeal cavity in vocal tract resonance. J Acoust Soc Am 120(4):2228–2238
Qin Y, Ren Q, Mao Q, Chen J (2023) Multi-branch feature aggregation based on multiple weighting for speaker verification. Comput Speech Lang 77:101426
Desplanques B, Thienpondt J, Demuynck K (2020) Ecapa-tdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification. In: Proceedings Interspeech, pp 3830–3834
Alenin A, Okhotnikov A, Makarov R, Torgashov N, Shigabeev I, Simonchik K (2021) The ID R&D system description for short-duration speaker verification challenge 2021. In: Proceedings interspeech, pp 2297–2301
Zeinali H, Lee KA, Alam J, Burget L (2020) Sdsv challenge 2020: Large-scale evaluation of short-duration speaker verification. In: Proceedings interspeech, pp 731–735
Ding X, Zhang X, Ma N, Han J, Ding G, Sun J (2021) Repvgg: Making vgg-style convnets great again. In: Proceedings CVPR, pp 13733–13742
Ma N, Zhang X, Zheng H-T, Sun J (2018) Shufflenet v2: Practical guidelines for efficient cnn architecture design. In: Proceedings ECCV, pp 116–131
Povey D, Ghoshal A, Boulianne G, Burget L, Glembek O, Goel N, Hannemann M, Motlicek P, Qian Y, Schwarz P, et al (2011) The kaldi speech recognition toolkit. In: IEEE 2011 Workshop on automatic speech recognition and understanding. IEEE Signal processing society
Zhang R, Wei J, Lu W, Zhang L, Ji Y, Xu J, Lu X (2022) CS-REP: Making speaker verification networks embracing re-parameterization. In: Proceedings ICASSP, pp 7082–7086. IEEE
Yu Y-Q, Zheng S, Suo H, Lei Y, Li W-J (2021) Cam: Context-aware masking for robust speaker verification. In: Proceedings ICASSP, pp 6703–6707
Szegedy C, Ioffe S, Vanhoucke V, Alemi AA (2017) Inception-v4, inception-resnet and the impact of residual connections on learning. In: Proceedings AAAI, pp 4278–4284
Li Z, Xiao R, Chen H, Zhao Z, Wang W, Zhang P (2023) How to make embeddings suitable for PLDA. Comput Speech Lang 81:101523
Chollet F (2017) Xception: Deep learning with depthwise separable convolutions. In: Proceedings CVPR, pp 1251–1258
Koluguri NR, Li J, Lavrukhin V, Ginsburg B (2020) Speakernet: 1d depth-wise separable convolutional network for text-independent speaker recognition and verification. arXiv:2010.12653
Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings CVPR, pp 7132–7141
Chung JS, Nagrani A, Zisserman A (2018) Voxceleb2: Deep speaker recognition. In: Proceedings interspeech, pp 1086–1090
Nagrani A, Chung JS, Zisserman A (2017) Voxceleb: a large-scale speaker identification dataset. In: Proceedings interspeech, pp 2616–2620
Li L, Liu R, Kang J, Fan Y, Cui H, Cai Y, Vipperla R, Zheng TF, Wang D (2022) Cn-celeb: multi-genre speaker recognition. Speech Commun 137:77–91
Prince SJ, Elder JH (2007) Probabilistic linear discriminant analysis for inferences about identity. In: Proceedings ICCV, pp 1–8
Nagrani A, Chung JS, Xie W, Zisserman A (2020) Voxceleb: Large-scale speaker verification in the wild. Comput Speech Lang 60:101027
Cumani S, Batzu PD, Colibro D, Vair C, Laface P, Vasilakakis V (2011) Comparison of speaker recognition approaches for real applications. In: Proceedings interspeech, pp 2365–2368
Martin AF, Greenberg CS (2009) Nist 2008 speaker recognition evaluation: Performance across telephone and room microphone channels. In: Proceedings interspeech, pp 2579–2582
Qian Y, Chen Z, Wang S (2021) Audio-visual deep neural network for robust person verification. IEEE/ACM Trans Audio Speech Lang Process 29:1079–1092
Zhou T, Zhao Y, Wu J (2021) Resnext and res2net structures for speaker verification. In: 2021 IEEE Spoken Language Technology Workshop (SLT), pp 301–307. IEEE
Bai Z, Wang J, Zhang X-L, Chen J (2022) End-to-end speaker verification via curriculum bipartite ranking weighted binary cross-entropy. IEEE/ACM Trans Audio Speech Lang Process 30:1330–1344
Wu Y, Guo C, Zhao J, Jin X, Xu J (2022) RSKNet-MTSP: Effective and portable deep architecture for speaker verification. Neurocomputing 511:259–272
Cai Y, Li L, Abel A, Zhu X, Wang D (2021) Deep normalization for speaker vectors. IEEE/ACM Trans Audio Speech Lang Process 29:733–744. https://doi.org/10.1109/TASLP.2020.3039573
Acknowledgements
This work was supported by the National Key R&D Program of China (No. 2020YFC2004103), the Qinghai Science and Technology Program (No. 2022-ZJ-T05), the Tianjin Science and Technology Program (No. 21JCZXJC00190), and the National Natural Science Foundation of China (No. 62176181).
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflicts of interest
The authors declare that they have no conflict of interest related to this work.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, R., Wei, J., Lu, X. et al. TMS: Temporal multi-scale in time-delay neural network for speaker verification. Appl Intell 53, 26497–26517 (2023). https://doi.org/10.1007/s10489-023-04953-2
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10489-023-04953-2