Abstract
The speaker encoder is an important front-end module that extracts discriminative speaker features for many speech applications requiring speaker information. Current speaker encoders aggregate multi-scale features from utterances using multi-branch network architectures. However, naively adding branches through fully convolutional operations cannot efficiently improve the ability to capture multi-scale features, because the model parameters and computational complexity grow rapidly. Consequently, current network architectures include only a few branches, covering a limited number of temporal scales for capturing speaker features. To address this problem, this paper proposes an effective temporal multi-scale (TMS) model in which multi-scale branches can be designed efficiently in a speaker encoder with only a negligible increase in computational cost. The TMS model is based on a time-delay neural network (TDNN), whose architecture is separated into a channel-modeling operator and a temporal multi-branch modeling operator. In the TMS model, adding temporal multi-scale branches to the temporal multi-branch operator only slightly increases the number of model parameters, which leaves more of the computational budget for branches with large temporal scales. After training, we further develop a systematic re-parameterization method that converts the multi-branch network topology into a single-path topology to increase inference speed. We conducted automatic speaker verification (ASV) experiments under in-domain (VoxCeleb) and out-of-domain (CNCeleb) conditions to evaluate the proposed TMS model. Experimental results show that the TMS-based model outperformed state-of-the-art ASV models (e.g., ECAPA-TDNN) and improved robustness. Moreover, the proposed model achieved a 29%–46% increase in inference speed compared with ECAPA-TDNN.
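The re-parameterization step described in the abstract relies on the linearity of convolution: after training, parallel convolutional branches can be summed into a single equivalent kernel, so inference runs on a single path. The following is a minimal NumPy sketch of that principle only; it is a toy illustration, not the paper's exact procedure, and the `conv1d` helper and kernel values are hypothetical.

```python
import numpy as np

def conv1d(x, w):
    """'Same'-padded 1D convolution (cross-correlation) of signal x with kernel w."""
    k = len(w)
    pad = k // 2
    xp = np.pad(x, pad)
    return np.array([np.dot(xp[i:i + k], w) for i in range(len(x))])

# Two parallel trained branches: a 3-tap kernel and a 1-tap (pointwise) kernel.
w3 = np.array([0.2, 0.5, 0.3])
w1 = np.array([0.8])

# Re-parameterization: zero-pad the 1-tap kernel to 3 taps and sum the kernels,
# collapsing the two branches into one equivalent convolution.
w_merged = w3 + np.pad(w1, 1)   # -> [0.2, 1.3, 0.3]

x = np.random.default_rng(0).standard_normal(16)
y_multi = conv1d(x, w3) + conv1d(x, w1)   # multi-branch inference
y_single = conv1d(x, w_merged)            # single-path inference

assert np.allclose(y_multi, y_single)
```

Because the merged kernel reproduces the summed branch outputs exactly, the single-path model is mathematically equivalent to the multi-branch one while avoiding the per-branch memory traffic at inference time.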
Data Availability
The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.
Notes
The code of ECAPA-TDNN is available at https://github.com/speechbrain/speechbrain/lobes/models/ECAPA_TDNN.py.
References
Mittal A, Dua M (2022) Automatic speaker verification systems and spoof detection techniques: review and analysis. Int J Speech Technol 1–30
Xu J, Wang X, Feng B, Liu W (2020) Deep multi-metric learning for text-independent speaker verification. Neurocomputing 410:394–400
Tranter SE, Reynolds DA (2006) An overview of automatic speaker diarization systems. IEEE Trans Audio Speech Lang Process 14(5):1557–1565
Wang W, Lin Q, Cai D, Li M (2022) Similarity measurement of segment-level speaker embeddings in speaker diarization. IEEE/ACM Trans Audio Speech Lang Process 30:2645–2658
Snyder D, Garcia-Romero D, Povey D, Khudanpur S (2017) Deep neural network embeddings for text-independent speaker verification. In: Proceedings interspeech, pp 999–1003
Chen X, Bao C (2021) Phoneme-unit-specific time-delay neural network for speaker verification. IEEE/ACM Trans Audio Speech Lang Process 29:1243–1255. https://doi.org/10.1109/TASLP.2021.3065202
Waibel A, Hanazawa T, Hinton G, Shikano K, Lang KJ (1989) Phoneme recognition using time-delay neural networks. IEEE Trans Acoustics Speech Signal Process 37(3):328–339
Snyder D, Garcia-Romero D, Sell G, McCree A, Povey D, Khudanpur S (2019) Speaker recognition for multi-speaker conversations using x-vectors. In: Proceedings ICASSP, pp 5796–5800
Povey D, Cheng G, Wang Y, Li K, Xu H, Yarmohammadi M, Khudanpur S (2018) Semi-orthogonal low-rank matrix factorization for deep neural networks. In: Interspeech, pp 3743–3747
Zhu Y, Mak B (2023) Bayesian self-attentive speaker embeddings for text-independent speaker verification. IEEE/ACM Trans Audio Speech Lang Process 31:1000–1012
Zhu H, Lee KA, Li H (2022) Discriminative speaker embedding with serialized multi-layer multi-head attention. Speech Commun 144:89–100
Wu Y, Guo C, Gao H, Xu J, Bai G (2020) Dilated residual networks with multi-level attention for speaker verification. Neurocomputing 412:177–186
Gu B, Guo W, Zhang J (2023) Memory storable network based feature aggregation for speaker representation learning. IEEE/ACM Trans Audio Speech Lang Process 31:643–655
Zhang R, Wei J, Lu W, Wang L, Liu M, Zhang L, Jin J, Xu J (2020) Aret: Aggregated residual extended time-delay neural networks for speaker verification. In: Proceedings interspeech, pp 946–950
Shen H, Y Y, Sun G, Langman R, Han E, Droppo J, Stolcke A (2022) Improving fairness in speaker verification via Group-adapted Fusion Network. In: Proceedings ICASSP, pp 7077–7081. IEEE
Liu W, Wen Y, Yu Z, Li M, Raj B, Song L (2017) Sphereface: Deep hypersphere embedding for face recognition. In: Proceedings CVPR, pp 212–220
Wang F, Cheng J, Liu W, Liu H (2018) Additive margin softmax for face verification. IEEE Signal Process Lett 25(7):926–930
Deng J, Guo J, Xue N, Zafeiriou S (2019) Arcface: Additive angular margin loss for deep face recognition. In: Proceedings CVPR, pp 4690–4699
Gao S, Cheng M-M, Zhao K, Zhang X-Y, Yang M-H, Torr PH (2019) Res2net: A new multi-scale backbone architecture. IEEE Trans Pattern Anal Mach Intell
Laver J (1994) Principles of Phonetics. Cambridge University Press
Kitamura T, Honda K, Takemoto H (2005) Individual variation of the hypopharyngeal cavities and its acoustic effects. Acoust Sci Technol 26(1):16–26
Takemoto H, Adachi S, Kitamura T, Mokhtari P, Honda K (2006) Acoustic roles of the laryngeal cavity in vocal tract resonance. J Acoust Soc Am 120(4):2228–2238
Qin Y, Ren Q, Mao Q, Chen J (2023) Multi-branch feature aggregation based on multiple weighting for speaker verification. Comput Speech Lang 77:101426
Desplanques B, Thienpondt J, Demuynck K (2020) Ecapa-tdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification. In: Proceedings Interspeech, pp 3830–3834
Alenin A, Okhotnikov A, Makarov R, Torgashov N, Shigabeev I, Simonchik K (2021) The ID R&D system description for short-duration speaker verification challenge 2021. In: Proceedings interspeech, pp 2297–2301
Zeinali H, Lee KA, Alam J, Burget L (2020) Sdsv challenge 2020: Large-scale evaluation of short-duration speaker verification. In: Proceedings interspeech, pp 731–735
Ding X, Zhang X, Ma N, Han J, Ding G, Sun J (2021) Repvgg: Making vgg-style convnets great again. In: Proceedings CVPR, pp 13733–13742
Ma N, Zhang X, Zheng H-T, Sun J (2018) Shufflenet v2: Practical guidelines for efficient cnn architecture design. In: Proceedings ECCV, pp 116–131
Povey D, Ghoshal A, Boulianne G, Burget L, Glembek O, Goel N, Hannemann M, Motlicek P, Qian Y, Schwarz P, et al (2011) The kaldi speech recognition toolkit. In: IEEE 2011 Workshop on automatic speech recognition and understanding. IEEE Signal processing society
Zhang R, Wei J, Lu W, Zhang L, Ji Y, Xu J, Lu X (2022) CS-REP: Making speaker verification networks embracing re-parameterization. In: Proceedings ICASSP, pp 7082–7086. IEEE
Yu Y-Q, Zheng S, Suo H, Lei Y, Li W-J (2021) Cam: Context-aware masking for robust speaker verification. In: Proceedings ICASSP, pp 6703–6707
Szegedy C, Ioffe S, Vanhoucke V, Alemi AA (2017) Inception-v4, inception-resnet and the impact of residual connections on learning. In: Proceedings AAAI, pp 4278–4284
Li Z, Xiao R, Chen H, Zhao Z, Wang W, Zhang P (2023) How to make embeddings suitable for PLDA. Comput Speech Lang 81:101523
Chollet F (2017) Xception: Deep learning with depthwise separable convolutions. In: Proceedings CVPR, pp 1251–1258
Koluguri NR, Li J, Lavrukhin V, Ginsburg B (2020) Speakernet: 1d depth-wise separable convolutional network for text-independent speaker recognition and verification. arXiv:2010.12653
Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings CVPR, pp 7132–7141
Chung JS, Nagrani A, Zisserman A (2018) Voxceleb2: Deep speaker recognition. In: Proceedings interspeech, pp 1086–1090
Nagrani A, Chung JS, Zisserman A (2017) Voxceleb: a large-scale speaker identification dataset. In: Proceedings interspeech, pp 2616–2620
Li L, Liu R, Kang J, Fan Y, Cui H, Cai Y, Vipperla R, Zheng TF, Wang D (2022) Cn-celeb: multi-genre speaker recognition. Speech Commun 137:77–91
Prince SJ, Elder JH (2007) Probabilistic linear discriminant analysis for inferences about identity. In: Proceedings ICCV, pp 1–8
Nagrani A, Chung JS, Xie W, Zisserman A (2020) Voxceleb: Large-scale speaker verification in the wild. Comput Speech Lang 60:101027
Cumani S, Batzu PD, Colibro D, Vair C, Laface P, Vasilakakis V (2011) Comparison of speaker recognition approaches for real applications. In: Proceedings interspeech, pp 2365–2368
Martin AF, Greenberg CS (2009) Nist 2008 speaker recognition evaluation: Performance across telephone and room microphone channels. In: Proceedings interspeech, pp 2579–2582
Qian Y, Chen Z, Wang S (2021) Audio-visual deep neural network for robust person verification. IEEE/ACM Trans Audio Speech Lang Process 29:1079–1092
Zhou T, Zhao Y, Wu J (2021) Resnext and res2net structures for speaker verification. In: 2021 IEEE Spoken Language Technology Workshop (SLT), pp 301–307. IEEE
Bai Z, Wang J, Zhang X-L, Chen J (2022) End-to-end speaker verification via curriculum bipartite ranking weighted binary cross-entropy. IEEE/ACM Trans Audio Speech Lang Process 30:1330–1344
Wu Y, Guo C, Zhao J, Jin X, Xu J (2022) RSKNet-MTSP: Effective and portable deep architecture for speaker verification. Neurocomputing 511:259–272
Cai Y, Li L, Abel A, Zhu X, Wang D (2021) Deep normalization for speaker vectors. IEEE/ACM Trans Audio Speech Lang Process 29:733–744. https://doi.org/10.1109/TASLP.2020.3039573
Acknowledgements
This work was supported by the National Key R&D Program of China (No. 2020YFC2004103), the Qinghai Science and Technology Program (No. 2022-ZJ-T05), the Tianjin Science and Technology Program (No. 21JCZXJC00190), and the National Natural Science Foundation of China (No. 62176181).
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflicts of interest
The authors declare that they have no conflict of interest related to this work.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, R., Wei, J., Lu, X. et al. TMS: Temporal multi-scale in time-delay neural network for speaker verification. Appl Intell 53, 26497–26517 (2023). https://doi.org/10.1007/s10489-023-04953-2
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10489-023-04953-2