
TMS: Temporal multi-scale in time-delay neural network for speaker verification

Published in: Applied Intelligence

Abstract

The speaker encoder is an important front-end module that extracts discriminative speaker features for many speech applications requiring speaker information. Current speaker encoders aggregate multi-scale features from utterances using multi-branch network architectures. However, naively adding branches through fully convolutional operations cannot efficiently improve the encoder's ability to capture multi-scale features, because model parameters and computational complexity grow rapidly. Consequently, current architectures include only a few branches covering a limited number of temporal scales. To address this problem, this paper proposes an effective temporal multi-scale (TMS) model in which multi-scale branches can be designed efficiently in a speaker encoder with a negligible increase in computational cost. The TMS model is based on a time-delay neural network (TDNN) whose architecture is separated into a channel-modeling operator and a temporal multi-branch modeling operator. In the TMS model, adding temporal multi-scale elements to the temporal multi-branch operator increases the model's parameters only slightly, leaving more of the computational budget for branches with large temporal scales. After training, we further develop a systematic re-parameterization method that converts the multi-branch network topology into a single-path topology to increase inference speed. We conducted automatic speaker verification (ASV) experiments under in-domain (VoxCeleb) and out-of-domain (CNCeleb) conditions to evaluate the proposed TMS model. Experimental results show that the TMS-based model outperformed state-of-the-art ASV models (e.g., ECAPA-TDNN) and improved robustness. Moreover, the proposed model achieved a 29%–46% increase in inference speed compared to ECAPA-TDNN.
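The abstract's core idea — training with parallel temporal branches of different kernel sizes, then re-parameterizing them into a single-path convolution for fast inference — can be illustrated with a minimal sketch. This is a hypothetical RepVGG-style fusion in PyTorch, not the authors' code: the class `MultiBranchTemporal` and its branch configuration are assumptions for illustration only. Because convolution is linear, each smaller kernel can be zero-padded to the largest kernel size and the weights summed, so the fused single conv computes exactly the same function as the branch sum.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class MultiBranchTemporal(nn.Module):
    """Hypothetical temporal multi-branch block: parallel depth-wise Conv1d
    branches with different temporal scales (kernel sizes), summed."""

    def __init__(self, channels, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.kmax = max(kernel_sizes)
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        )

    def forward(self, x):
        # Training-time topology: one pass per branch, outputs summed.
        return sum(b(x) for b in self.branches)

    def reparameterize(self):
        """Fuse all branches into a single conv with the largest kernel."""
        ref = self.branches[0]
        fused = nn.Conv1d(ref.in_channels, ref.out_channels, self.kmax,
                          padding=self.kmax // 2, groups=ref.groups)
        w = torch.zeros_like(fused.weight)
        b = torch.zeros_like(fused.bias)
        for br in self.branches:
            k = br.kernel_size[0]
            pad = (self.kmax - k) // 2   # centre smaller kernels inside kmax
            w[:, :, pad:pad + k] += br.weight
            b += br.bias
        fused.weight.data.copy_(w)
        fused.bias.data.copy_(b)
        return fused

block = MultiBranchTemporal(channels=8)
x = torch.randn(1, 8, 50)                    # (batch, channels, frames)
y_multi = block(x)                           # multi-branch (training) path
y_single = block.reparameterize()(x)         # single-path (inference) conv
print(torch.allclose(y_multi, y_single, atol=1e-5))  # True: identical function
```

The practical point matches the abstract's claim: after fusion, inference runs one convolution instead of several, which is where the reported 29%–46% speed-up over a multi-branch baseline would come from.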


Data Availability

The datasets used and analyzed during the current study are available from the corresponding author upon reasonable request.

Notes

  1. The ECAPA-TDNN implementation is available at https://github.com/speechbrain/speechbrain/lobes/models/ECAPA_TDNN.py.


Acknowledgements

This work was supported by the National Key R&D Program of China (No. 2020YFC2004103), the Qinghai Science and Technology Program (No. 2022-ZJ-T05), the Tianjin Science and Technology Program (No. 21JCZXJC00190), and the National Natural Science Foundation of China (No. 62176181).

Author information

Corresponding author

Correspondence to Junhai Xu.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflict of interest related to this work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhang, R., Wei, J., Lu, X. et al. TMS: Temporal multi-scale in time-delay neural network for speaker verification. Appl Intell 53, 26497–26517 (2023). https://doi.org/10.1007/s10489-023-04953-2

