
Text-Independent Speaker Verification Employing CNN-LSTM-TDNN Hybrid Networks

  • Conference paper

Speech and Computer (SPECOM 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12997)

Abstract

Time Delay Neural Network (TDNN)-based speaker embedding extraction has become the dominant approach for text-independent speaker verification, and several single and hybrid deep learning architectures have been proposed to improve its performance. In this paper, we propose another hybrid configuration that employs a Convolutional Neural Network (CNN), a TDNN, and a Long Short-Term Memory (LSTM) network for training and extracting speaker-discriminant utterance-level representations. The proposed framework also uses SpecAugment for on-the-fly data augmentation and multi-level statistics pooling across the CNN, LSTM, and TDNN layers to aggregate frame-level information into utterance-level speaker embeddings via fully connected layers. For performance evaluation, speaker verification experiments are carried out on the NIST SRE 2016, VoxCeleb, and Short-duration Speaker Verification (SdSV) Challenge 2021 corpora. The chosen evaluation metrics are the equal error rate (EER), minimum detection cost function (minDCF), and actual detection cost function (actDCF). Experimental results show that the proposed hybrid approach yields improvements over the original TDNN, the TDNN-LSTM hybrid architecture, and several other previously proposed approaches.
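The abstract names three ingredients: a CNN-LSTM-TDNN frame-level encoder, multi-level statistics pooling over the CNN, TDNN, and LSTM outputs, and on-the-fly SpecAugment masking. Below is a minimal PyTorch sketch of how these pieces could fit together. The layer counts, channel widths, kernel and dilation settings, embedding dimension, and the names HybridSpeakerEmbedder, stats_pool, and spec_augment are illustrative assumptions, not the configuration actually used in the paper.

```python
# Minimal sketch: CNN-LSTM-TDNN embedding extractor with multi-level
# statistics pooling and on-the-fly SpecAugment-style masking.
# All sizes below are assumptions for illustration only.
import torch
import torch.nn as nn


def stats_pool(x):
    """Concatenate mean and standard deviation over the time axis.
    x: (batch, channels, time) -> (batch, 2 * channels)"""
    return torch.cat([x.mean(dim=-1), x.std(dim=-1)], dim=-1)


class HybridSpeakerEmbedder(nn.Module):
    def __init__(self, n_mels=40, emb_dim=512, n_speakers=1000):
        super().__init__()
        # 2-D CNN front end over the (frequency, time) plane
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        cnn_out = 32 * n_mels  # CNN channels flattened with the frequency axis
        # Frame-level TDNN block: dilated 1-D convolutions over time
        self.tdnn = nn.Sequential(
            nn.Conv1d(cnn_out, 512, kernel_size=5, dilation=1), nn.ReLU(), nn.BatchNorm1d(512),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(), nn.BatchNorm1d(512),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(), nn.BatchNorm1d(512),
        )
        # Recurrent branch for longer-range temporal context
        self.lstm = nn.LSTM(512, 256, batch_first=True, bidirectional=True)
        # Multi-level pooling: statistics gathered from CNN, TDNN and LSTM outputs
        pooled_dim = 2 * cnn_out + 2 * 512 + 2 * 512
        self.fc_emb = nn.Linear(pooled_dim, emb_dim)   # utterance-level embedding layer
        self.fc_out = nn.Linear(emb_dim, n_speakers)   # speaker classifier used for training

    def forward(self, feats):
        # feats: (batch, n_mels, time) log-mel filterbank features
        c = self.cnn(feats.unsqueeze(1))               # (B, 32, n_mels, T)
        c = c.flatten(1, 2)                            # (B, 32 * n_mels, T)
        t = self.tdnn(c)                               # (B, 512, T')
        l, _ = self.lstm(t.transpose(1, 2))            # (B, T', 512)
        l = l.transpose(1, 2)                          # (B, 512, T')
        # Aggregate frame-level information from all three levels
        pooled = torch.cat([stats_pool(c), stats_pool(t), stats_pool(l)], dim=-1)
        emb = self.fc_emb(pooled)                      # utterance-level speaker embedding
        return emb, self.fc_out(torch.relu(emb))


def spec_augment(feats, max_f=8, max_t=20):
    """On-the-fly SpecAugment-style masking (one frequency and one time mask)."""
    feats = feats.clone()
    B, F, T = feats.shape
    for b in range(B):
        f = torch.randint(0, max_f + 1, (1,)).item()
        f0 = torch.randint(0, max(1, F - f), (1,)).item()
        feats[b, f0:f0 + f, :] = 0.0
        t = torch.randint(0, max_t + 1, (1,)).item()
        t0 = torch.randint(0, max(1, T - t), (1,)).item()
        feats[b, :, t0:t0 + t] = 0.0
    return feats


if __name__ == "__main__":
    model = HybridSpeakerEmbedder()
    x = spec_augment(torch.randn(4, 40, 300))          # 4 utterances, 300 frames each
    emb, logits = model(x)
    print(emb.shape, logits.shape)                      # (4, 512) (4, 1000)
```

In this sketch each pooling branch contributes a mean-and-standard-deviation vector, so the fully connected embedding layer sees information from all three stages at once; after training, the classifier head would be discarded and the embedding output used for verification scoring.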



Author information

Corresponding author

Correspondence to Jahangir Alam.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Alam, J., Fathan, A., Kang, W.H. (2021). Text-Independent Speaker Verification Employing CNN-LSTM-TDNN Hybrid Networks. In: Karpov, A., Potapova, R. (eds.) Speech and Computer. SPECOM 2021. Lecture Notes in Computer Science (LNAI), vol. 12997. Springer, Cham. https://doi.org/10.1007/978-3-030-87802-3_1

  • DOI: https://doi.org/10.1007/978-3-030-87802-3_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-87801-6

  • Online ISBN: 978-3-030-87802-3

  • eBook Packages: Computer Science, Computer Science (R0)
