Abstract
Deep learning has improved the performance of speaker identification systems in recent years, but it has also introduced significant challenges. Data-driven modeling approaches based on DNNs typically rely on large-scale training data, yet in many deployment environments large amounts of user speech data are not obtainable. This work therefore proposes a new SincGAN speaker identification (SI) model that operates directly on the raw input waveform, allowing speaker identification with only a small number of training utterances. Unlike methods built on standard hand-crafted features, this method is truly end-to-end. A generator is used to reconstruct the input samples, thereby enlarging the training data, while a discriminator performs the SI classification task. A multi-scale SincNet layer based on three bespoke filter banks is also added to capture the low-level speech representation of three channels of the waveform, allowing the model to better capture critical narrowband speaker properties (e.g., pitch and formants). Experiments show that the method achieves better recognition results on the TIMIT and LIBRISPEECH datasets under the constraint of limited training data. Furthermore, the proposed model is competitive with existing models.
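The core idea of a SincNet-style front end, as described in the abstract, is that each convolution kernel is a parametrized band-pass filter (the difference of two windowed sinc low-pass filters) rather than a free-form kernel, and that a multi-scale variant runs several such filter banks with different kernel lengths in parallel. The following is a minimal NumPy sketch of that idea, not the authors' implementation: the kernel sizes (51, 101, 251), the number of filters per scale, and the cutoff-frequency grid are illustrative assumptions.

```python
import numpy as np

def sinc_filters(low_hz, band_hz, kernel_size=101, sample_rate=16000):
    """Band-pass kernels built as the difference of two windowed sinc
    low-pass filters -- the core parametrization behind SincNet."""
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    window = np.hamming(kernel_size)
    filters = []
    for f1, bw in zip(low_hz, band_hz):
        f2 = f1 + bw  # upper cutoff = lower cutoff + bandwidth
        lp1 = 2 * f1 / sample_rate * np.sinc(2 * f1 * n / sample_rate)
        lp2 = 2 * f2 / sample_rate * np.sinc(2 * f2 * n / sample_rate)
        filters.append((lp2 - lp1) * window)  # band-pass = LP(f2) - LP(f1)
    return np.stack(filters)

def multi_scale_frontend(wave, sample_rate=16000):
    """Apply three sinc filter banks with different kernel lengths,
    mimicking a multi-scale (three-channel) SincNet layer."""
    outputs = []
    for k in (51, 101, 251):                 # hypothetical scales
        low = np.linspace(30, 3000, 8)       # 8 filters per scale (illustrative)
        band = np.full(8, 400.0)             # fixed 400 Hz bandwidths (illustrative)
        bank = sinc_filters(low, band, kernel_size=k, sample_rate=sample_rate)
        feats = np.stack([np.convolve(wave, f, mode="same") for f in bank])
        outputs.append(feats)                # shape: (n_filters, len(wave))
    return outputs
```

In the trained model, the per-filter cutoff frequencies would be learnable parameters; here they are fixed on a grid purely to show the filter construction.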
Data availability
The "LIBRISPEECH", "TIMIT", and "VoxCeleb" data that support the findings of this study are available from OpenSLR (https://openslr.org/12/), the Linguistic Data Consortium (https://catalog.ldc.upenn.edu/LDC93S1), and http://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html, respectively.
Funding
This work is supported by State Grid Shandong electric power company science and technology support project, China (Grant No. 5206002000UD).
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wei, G., Zhang, Y., Min, H. et al. End-to-end speaker identification research based on multi-scale SincNet and CGAN. Neural Comput & Applic 35, 22209–22222 (2023). https://doi.org/10.1007/s00521-023-08906-1