Abstract
Deep learning has improved the performance of speaker identification systems in recent years, but it has also introduced significant challenges. Data-driven modeling approaches based on DNNs typically rely on large-scale training data, yet in many deployment environments large amounts of user speech data are not obtainable. This work therefore proposes a new SincGAN speaker identification (SI) model that operates directly on the raw input waveform, allowing speaker identification with only a small number of training utterances. Unlike methods built on standard hand-crafted features, this method is truly end-to-end. A generator is used to reconstruct the input samples, thereby enlarging the training data, while a discriminator performs the SI classification task. A multi-scale SincNet layer based on three bespoke filter banks is also added to capture the low-level speech representation of three channels of the waveform, allowing the model to better capture critical narrowband speaker properties (e.g., pitch and formants). Experiments show that the method achieves better recognition results on the TIMIT and LIBRISPEECH datasets under the constraint of limited training data. Furthermore, the proposed model is competitive with existing models.
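The core idea of a SincNet-style front end, as described in the abstract, is that each convolution kernel is a parametrized band-pass filter (the difference of two windowed sinc low-pass filters) rather than a free-form kernel, and that a multi-scale variant runs several such filter banks with different kernel lengths in parallel. The following is a minimal NumPy sketch of that idea, not the authors' implementation: the kernel sizes (51, 101, 251), the number of filters per scale, and the cutoff-frequency grid are illustrative assumptions.

```python
import numpy as np

def sinc_filters(low_hz, band_hz, kernel_size=101, sample_rate=16000):
    """Band-pass kernels built as the difference of two windowed sinc
    low-pass filters -- the core parametrization behind SincNet."""
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    window = np.hamming(kernel_size)
    filters = []
    for f1, bw in zip(low_hz, band_hz):
        f2 = f1 + bw  # upper cutoff = lower cutoff + bandwidth
        lp1 = 2 * f1 / sample_rate * np.sinc(2 * f1 * n / sample_rate)
        lp2 = 2 * f2 / sample_rate * np.sinc(2 * f2 * n / sample_rate)
        filters.append((lp2 - lp1) * window)  # band-pass = LP(f2) - LP(f1)
    return np.stack(filters)

def multi_scale_frontend(wave, sample_rate=16000):
    """Apply three sinc filter banks with different kernel lengths,
    mimicking a multi-scale (three-channel) SincNet layer."""
    outputs = []
    for k in (51, 101, 251):                 # hypothetical scales
        low = np.linspace(30, 3000, 8)       # 8 filters per scale (illustrative)
        band = np.full(8, 400.0)             # fixed 400 Hz bandwidths (illustrative)
        bank = sinc_filters(low, band, kernel_size=k, sample_rate=sample_rate)
        feats = np.stack([np.convolve(wave, f, mode="same") for f in bank])
        outputs.append(feats)                # shape: (n_filters, len(wave))
    return outputs
```

In the trained model, the per-filter cutoff frequencies would be learnable parameters; here they are fixed on a grid purely to show the filter construction.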
Data availability
The "LIBRISPEECH", "TIMIT", and "VoxCeleb" data that support the findings of this study are available from OpenSLR (https://openslr.org/12/), the Linguistic Data Consortium (https://catalog.ldc.upenn.edu/LDC93S1), and http://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html, respectively.
Funding
This work is supported by State Grid Shandong electric power company science and technology support project, China (Grant No. 5206002000UD).
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wei, G., Zhang, Y., Min, H. et al. End-to-end speaker identification research based on multi-scale SincNet and CGAN. Neural Comput & Applic 35, 22209–22222 (2023). https://doi.org/10.1007/s00521-023-08906-1