
End-to-end speaker identification research based on multi-scale SincNet and CGAN

  • Review
  • Published in: Neural Computing and Applications

Abstract

Deep learning has improved the performance of speaker identification systems in recent years, but it also presents significant challenges. Data-driven modeling approaches based on DNNs typically rely on large-scale training data, yet in many application environments large amounts of user speech are not obtainable. This work therefore proposes SincGAN, a new speaker identification (SI) model that operates directly on the raw input waveform and can identify speakers from only a small number of training utterances. Unlike methods built on standard hand-crafted features, this approach is truly end-to-end. A generator is used to reconstruct the input samples, enlarging the training data, while a discriminator performs the SI classification task. A multi-scale SincNet layer built from three bespoke filter banks is also added to extract low-level speech representations from the waveform over three channels, allowing the model to better capture critical narrowband speaker characteristics such as pitch and formants (resonance peaks). Experiments show that the method achieves better recognition results on the TIMIT and LIBRISPEECH datasets under the constraint of limited training data, and that the proposed model compares favorably with existing models.
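To make the front end concrete, the following is a minimal PyTorch sketch (not the authors' implementation) of a multi-scale SincNet layer: a bank of learnable sinc band-pass filters, replicated at three kernel lengths and concatenated along the channel axis before being passed to a downstream classifier such as the CGAN discriminator. The channel counts, kernel lengths, and mel-like initialisation grid are illustrative assumptions, not the paper's exact filter-bank designs.

```python
# A minimal sketch of a multi-scale SincNet front end; all hyperparameters
# (channel counts, kernel lengths, cutoff initialisation) are assumptions.
import torch
import torch.nn as nn


class SincConv(nn.Module):
    """One bank of learnable sinc band-pass filters applied to a raw waveform."""

    def __init__(self, out_channels=80, kernel_size=251, sample_rate=16000):
        super().__init__()
        self.kernel_size = kernel_size        # odd, so the filter is symmetric
        # Initialise lower cutoffs on a coarse grid and bandwidths to 100 Hz
        # (a common SincNet-style choice, assumed here).
        self.low_hz = nn.Parameter(torch.linspace(30, sample_rate / 2 - 200, out_channels))
        self.band_hz = nn.Parameter(torch.full((out_channels,), 100.0))
        n = torch.arange(-(kernel_size // 2), kernel_size // 2 + 1)
        self.register_buffer("n", n / sample_rate)                 # time axis in seconds
        self.register_buffer("window", torch.hamming_window(kernel_size))

    def forward(self, x):                     # x: (batch, 1, time)
        f1 = torch.abs(self.low_hz)           # lower cutoff of each band
        f2 = f1 + torch.abs(self.band_hz)     # upper cutoff of each band
        t = self.n.unsqueeze(0)               # (1, kernel_size)
        # Band-pass filter = difference of two windowed sinc low-pass filters.
        lp1 = 2 * f1.unsqueeze(1) * torch.sinc(2 * f1.unsqueeze(1) * t)
        lp2 = 2 * f2.unsqueeze(1) * torch.sinc(2 * f2.unsqueeze(1) * t)
        filters = ((lp2 - lp1) * self.window).unsqueeze(1)         # (out, 1, kernel)
        return nn.functional.conv1d(x, filters, padding=self.kernel_size // 2)


class MultiScaleSinc(nn.Module):
    """Three parallel sinc filter banks at different time scales, concatenated."""

    def __init__(self, sample_rate=16000):
        super().__init__()
        self.banks = nn.ModuleList(
            SincConv(out_channels=80, kernel_size=k, sample_rate=sample_rate)
            for k in (251, 501, 1001)          # short / medium / long supports
        )

    def forward(self, x):
        return torch.cat([bank(x) for bank in self.banks], dim=1)


waveform = torch.randn(4, 1, 16000)            # a 1-second batch at 16 kHz
features = MultiScaleSinc()(waveform)
print(features.shape)                          # torch.Size([4, 240, 16000])
```

Because each filter is described by just two learnable scalars (a low cutoff and a bandwidth), such a front end has far fewer parameters than a standard first convolution layer, which is one reason raw-waveform models of this kind remain trainable from few utterances.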


Data availability

The LIBRISPEECH, TIMIT, and VoxCeleb data that support the findings of this study are available, respectively, from OpenSLR (https://openslr.org/12/), the Linguistic Data Consortium (https://catalog.ldc.upenn.edu/LDC93S1), and the VoxCeleb project page (http://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html).
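Of the three corpora, LibriSpeech is freely downloadable from OpenSLR, while TIMIT and VoxCeleb are distributed under their own licenses via the links above. As one illustrative (not prescribed) way to obtain LibriSpeech, torchaudio ships a dataset wrapper for the OpenSLR release; the subset name and root directory below are arbitrary choices:

```python
# Sketch: fetch the LibriSpeech "train-clean-100" subset (several GB)
# via torchaudio's built-in wrapper for the OpenSLR release.
import torchaudio

dataset = torchaudio.datasets.LIBRISPEECH(
    "./data",               # local root directory (assumption)
    url="train-clean-100",  # one of the published subsets
    download=True,
)
# Each item: (waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id)
waveform, sample_rate, _, speaker_id, _, _ = dataset[0]
print(waveform.shape, sample_rate, speaker_id)
```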


Funding

This work is supported by the Science and Technology Support Project of State Grid Shandong Electric Power Company, China (Grant No. 5206002000UD).

Author information


Corresponding author

Correspondence to Guangcun Wei.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wei, G., Zhang, Y., Min, H. et al. End-to-end speaker identification research based on multi-scale SincNet and CGAN. Neural Comput & Applic 35, 22209–22222 (2023). https://doi.org/10.1007/s00521-023-08906-1
