Exploration of multi-source discriminative acoustic feature for speaker recognition with short-duration audio signal

Published in Multimedia Tools and Applications

Abstract

This work aims to compensate for the feature sparsity and the insufficiently discriminative acoustic features that limit existing short-duration voiceprint recognition. To address these issues, we propose Bark-scaled Gaussian and linear filter bank superposition cepstral coefficients (BGLCC), a multi-dimensional central difference (MDCD) acoustic feature, and a Fisher feature fusion (FFF) of multi-source features. First, rich low-frequency information is extracted by exploiting the high distribution density of the Bark-scaled Gaussian filter bank in the low-frequency domain, while additional high-frequency information is extracted by a linear filter bank uniformly distributed in the high-frequency domain. In addition, the multi-dimensional central difference method captures the dynamic characteristics of voiceprints in the BGLCC domain, improving the performance of short-utterance speaker recognition. Finally, the Fisher feature fusion method further enhances speaker-specific information and suppresses information common across speakers. Extensive experiments are conducted on short-duration text-independent speaker verification datasets generated from the VoxCeleb corpus, which contains speech samples of diverse lengths. The results demonstrate that the proposed method outperforms existing acoustic feature extraction approaches by at least 15% on the test set. Ablation experiments further show that each proposed component yields a substantial improvement over prior methods.
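The abstract names three components: a superposed Bark-scaled Gaussian and linear filter bank for cepstral extraction (BGLCC), multi-dimensional central differences (MDCD) for dynamic features, and Fisher feature fusion (FFF). The paper's exact settings are not reproduced in this abstract, so the Python sketch below is only an illustration under assumed parameters: the 4 kHz band split, the 24 Bark-gridded Gaussian filters and 16 linear triangular filters, Traunmüller's Bark approximation, the 1–3 frame difference widths, and the Fisher-ratio ranking rule are all illustrative choices, not the authors' specification.

```python
# Illustrative sketch of the three components described in the abstract.
# All parameter values (band split, filter counts, widths) are assumptions.
import numpy as np
from scipy.fftpack import dct


def hz_to_bark(f):
    # Traunmueller's Bark-scale approximation (one common variant).
    return 26.81 * f / (1960.0 + f) - 0.53


def bark_to_hz(b):
    return 1960.0 * (b + 0.53) / (26.28 - b)


def bglcc_filterbank(sr=16000, n_fft=512, n_low=24, n_high=16, split_hz=4000.0):
    """Superposed filter bank: Gaussian filters centred on a Bark grid below
    split_hz (dense at low frequencies), plus triangular filters with
    linearly spaced centres from split_hz up to Nyquist."""
    n_bins = n_fft // 2 + 1
    freqs = np.linspace(0.0, sr / 2.0, n_bins)       # FFT bin centre frequencies
    fb = np.zeros((n_low + n_high, n_bins))

    # Low band: centres equally spaced on the Bark scale, mapped back to Hz.
    centres = bark_to_hz(np.linspace(hz_to_bark(50.0), hz_to_bark(split_hz), n_low))
    widths = np.gradient(centres)                     # narrower where centres crowd
    for i, (c, w) in enumerate(zip(centres, widths)):
        fb[i] = np.exp(-0.5 * ((freqs - c) / w) ** 2)

    # High band: uniformly spaced triangular (linear) filters.
    edges = np.linspace(split_hz, sr / 2.0, n_high + 2)
    for j in range(n_high):
        lo, ctr, hi = edges[j], edges[j + 1], edges[j + 2]
        tri = np.minimum((freqs - lo) / (ctr - lo), (hi - freqs) / (hi - ctr))
        fb[n_low + j] = np.clip(tri, 0.0, None)
    return fb


def bglcc(power_spec, n_ceps=20):
    """Cepstra: filter-bank log energies followed by a DCT.
    power_spec has shape (n_frames, n_fft // 2 + 1)."""
    fb = bglcc_filterbank(n_fft=2 * (power_spec.shape[1] - 1))
    return dct(np.log(power_spec @ fb.T + 1e-10), axis=1, norm='ortho')[:, :n_ceps]


def mdcd(feats, widths=(1, 2, 3)):
    """Central differences at several widths, stacked with the static
    features, as a stand-in for the paper's MDCD dynamic features.
    (np.roll wraps at the edges; real code would pad instead.)"""
    deltas = [(np.roll(feats, -w, axis=0) - np.roll(feats, w, axis=0)) / (2.0 * w)
              for w in widths]
    return np.concatenate([feats] + deltas, axis=1)


def fisher_select(feats, speaker_ids, keep=40):
    """One plausible reading of Fisher feature fusion: rank each dimension of
    the concatenated multi-source features by its Fisher ratio
    (between-speaker / within-speaker variance) and keep the most
    speaker-discriminative dimensions."""
    mu = feats.mean(axis=0)
    between = np.zeros(feats.shape[1])
    within = np.zeros(feats.shape[1])
    for s in np.unique(speaker_ids):
        fs = feats[speaker_ids == s]
        between += len(fs) * (fs.mean(axis=0) - mu) ** 2
        within += ((fs - fs.mean(axis=0)) ** 2).sum(axis=0)
    order = np.argsort(between / (within + 1e-10))[::-1]
    return feats[:, order[:keep]]
```

In the paper the MDCD deltas are computed in the BGLCC domain and the multi-source features are then fused with the Fisher criterion before scoring; this sketch stops at feature extraction and makes no claim about the authors' exact pipeline.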


Data availability

All data generated or analyzed during this study are included in this published article [15].

References

  1. Atal BS (1974) Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification. J Acoust Soc Am 55(6):1304–1312. https://doi.org/10.1121/1.1914702

  2. Campbell JP (1997) Speaker recognition: a tutorial. Proc IEEE 85(9):1437–1462. https://doi.org/10.1109/5.628714

  3. Chowdhury A, Ross A (2019) Fusing MFCC and LPC features using 1D triplet CNN for speaker recognition in severely degraded audio signals. IEEE Trans Inf Forensic Secur 15:1616–1629. https://doi.org/10.1109/TIFS.2019.2941773

  4. Chung JS, Nagrani A, Zisserman A (2018) VoxCeleb2: deep speaker recognition. In: INTERSPEECH, pp 1086–1090. https://doi.org/10.21437/Interspeech.2018-1929

  5. Das RK, Mahadeva Prasanna SR (2016) Exploring different attributes of source information for speaker verification with limited test data. J Acoust Soc Am 140(1):184–190. https://doi.org/10.1121/1.4954653

  6. Dehak N, Dehak R, Glass JR, Reynolds DA, Kenny P (2010) Cosine similarity scoring without score normalization techniques. In: Odyssey, p 15. https://www.iscaspeech.org/archive_open/archive_papers/odyssey_2010/papers/od10_015.pdf

  7. Greenberg CS, Stanford VM, Martin AF, Yadagiri M, Doddington GR, Godfrey JJ, Hernandez-Cordero J (2013) The 2012 NIST speaker recognition evaluation. In: INTERSPEECH, pp 1971–1975. https://doi.org/10.21437/Interspeech.2013-469

  8. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: IEEE conference on computer vision and pattern recognition, pp 770–778. https://doi.org/10.1109/CVPR.2016.90

  9. Herrera-Camacho A, Zúñiga-Sainos A, Sierra-Martínez G, Trangol-Curipe J, Mota-Montoya M, Jarquín-Casas A (2019) Design and testing of a corpus for forensic speaker recognition using MFCC, GMM and MLE. In: International Conference on Video, Signal and Image Processing, pp 105–110. https://doi.org/10.1145/3369318.3369330

  10. Huang L, Pun CM (2020) Audio replay spoof attack detection by joint segment-based linear filter bank feature extraction and attention-enhanced DenseNet-BiLSTM network. IEEE/ACM Trans Audio, Speech, Lang Process 28:1813–1825. https://doi.org/10.1109/TASLP.2020.2998870

  11. Kenny P, Boulianne G, Dumouchel P (2005) Eigenvoice modeling with sparse training data. IEEE Trans Speech Audio Process 13(3):345–354. https://doi.org/10.1109/TSA.2004.840940

  12. Kinnunen T, Li H (2010) An overview of text-independent speaker recognition: from features to supervectors. Speech Comm 52(1):12–40. https://doi.org/10.1016/j.specom.2009.08.009

  13. Li C, Ma X, Jiang B, Li X, Zhang X, Liu X, Cao Y, Kannan A, Zhu Z (2017) Deep speaker: an end-to-end neural speaker embedding system. arXiv preprint arXiv:1705.02304. https://arxiv.org/abs/1705.02304

  14. Liu Z, Wu Z, Li T, Li J, Shen C (2018) GMM and CNN hybrid method for short utterance speaker recognition. IEEE Trans Indust Inf 14(7):3244–3252. https://doi.org/10.1109/TII.2018.2799928

  15. Nagrani A, Chung JS, Zisserman A (2017) VoxCeleb: a large-scale speaker identification dataset. In: INTERSPEECH, pp 2616–2620. https://doi.org/10.21437/Interspeech.2017-950

  16. Nosratighods M, Ambikairajah E, Epps J, Carey MJ (2010) A segment selection technique for speaker verification. Speech Comm 52(9):753–761. https://doi.org/10.1016/j.specom.2010.04.007

  17. Omar MK, Pelecanos JW (2010) Training universal background models for speaker recognition. In: Odyssey, p 10. https://www.iscaspeech.org/archive_open/archive_papers/odyssey_2010/papers/od10_010.pdf

  18. Paseddula C, Gangashetty SV (2018) DNN based acoustic scene classification using score fusion of MFCC and inverse MFCC. In: International Conference on Industrial and Information Systems (ICIIS), pp 18–21. https://doi.org/10.1109/ICIINFS.2018.8721379

  19. Paszke A, Gross S, Chintala S et al (2017) Automatic differentiation in PyTorch. In: NIPS, pp 1–4. https://openreview.net/pdf?id=BJJsrmfCZ

  20. Schroff F, Kalenichenko D, Philbin J (2015) FaceNet: a unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823. https://doi.org/10.1109/CVPR.2015.7298682

  21. Todisco M, Delgado H, Evans N (2017) Constant Q cepstral coefficients: a spoofing countermeasure for automatic speaker verification. Comput Speech Lang 45:516–535. https://doi.org/10.1016/j.csl.2017.01.001

  22. Vaswani A, Shazeer N, Parmar N et al (2017) Attention is all you need. In: Advances in neural information processing systems, pp 5998–6008. https://dl.acm.org/doi/10.5555/3295222.3295349

  23. Vogt R, Sridharan S, Mason M (2009) Making confident speaker verification decisions with minimal speech. IEEE Trans Audio Speech Lang Process 18(6):1182–1192. https://doi.org/10.1109/TASL.2009.2031505

  24. Wu Z, Yu Z, Yuan J, Zhang J (2016) A twice face recognition algorithm. Soft Comput 20:1007–1019. https://doi.org/10.1007/s00500-014-1561-9

  25. Yang H, Deng Y, Zhao HA (2019) A comparison of MFCC and LPCC with deep learning for speaker recognition. In: International Conference on Big Data and Computing, pp 160–164. https://doi.org/10.1145/3335484.3335528

  26. Zhang C, Koishida K, Hansen JH (2018) Text-independent speaker verification based on triplet convolutional neural network embeddings. IEEE/ACM Trans Audio, Speech, Lang Process 26(9):1633–1644. https://doi.org/10.1109/TASLP.2018.2831456

  27. Zinchenko K, Wu CY, Song KT (2016) A study on speech recognition control for a surgical robot. IEEE Trans Indust Inf 13(2):607–615. https://doi.org/10.1109/TII.2016.2625818

Acknowledgements

This work was supported in part by the NSFC [Grant Nos. 62176194 and 62101393], the Major Project of IoV [Grant No. 2020AAA001], the Sanya Science and Education Innovation Park of Wuhan University of Technology [Grant No. 2021KF0031], the CSTC [Grant No. cstc2021jcyj-msxmX1148], and the Open Project of the Wuhan University of Technology Chongqing Research Institute [ZL2021-6].

Author information

Corresponding author

Correspondence to Yunfei Zi.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zi, Y., Xiong, S. Exploration of multi-source discriminative acoustic feature for speaker recognition with short-duration audio signal. Multimed Tools Appl 82, 47537–47557 (2023). https://doi.org/10.1007/s11042-023-16378-7
