Abstract
This work addresses two weaknesses of existing short-duration voiceprint recognition: feature sparsity and insufficiently discriminative acoustic features. We propose Bark-scaled Gaussian and linear filter bank superposition cepstral coefficients (BGLCC), a multi-dimensional central difference (MDCD) acoustic feature, and a Fisher feature fusion (FFF) of multi-source features. First, rich low-frequency information is extracted by exploiting the high distribution density of the Bark-scaled Gaussian filter bank in the low-frequency domain, while additional high-frequency information is extracted by the linear filter bank, which is uniformly distributed in the high-frequency domain. In addition, the multi-dimensional central difference method captures the dynamic features of voiceprints in the relative BGLCC domain more effectively, which improves the performance of short-utterance speaker recognition. Finally, the Fisher feature fusion method further enhances speaker-specific information and suppresses information common to all speakers. Extensive experiments are conducted on short-duration text-independent speaker verification datasets generated from the VoxCeleb corpus, which contain speech samples of diverse lengths. The results demonstrate that the proposed method outperforms the existing acoustic feature extraction approach by at least 15% on the test set. Ablation experiments further show that the proposed approaches achieve substantial improvement over prior methods.
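To make the filter bank idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the Traunmüller approximation of the Bark scale, Gaussian filters with centres spaced uniformly in Bark, triangular filters spaced uniformly in Hz over the full band, "superposition" read as stacking the two banks before the log and DCT steps, and a first-order central difference as a stand-in for MDCD. All function names, filter counts, and widths are hypothetical choices for illustration.

```python
import numpy as np

def hz_to_bark(f):
    # Traunmüller approximation (assumption: the abstract does not state the formula used).
    return 26.81 * f / (1960.0 + f) - 0.53

def bark_to_hz(z):
    # Algebraic inverse of hz_to_bark.
    return 1960.0 * (z + 0.53) / (26.28 - z)

def bark_gaussian_bank(n_filters, n_fft, sr):
    # Gaussian filters whose centres are uniform in Bark, hence dense at low Hz.
    freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    z_edges = np.linspace(hz_to_bark(0.0), hz_to_bark(sr / 2), n_filters + 2)
    hz_edges = bark_to_hz(z_edges)
    centers = hz_edges[1:-1]
    sigma = (hz_edges[2:] - hz_edges[:-2]) / 4.0  # hypothetical width rule
    return np.exp(-0.5 * ((freqs[None, :] - centers[:, None]) / sigma[:, None]) ** 2)

def linear_triangular_bank(n_filters, n_fft, sr):
    # Triangular filters whose edges are uniform in Hz.
    freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    edges = np.linspace(0.0, sr / 2, n_filters + 2)
    lo, mid, hi = edges[:-2, None], edges[1:-1, None], edges[2:, None]
    rising = (freqs[None, :] - lo) / (mid - lo)
    falling = (hi - freqs[None, :]) / (hi - mid)
    return np.clip(np.minimum(rising, falling), 0.0, None)

def dct_matrix(n_out, n_in):
    # Unnormalised DCT-II basis, shape (n_out, n_in).
    n = np.arange(n_in)
    k = np.arange(n_out)[:, None]
    return np.cos(np.pi * k * (2 * n + 1) / (2 * n_in))

def superposed_cepstra(power_spec, sr, n_fft, n_gauss=20, n_lin=20, n_ceps=13):
    # Stack both banks, take log filter energies, then decorrelate with a DCT.
    bank = np.vstack([bark_gaussian_bank(n_gauss, n_fft, sr),
                      linear_triangular_bank(n_lin, n_fft, sr)])
    log_energy = np.log(power_spec @ bank.T + 1e-10)
    return log_energy @ dct_matrix(n_ceps, bank.shape[0]).T

def central_difference(feats):
    # First-order central difference along time, with edge frames repeated.
    padded = np.pad(feats, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0

# Synthetic frames stand in for windowed speech (no real audio is used here).
rng = np.random.default_rng(0)
sr, n_fft = 16000, 512
frames = rng.standard_normal((50, n_fft)) * np.hanning(n_fft)
power_spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2
ceps = superposed_cepstra(power_spec, sr, n_fft)
features = np.hstack([ceps, central_difference(ceps)])
```

Stacking keeps the low-frequency resolution of the Bark-spaced Gaussian filters alongside the uniform coverage of the linear filters; the Fisher feature fusion step is not sketched because the abstract gives no detail on how it is computed.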



Data availability
All data generated or analyzed during this study are included in this published article [15].
References
Atal BS (1974) Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification. J Acoust Soc Am 55(6):1304–1312. https://doi.org/10.1121/1.1914702
Campbell JP (1997) Speaker recognition: a tutorial. Proc IEEE 85(9):1437–1462. https://doi.org/10.1109/5.628714
Chowdhury A, Ross A (2019) Fusing MFCC and LPC features using 1D triplet CNN for speaker recognition in severely degraded audio signals. IEEE Trans Inf Forensic Secur 15:1616–1629. https://doi.org/10.1109/TIFS.2019.2941773
Chung JS, Nagrani A, Zisserman A (2018) VoxCeleb2: deep speaker recognition. In: INTERSPEECH, pp 1086–1090. https://doi.org/10.21437/Interspeech.2018-1929
Das RK, Mahadeva Prasanna SR (2016) Exploring different attributes of source information for speaker verification with limited test data. J Acoust Soc Am 140(1):184–190. https://doi.org/10.1121/1.4954653
Dehak N, Dehak R, Glass JR, Reynolds DA, Kenny P (2010) Cosine similarity scoring without score normalization techniques. In: Odyssey, p 15. https://www.iscaspeech.org/archive_open/archive_papers/odyssey_2010/papers/od10_015.pdf
Greenberg CS, Stanford VM, Martin AF, Yadagiri M, Doddington GR, Godfrey JJ, Hernandez-Cordero J (2013) The 2012 NIST speaker recognition evaluation. In: INTERSPEECH, pp 1971–1975. https://doi.org/10.21437/Interspeech.2013-469
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: IEEE conference on computer vision and pattern recognition, pp 770–778. https://doi.org/10.1109/CVPR.2016.90
Herrera-Camacho A, Zúñiga-Sainos A, Sierra-Martínez G, Trangol-Curipe J, Mota-Montoya M, Jarquín-Casas A (2019) Design and testing of a corpus for forensic speaker recognition using MFCC, GMM and MLE. In: International Conference on Video, Signal and Image Processing, pp 105–110. https://doi.org/10.1145/3369318.3369330
Huang L, Pun CM (2020) Audio replay spoof attack detection by joint segment-based linear filter bank feature extraction and attention-enhanced DenseNet-BiLSTM network. IEEE/ACM Trans Audio, Speech, Lang Process 28:1813–1825. https://doi.org/10.1109/TASLP.2020.2998870
Kenny P, Boulianne G, Dumouchel P (2005) Eigenvoice modeling with sparse training data. IEEE Trans Speech Audio Process 13(3):345–354. https://doi.org/10.1109/TSA.2004.840940
Kinnunen T, Li H (2010) An overview of text-independent speaker recognition: from features to supervectors. Speech Comm 52(1):12–40. https://doi.org/10.1016/j.specom.2009.08.009
Li C, Ma X, Jiang B, Li X, Zhang X, Liu X, Cao Y, Kannan A, Zhu Z (2017) Deep speaker: an end-to-end neural speaker embedding system. arXiv preprint arXiv:1705.02304. https://arxiv.org/abs/1705.02304
Liu Z, Wu Z, Li T, Li J, Shen C (2018) GMM and CNN hybrid method for short utterance speaker recognition. IEEE Trans Indust Inf 14(7):3244–3252. https://doi.org/10.1109/TII.2018.2799928
Nagrani A, Chung JS, Zisserman A (2017) VoxCeleb: a large-scale speaker identification dataset. In: INTERSPEECH, pp 2616–2620. https://doi.org/10.21437/Interspeech.2017-950
Nosratighods M, Ambikairajah E, Epps J, Carey MJ (2010) A segment selection technique for speaker verification. Speech Comm 52(9):753–761. https://doi.org/10.1016/j.specom.2010.04.007
Omar MK, Pelecanos JW (2010) Training universal background models for speaker recognition. In: Odyssey, p 10. https://www.iscaspeech.org/archive_open/archive_papers/odyssey_2010/papers/od10_010.pdf
Paseddula C, Gangashetty SV (2018) DNN based acoustic scene classification using score fusion of MFCC and inverse MFCC. In: International Conference on Industrial and Information Systems (ICIIS), pp 18–21. https://doi.org/10.1109/ICIINFS.2018.8721379
Paszke A, Gross S, Chintala S et al (2017) Automatic differentiation in PyTorch. In: NIPS, pp 1–4. https://openreview.net/pdf?id=BJJsrmfCZ
Schroff F, Kalenichenko D, Philbin J (2015) FaceNet: a unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823. https://doi.org/10.1109/CVPR.2015.7298682
Todisco M, Delgado H, Evans N (2017) Constant Q cepstral coefficients: a spoofing countermeasure for automatic speaker verification. Comput Speech Lang 45:516–535. https://doi.org/10.1016/j.csl.2017.01.001
Vaswani A, Shazeer N, Parmar N et al (2017) Attention is all you need. In: Advances in neural information processing systems, pp 5998–6008. https://dl.acm.org/doi/10.5555/3295222.3295349
Vogt R, Sridharan S, Mason M (2009) Making confident speaker verification decisions with minimal speech. IEEE Trans Audio Speech Lang Process 18(6):1182–1192. https://doi.org/10.1109/TASL.2009.2031505
Wu Z, Yu Z, Yuan J, Zhang J (2016) A twice face recognition algorithm. Soft Comput 20:1007–1019. https://doi.org/10.1007/s00500-014-1561-9
Yang H, Deng Y, Zhao HA (2019) A comparison of MFCC and LPCC with deep learning for speaker recognition. In: International Conference on Big Data and Computing, pp 160–164. https://doi.org/10.1145/3335484.3335528
Zhang C, Koishida K, Hansen JH (2018) Text-independent speaker verification based on triplet convolutional neural network embeddings. IEEE/ACM Trans Audio, Speech, Lang Process 26(9):1633–1644. https://doi.org/10.1109/TASLP.2018.2831456
Zinchenko K, Wu CY, Song KT (2016) A study on speech recognition control for a surgical robot. IEEE Trans Indust Inf 13(2):607–615. https://doi.org/10.1109/TII.2016.2625818
Acknowledgements
This work was supported in part by NSFC [Grant No. 62176194, Grant No. 62101393], the Major Project of IoV [Grant No. 2020AAA001], Sanya Science and Education Innovation Park of Wuhan University of Technology [Grant No. 2021KF0031], CSTC [Grant No. cstc2021jcyj-msxmX1148], and the Open Project of Wuhan University of Technology Chongqing Research Institute [ZL2021–6].
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zi, Y., Xiong, S. Exploration of multi-source discriminative acoustic feature for speaker recognition with short-duration audio signal. Multimed Tools Appl 82, 47537–47557 (2023). https://doi.org/10.1007/s11042-023-16378-7
DOI: https://doi.org/10.1007/s11042-023-16378-7