Speaker age and gender classification using GMM supervector and NAP channel compensation method

Yücesoy, Ergün

doi:10.1007/s12652-020-02045-4

Original Research
Published: 13 May 2020

Volume 13, pages 3633–3642, (2022)
Journal of Ambient Intelligence and Humanized Computing

Ergün Yücesoy (ORCID: orcid.org/0000-0003-1707-384X)

Abstract

One of the most important factors affecting the performance of speech-based recognition systems is the mismatch between training and test conditions. Nuisance attribute projection (NAP) is an effective method for removing this mismatch, known as channel effects. This study investigates the effect of NAP on the classification of age and gender groups. Mel-frequency cepstral coefficients (MFCCs) and their delta coefficients are used as features, and Gaussian mixture models (GMMs) adapted from a universal background model (UBM) by the maximum-a-posteriori (MAP) method are used to model the age and gender classes. The GMM corresponding to each utterance is converted into a mean supervector and passed to a support vector machine (SVM), which classifies the utterance according to the age and gender group of the speaker. A linear GMM kernel based on the Kullback–Leibler divergence is used instead of standard SVM kernels. To determine the optimum parameter values, the NAP channel subspace size is varied between 20 and 200 and the number of GMM components between 32 and 512. In tests on the aGender database, the optimum number of components is found to be 128 and the optimum NAP channel subspace size is found to be 45. With these parameters, the use of NAP increases the age and gender classification accuracy of the system from 60.52% to 62.03%. In addition, age classification accuracy increases from 60.23% to 61.82% and gender classification accuracy from 91.71% to 92.30%.
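
The supervector and NAP steps described in the abstract can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the author's implementation: diagonal covariances, mean-only MAP adaptation with a relevance factor of 16, and a nuisance subspace estimated from within-group scatter are all choices made here for the example, and the function names are hypothetical.

```python
import numpy as np


def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation of a diagonal-covariance UBM to one utterance.

    frames: (T, D) features; weights: (K,); means, variances: (K, D).
    The relevance factor is an assumed value, not taken from the paper.
    """
    diff = frames[:, None, :] - means[None, :, :]              # (T, K, D)
    log_prob = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                       + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_post = np.log(weights) + log_prob                      # (T, K)
    log_post -= log_post.max(axis=1, keepdims=True)            # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)                    # responsibilities
    n = post.sum(axis=0)                                       # soft counts, (K,)
    ex = (post.T @ frames) / np.maximum(n, 1e-10)[:, None]     # per-component data mean
    alpha = (n / (n + relevance))[:, None]                     # adaptation coefficient
    return alpha * ex + (1.0 - alpha) * means


def kl_supervector(weights, means, variances):
    """Stack scaled means so that a dot product gives the KL-based linear kernel."""
    return (np.sqrt(weights)[:, None] * means / np.sqrt(variances)).ravel()


def nap_subspace(supervectors, labels, rank):
    """Estimate an orthonormal basis U (M, rank) of the nuisance subspace.

    Assumption: nuisance directions are the top principal directions of the
    scatter that remains after removing each group's mean supervector.
    """
    x = np.asarray(supervectors, dtype=float)
    labels = np.asarray(labels)
    centered = x.copy()
    for g in np.unique(labels):
        idx = labels == g
        centered[idx] -= x[idx].mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:rank].T


def apply_nap(supervectors, u):
    """Project supervectors with P = I - U U^T, removing the nuisance subspace."""
    x = np.asarray(supervectors, dtype=float)
    return x - (x @ u) @ u.T
```

In use, each utterance would be adapted with `map_adapt_means`, flattened with `kl_supervector`, projected with `apply_nap`, and then passed to a linear SVM. The `rank` argument plays the role of the NAP channel subspace size that the study varies between 20 and 200.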


Abbreviations

aGender:: Age and Gender Speech Corpus
NAP:: Nuisance attribute projection
MFCC:: Mel-frequency cepstral coefficient
GMM:: Gaussian mixture model
UBM:: Universal background model
MAP:: Maximum-a-posteriori
SVM:: Support vector machine
C:: Child
YF:: Young female
YM:: Young male
AF:: Adult female
AM:: Adult male
SF:: Senior female
SM:: Senior male
KL:: Kullback–Leibler
HMM:: Hidden Markov Model
DTW:: Dynamic time warping
ANN:: Artificial neural network
DNN:: Deep neural network
SDC:: Shifted delta cepstral
i-vector:: Identity vector
P:: Positive
N:: Negative
TP:: True positive
FN:: False negative
TN:: True negative
FP:: False positive
utt_a, utt_b:: Utterance a and b
HNR:: Harmonics-to-noise ratio
PLP:: Perceptual linear prediction
LPCC:: Linear prediction cepstrum coefficient
$D$ :: Feature size
$N$ :: Number of training points
${\alpha }_{i}$ :: Weights of the support vectors
${t}_{i}$ :: Ideal outputs
$K\left(x,{x}_{i}\right)$ :: Kernel function
${x}_{i}$ :: Support vectors
$x$ :: Observation
$d$ :: A learned constant
$b(x)$ :: A mapping
${\lambda }_{i}$ :: Mixture weights
$N()$ :: Gaussian function
${m}_{i}$ :: Mean vector
${\Sigma }_{i}$ :: Covariance matrix
$K$ :: The number of Gaussian components
${g}_{a}$ and ${g}_{b}$ :: GMM models for a and b utterance
$K$ :: NAP channel subspace size
$D({g}_{a}||{g}_{b})$ :: Natural distance between two utterances
${m}^{a}$, ${m}^{b}$ :: Mean supervectors for $a$ and $b$ utterance
${N}_{s}$ :: Number of speakers
${h}_{i}$ :: Number of sessions for the ith speaker
${\Phi }_{(1,{\mathrm{s}}_{1})}$ :: Expansion form of the recordings of the 1st speaker in the 1st session
${s}_{i}$ :: ith speaker
$I$ :: Identity matrix
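
As a rough guide to how these symbols fit together, the relations below follow the standard GMM supervector and NAP formulation from the speaker verification literature. They are not the paper's own equations, which are not reproduced on this page. The matrix $U$, whose orthonormal columns span the NAP channel subspace, and the names $f$, $g$, and $P$ are introduced here for illustration only.

$$f(x) = \sum_{i=1}^{N} {\alpha }_{i} {t}_{i} K\left(x,{x}_{i}\right) + d, \qquad K(x, y) = b(x)^{T} b(y)$$

$$g(x) = \sum_{i=1}^{K} {\lambda }_{i}\, N\left(x; {m}_{i}, {\Sigma }_{i}\right)$$

$$D({g}_{a}||{g}_{b}) \le \frac{1}{2} \sum_{i=1}^{K} {\lambda }_{i} \left({m}_{i}^{a} - {m}_{i}^{b}\right)^{T} {\Sigma }_{i}^{-1} \left({m}_{i}^{a} - {m}_{i}^{b}\right)$$

$$K(\mathrm{utt}_a, \mathrm{utt}_b) = \sum_{i=1}^{K} {\lambda }_{i} \left({m}_{i}^{a}\right)^{T} {\Sigma }_{i}^{-1} {m}_{i}^{b}$$

$$P = I - U U^{T}$$

The fourth relation is the linear kernel obtained from the upper bound on the Kullback–Leibler divergence in the third, which holds when only the means are adapted and the mixture weights and covariances are shared with the UBM.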

Author information

Authors and Affiliations

Vocational School of Technical Sciences, Ordu University, 52200, Ordu, Turkey
Ergün Yücesoy

Corresponding author

Correspondence to Ergün Yücesoy.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Yücesoy, E. Speaker age and gender classification using GMM supervector and NAP channel compensation method. J Ambient Intell Human Comput 13, 3633–3642 (2022). https://doi.org/10.1007/s12652-020-02045-4

Received: 04 November 2019
Accepted: 25 April 2020
Published: 13 May 2020
Issue Date: July 2022
DOI: https://doi.org/10.1007/s12652-020-02045-4
