Abstract
One of the most important factors affecting the performance of speech-based recognition systems is the mismatch between training and test conditions. Nuisance attribute projection (NAP) is an effective method for removing this mismatch, commonly referred to as channel effects. This study investigates the effect of NAP on the classification of speakers into age and gender groups. Mel-frequency cepstral coefficients and their delta coefficients are used as features, and each age and gender class is modeled by a Gaussian mixture model (GMM) adapted from a universal background model with the maximum-a-posteriori method. The GMM of each utterance is converted into a mean supervector and passed to a support vector machine (SVM), which classifies the utterance according to the age and gender group of the speaker. A linear GMM kernel based on the Kullback–Leibler divergence is used in place of the standard SVM kernels. To find the optimum parameter values, the NAP channel subspace size is varied between 20 and 200 and the number of GMM components between 32 and 512. In tests on the aGender database, the optimum number of components is 128 and the optimum NAP channel subspace size is 45. With these parameters, NAP raises the combined age and gender classification accuracy of the system from 60.52% to 62.03%. Age classification accuracy rises from 60.23% to 61.82%, and gender classification accuracy from 91.71% to 92.30%.
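The two supervector steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the author's implementation: the function names `kl_supervector`, `nap_basis`, and `nap_project` are hypothetical, diagonal covariances are assumed, and the nuisance directions are estimated here as the leading principal directions of within-speaker variability, which is one common way to train NAP.

```python
import numpy as np

def kl_supervector(means, weights, variances):
    """Stack adapted GMM means into one supervector, scaled so that a plain
    dot product between two supervectors equals the linear KL-based kernel.
    means, variances: (K, D) arrays (diagonal covariances assumed).
    weights: (K,) mixture weights shared with the background model."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    weights = np.asarray(weights, dtype=float)
    scale = np.sqrt(weights)[:, None] / np.sqrt(variances)
    return (scale * means).ravel()

def nap_basis(supervectors, speaker_ids, k):
    """Estimate k orthonormal nuisance directions (assumption: they are the
    top principal directions of the within-speaker, i.e. session, scatter)."""
    X = np.asarray(supervectors, dtype=float)
    ids = np.asarray(speaker_ids)
    centered = np.empty_like(X)
    for s in np.unique(ids):
        mask = ids == s
        # Removing each speaker's mean leaves only session-to-session variation.
        centered[mask] = X[mask] - X[mask].mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T  # shape (dim, k), orthonormal columns

def nap_project(supervectors, U):
    """Apply P = I - U U^T to each row without forming the full matrix."""
    X = np.asarray(supervectors, dtype=float)
    return X - (X @ U) @ U.T
```

In a pipeline like the one described, the projected supervectors would then be passed to a linear SVM; the subspace size `k` corresponds to the NAP channel subspace size that the study varies between 20 and 200.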


Abbreviations
- aGender: Age and Gender Speech Corpus
- NAP: Nuisance attribute projection
- MFCC: Mel-frequency cepstral coefficient
- GMM: Gaussian mixture model
- UBM: Universal background model
- MAP: Maximum-a-posteriori
- SVM: Support vector machine
- C: Child
- YF: Young female
- YM: Young male
- AF: Adult female
- AM: Adult male
- SF: Senior female
- SM: Senior male
- KL: Kullback–Leibler
- HMM: Hidden Markov model
- DTW: Dynamic time warping
- ANN: Artificial neural network
- DNN: Deep neural network
- SDC: Shifted delta cepstral
- i-vector: Identity vector
- P: Positive
- N: Negative
- TP: True positive
- FN: False negative
- TN: True negative
- FP: False positive
- utt a, utt b: Utterances a and b
- HNR: Harmonics-to-noise ratio
- PLP: Perceptual linear prediction
- LPCC: Linear prediction cepstrum coefficient
- \(D\): Feature size
- N: Number of training points
- \({\alpha }_{i}\): Weights of the support vectors
- \({t}_{i}\): Ideal outputs
- \(K\left(x,{x}_{i}\right)\): Kernel function
- \({x}_{i}\): Support vectors
- \(x\): Observation
- \(d\): A learned constant
- \(b(x)\): A mapping
- \({\lambda }_{i}\): Mixture weights
- \(N()\): Gaussian function
- \({m}_{i}\): Mean vector
- \({\Sigma }_{i}\): Covariance matrix
- \(K\): The number of Gaussian components
- \({g}_{a}\) and \({g}_{b}\): GMMs for utterances \(a\) and \(b\)
- \(K\): NAP channel subspace size
- \(D({g}_{a}||{g}_{b})\): Natural distance between two utterances
- \({m}^{a}\), \({m}^{b}\): Mean supervectors for utterances \(a\) and \(b\)
- \({N}_{s}\): Number of speakers
- \({h}_{i}\): Number of sessions for the ith speaker
- \({\Phi }_{(1,{\mathrm{s}}_{1})}\): Expansion form of the recordings of the 1st speaker in the 1st session
- \({s}_{i}\): ith speaker
- \(I\): Identity matrix
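
The paper's own equations are not reproduced on this page. As an aid to reading the symbol list, the block below gives the standard forms used for a GMM-supervector SVM with NAP, written with the symbols defined above; that they match the paper's equations exactly is an assumption. The matrix \(U\), whose orthonormal columns span the NAP channel subspace, and the projection \(P\) are names introduced here for illustration.

\[ f(x) = \sum_{i=1}^{N} {\alpha }_{i}\,{t}_{i}\,K\left(x,{x}_{i}\right) + d, \qquad K(x,y) = b(x)^{T}\,b(y) \]

\[ g(x) = \sum_{i=1}^{K} {\lambda }_{i}\,N\left(x;{m}_{i},{\Sigma }_{i}\right) \]

\[ D({g}_{a}||{g}_{b}) \le \frac{1}{2}\sum_{i=1}^{K} {\lambda }_{i}\,\left({m}_{i}^{a}-{m}_{i}^{b}\right)^{T}{\Sigma }_{i}^{-1}\left({m}_{i}^{a}-{m}_{i}^{b}\right) \]

\[ K(\text{utt } a,\text{utt } b) = \sum_{i=1}^{K} \left(\sqrt{{\lambda }_{i}}\,{\Sigma }_{i}^{-1/2}\,{m}_{i}^{a}\right)^{T}\left(\sqrt{{\lambda }_{i}}\,{\Sigma }_{i}^{-1/2}\,{m}_{i}^{b}\right) \]

\[ P = I - U\,U^{T} \]

The bound on \(D({g}_{a}||{g}_{b})\) holds when only the means are adapted, so that the two GMMs share weights and covariances. The kernel in the fourth line is the inner product induced by that bound, which is why it is linear in the scaled mean supervectors.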
Cite this article
Yücesoy, E. Speaker age and gender classification using GMM supervector and NAP channel compensation method. J Ambient Intell Human Comput 13, 3633–3642 (2022). https://doi.org/10.1007/s12652-020-02045-4
DOI: https://doi.org/10.1007/s12652-020-02045-4