
Speaker age and gender classification using GMM supervector and NAP channel compensation method

  • Original Research
Journal of Ambient Intelligence and Humanized Computing

Abstract

One of the most important factors affecting the performance of speech-based recognition systems is the mismatch between training and test conditions. Nuisance attribute projection (NAP) is an effective method for removing this mismatch, known as channel effects. This study investigates the effect of NAP on the classification of speakers into age and gender groups. Mel-frequency cepstral coefficients (MFCCs) and their delta coefficients are used as features, and Gaussian mixture models (GMMs) adapted from a universal background model (UBM) by the maximum-a-posteriori (MAP) method are used to model the age and gender classes. The GMM corresponding to each utterance is converted into a mean supervector and passed to a support vector machine (SVM), which classifies the utterance according to the age and gender group of its speaker. A linear GMM kernel based on the Kullback–Leibler divergence is used instead of standard SVM kernels. To determine the optimum parameter values, the NAP channel subspace size is varied between 20 and 200 and the number of GMM components between 32 and 512. In tests on the aGender database, the optimum number of components is found to be 128 and the optimum NAP channel subspace size 45. With these parameters, NAP increases the combined age and gender classification accuracy from 60.52% to 62.03%, the age classification accuracy from 60.23% to 61.82%, and the gender classification accuracy from 91.71% to 92.30%.
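The processing chain summarized in the abstract (MAP adaptation of a UBM, conversion to a mean supervector, and NAP compensation) can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the author's implementation: the function names, the relevance factor of 16, the use of diagonal covariances, mean-only adaptation, and the SVD-based estimate of the channel subspace from within-speaker variation are all hypothetical choices made for the example. Feature extraction and SVM training are omitted.

```python
import numpy as np

def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation of a diagonal-covariance UBM to one utterance.

    frames: (T, D) feature vectors; weights: (K,); means, variances: (K, D).
    The relevance factor is an assumed value, not taken from the paper.
    """
    diff = frames[:, None, :] - means[None, :, :]              # (T, K, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_post = np.log(weights) + log_gauss                     # (T, K)
    log_post -= log_post.max(axis=1, keepdims=True)            # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)                    # component posteriors
    counts = post.sum(axis=0)                                  # soft counts, (K,)
    data_means = (post.T @ frames) / np.maximum(counts, 1e-10)[:, None]
    alpha = (counts / (counts + relevance))[:, None]           # adaptation weights
    return alpha * data_means + (1.0 - alpha) * means

def kl_supervector(weights, adapted_means, variances):
    """Stack the adapted means, scaled so that a plain dot product between two
    supervectors equals the linear kernel derived from the KL divergence bound."""
    scaled = np.sqrt(weights)[:, None] * adapted_means / np.sqrt(variances)
    return scaled.ravel()

def nap_basis(supervectors, speaker_ids, subspace_size):
    """Estimate an orthonormal channel basis from within-speaker variation."""
    centered = supervectors.astype(float).copy()
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        centered[mask] -= centered[mask].mean(axis=0)          # remove speaker mean
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:subspace_size].T                                # (M, subspace_size)

def nap_project(supervectors, basis):
    """Apply P = I - U U^T without forming the full projection matrix."""
    return supervectors - (supervectors @ basis) @ basis.T
```

In such a setup, the projected supervectors would be passed to a linear SVM; with the scaling in `kl_supervector`, no custom kernel function is needed because the kernel reduces to an inner product.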


Abbreviations

aGender:

Age and Gender Speech Corpus

NAP:

Nuisance attribute projection

MFCC:

Mel-frequency cepstral coefficient

GMM:

Gaussian mixture model

UBM:

Universal background model

MAP:

Maximum-a-posteriori

SVM:

Support vector machine

C:

Child

YF:

Young female

YM:

Young male

AF:

Adult female

AM:

Adult male

SF:

Senior female

SM:

Senior male

KL:

Kullback–Leibler

HMM:

Hidden Markov Model

DTW:

Dynamic time warping

ANN:

Artificial neural network

DNN:

Deep neural network

SDC:

Shifted delta cepstral

i-vector:

Identity vector

P:

Positive

N:

Negative

TP:

True positive

FN:

False negative

TN:

True negative

FP:

False positive

utt a, utt b :

Utterances a and b

HNR:

Harmonics-to-noise ratio

PLP:

Perceptual linear prediction

LPCC:

Linear prediction cepstrum coefficient

\(D\) :

Feature size

N:

Number of training points

\({\alpha }_{i}\) :

Weights of the support vectors

\({t}_{i}\) :

Ideal outputs

\(K\left(x,{x}_{i}\right)\) :

Kernel function

\({x}_{i}\) :

Support vectors

\(x\) :

Observation

\(d\) :

A learned constant

\(b(x)\) :

A mapping

\({\lambda }_{i}\) :

Mixture weights

\(N()\) :

Gaussian function

\({m}_{i}\) :

Mean vector

\({\Sigma }_{i}\) :

Covariance matrix

\(K\) :

The number of Gaussian components

\({g}_{a}\) and \({g}_{b}\) :

GMMs for utterances a and b

\(K\) :

NAP channel subspace size

\(D({g}_{a}||{g}_{b})\) :

Natural distance between two utterances

\({m}^{a}\), \({m}^{b}\) :

Mean supervectors for utterances \(a\) and \(b\)

\({N}_{s}\) :

Number of speakers

\({h}_{i}\) :

Number of sessions for the ith speaker

\({\Phi }_{(1,{\mathrm{s}}_{1})}\) :

Expansion form of the recording of the 1st speaker in the 1st session

\({s}_{i}\) :

ith speaker

\(I\) :

Identity matrix
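
For orientation, the symbols listed above are conventionally combined as shown below. These are the standard textbook forms of the SVM discriminant, the GMM density, the KL-divergence-based supervector kernel, and the NAP projection; the paper's own numbered equations are not reproduced here, so the exact notation may differ. The matrix \(U\), whose orthonormal columns span the channel subspace, is a symbol introduced only for this illustration. Note that the list uses \(K\) both for the number of Gaussian components and for the NAP channel subspace size, in addition to the kernel \(K(\cdot,\cdot)\); in the sums below, \(K\) denotes the number of Gaussian components.

\[
f(x) = \sum_{i=1}^{N} \alpha_i t_i K(x, x_i) + d, \qquad K(x, y) = b(x)^{T} b(y)
\]

\[
g(x) = \sum_{i=1}^{K} \lambda_i \, N(x; m_i, \Sigma_i)
\]

\[
D(g_a \| g_b) \le \frac{1}{2} \sum_{i=1}^{K} \lambda_i \left(m_i^{a} - m_i^{b}\right)^{T} \Sigma_i^{-1} \left(m_i^{a} - m_i^{b}\right)
\]

\[
K(utt_a, utt_b) = \sum_{i=1}^{K} \left(\sqrt{\lambda_i}\, \Sigma_i^{-1/2} m_i^{a}\right)^{T} \left(\sqrt{\lambda_i}\, \Sigma_i^{-1/2} m_i^{b}\right)
\]

\[
P = I - U U^{T}
\]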

Author information

Corresponding author

Correspondence to Ergün Yücesoy.

About this article

Cite this article

Yücesoy, E. Speaker age and gender classification using GMM supervector and NAP channel compensation method. J Ambient Intell Human Comput 13, 3633–3642 (2022). https://doi.org/10.1007/s12652-020-02045-4

  • DOI: https://doi.org/10.1007/s12652-020-02045-4
