Abstract
In this paper, we present a novel approach to the paralinguistic task of recognizing speaker age and gender from voice, based on deep neural networks. The proposed models were trained and tested on the German speech corpus aGender. We conducted experiments with different network topologies, including neural networks with fully-connected and convolutional layers. In joint recognition of speaker age and gender, our system reached an unweighted accuracy of 48.41%; in separate age and gender recognition setups, it achieved 57.53% and 88.80%, respectively. The applied deep neural networks outperform existing traditional classification methods on speaker age recognition.
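The abstract describes convolutional and fully-connected topologies for classifying age and gender from speech. As a minimal, hypothetical sketch of such a model (not the authors' actual architecture — layer sizes, the 7-class joint label set, and the input spectrogram shape are assumptions), a PyTorch-style network might look like this:

```python
import torch
import torch.nn as nn

class AgeGenderCNN(nn.Module):
    """Illustrative CNN for joint age/gender classification from log-mel spectrograms."""

    def __init__(self, n_classes: int = 7):  # 7 joint classes assumed for aGender
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # (B, 1, F, T) -> (B, 16, F, T)
            nn.ReLU(),
            nn.MaxPool2d(2),                              # halve both time and frequency
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),                 # fixed-size output for any input length
        )
        self.classifier = nn.Linear(32 * 4 * 4, n_classes)  # fully-connected output layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        return self.classifier(x.flatten(1))

model = AgeGenderCNN()
spec = torch.randn(8, 1, 64, 100)  # batch of 8 spectrograms: 64 mel bands, 100 frames
logits = model(spec)
print(logits.shape)  # torch.Size([8, 7])
```

The adaptive pooling layer lets the same network handle utterances of varying duration, which is a common design choice for spectrogram inputs; whether the paper uses this mechanism is not stated in the abstract.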
Acknowledgements
This research is supported by the Russian Science Foundation (project No. 18-11-00145).
Copyright information
© 2019 Springer Nature Switzerland AG
Cite this paper
Markitantov, M., Verkholyak, O. (2019). Automatic Recognition of Speaker Age and Gender Based on Deep Neural Networks. In: Salah, A., Karpov, A., Potapova, R. (eds.) Speech and Computer. SPECOM 2019. Lecture Notes in Computer Science, vol. 11658. Springer, Cham. https://doi.org/10.1007/978-3-030-26061-3_34
Print ISBN: 978-3-030-26060-6
Online ISBN: 978-3-030-26061-3