Abstract
With the advent of conversational voice assistants such as Alexa, Siri and Google Assistant, natural-language conversational systems, including chatbots and voice-recognition interfaces, are at a new high, and determining the age of a speaker is critical for setting the pertinent context. Age can be inferred from the speech signal through factors such as the physical attributes of the voice, linguistic attributes, frequency and speech rate. This paper discusses extracting spectral features of speech, such as cepstral coefficients, spectral decrease, centroid, flatness, spectral entropy, jitter and shimmer, as inputs that help classify speaker age through deep-learning techniques. A novel approach is presented, along with an implementation model using a Deep Neural Network and a Convolutional Neural Network to classify the features using three different classifiers. The results obtained from the proposed system outline its performance in speaker age recognition.
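To make the feature vocabulary concrete, the following is a minimal NumPy sketch (not the paper's implementation) of three of the named frame-level spectral features: spectral centroid, spectral flatness and spectral entropy. Frame length, sampling rate and the test tone are illustrative assumptions.

```python
import numpy as np

def spectral_features(frame, sr=16000):
    """Compute centroid, flatness and entropy of one analysis frame."""
    # Magnitude spectrum of a Hann-windowed frame.
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)

    # Spectral centroid: magnitude-weighted mean frequency (Hz).
    centroid = np.sum(freqs * spec) / np.sum(spec)

    # Spectral flatness: geometric mean / arithmetic mean of magnitudes
    # (close to 1 for noise-like frames, near 0 for tonal frames).
    flatness = np.exp(np.mean(np.log(spec + 1e-12))) / (np.mean(spec) + 1e-12)

    # Spectral entropy of the normalised power distribution (bits).
    power = spec ** 2
    p = power / np.sum(power)
    entropy = -np.sum(p * np.log2(p + 1e-12))
    return centroid, flatness, entropy

# Example: a single 1024-sample frame of a 440 Hz tone.
sr = 16000
t = np.arange(1024) / sr
c, f, e = spectral_features(np.sin(2 * np.pi * 440 * t), sr)
```

For a pure tone the centroid sits near the tone frequency and the flatness and entropy are low; noisier, breathier voices push flatness and entropy upward, which is one reason such features carry age-related information.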

Data availability
TIMIT, Switchboard and CMU Kids corpora.
Acknowledgment
I am grateful for the support and guidance provided by Prof. Dr. E. Chandra Eswaran throughout my research work.
Funding
This research work was supported by RUSA 2.0-BEICH.
Author information
Contributions
Both authors conceived the presented idea, developed the theory and performed the computations. Dr. E. Chandra encouraged K. Karthika to investigate the research problem and supervised the findings of this work. All authors discussed the results and contributed to the final manuscript. This work has been filed for Indian intellectual property protection under Patent Application Number 201841032399.
Ethics declarations
Conflict of Interest
The authors declare that they have no competing interests.
Replication of results
No replicated results are presented.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Kuppusamy, K., Eswaran, C. Convolutional and Deep Neural Networks based techniques for extracting the age-relevant features of the speaker. J Ambient Intell Human Comput 13, 5655–5667 (2022). https://doi.org/10.1007/s12652-021-03238-1