Abstract
We propose a method that supplies the speaker's age as additional information while training a machine learning model for gender identification. To this end, we design a multi-task learning Deep Neural Network (DNN) in which the primary output layer targets the speaker's gender, while an auxiliary output layer targets the speaker's age group for each utterance; the age groups are defined taking the speaker's gender into account. We show experimentally that the multi-task learning DNN outperforms both a Gaussian Mixture Model (GMM) and a single-task DNN trained only for gender recognition on datasets that reflect real-life conditions, i.e., datasets containing recordings of speakers from all age groups, children to seniors. Our DNN takes the raw speech waveform as input, leaving the network free, during multi-task training, to learn features that discriminate both gender and age. The raw-waveform front end learns filters through convolutional layers, and Long Short-Term Memory recurrent projection (LSTMP) layers model the temporal dynamics of speech from the learned feature representation.
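The core idea of the abstract, a shared representation feeding a primary gender head and an auxiliary age-group head whose losses are combined, can be sketched in a few lines. This is a deliberately simplified, hypothetical illustration: it replaces the paper's convolutional raw-waveform front end and LSTMP layers with a single dense layer, and all dimensions, the auxiliary loss weight `lam`, and the input frame are assumed, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, label):
    return -np.log(probs[label])

# Hypothetical sizes: a 40-dim learned feature frame, a 64-dim shared layer,
# 2 gender classes, and 6 gender-dependent age groups.
D_IN, D_SHARED, N_GENDER, N_AGE = 40, 64, 2, 6
W_shared = rng.normal(0, 0.1, (D_IN, D_SHARED))
W_gender = rng.normal(0, 0.1, (D_SHARED, N_GENDER))  # primary output head
W_age    = rng.normal(0, 0.1, (D_SHARED, N_AGE))     # auxiliary output head

def forward(x):
    h = np.tanh(x @ W_shared)                 # shared representation
    return softmax(h @ W_gender), softmax(h @ W_age)

x = rng.normal(size=D_IN)                     # one synthetic input frame
p_gender, p_age = forward(x)

# Multi-task objective: primary gender loss plus a weighted auxiliary
# age-group loss; lam is an assumed auxiliary weight.
lam = 0.3
loss = cross_entropy(p_gender, 0) + lam * cross_entropy(p_age, 2)
```

The point of the shared layer is that gradients from both heads flow into `W_shared`, so the auxiliary age target regularizes the features used for the primary gender decision, which is the mechanism the abstract credits for the improvement.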
Sarma, M., Sarma, K.K. & Goel, N.K. Multi-task learning DNN to improve gender identification from speech leveraging age information of the speaker. Int J Speech Technol 23, 223–240 (2020). https://doi.org/10.1007/s10772-020-09680-4