Abstract
We propose a method that supplies the speaker's age as additional information while training a machine learning model for gender identification. To this end, we design a multi-task learning Deep Neural Network (DNN) in which the primary output layer targets the speaker's gender, while an auxiliary output layer targets the speaker's age group for each utterance; the age groups are defined taking the speaker's gender into account. We show experimentally that the multi-task learning DNN outperforms both a Gaussian Mixture Model (GMM) and a single-task DNN trained only for gender recognition on datasets that reflect real-life conditions, i.e., datasets containing recordings of speakers from all age groups, children to seniors. Our DNN takes the raw speech waveform as input, leaving the network free, during multi-task training, to learn features that discriminate both gender and age. The raw-waveform front end learns filters through convolutional layers, and Long Short-Term Memory recurrent projection (LSTMP) layers model the temporal dynamics of speech from the learned feature representation.
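The core idea of the abstract, a shared representation feeding a primary gender head and an auxiliary age-group head whose losses are combined, can be sketched in a few lines. This is a deliberately simplified, hypothetical illustration: it replaces the paper's convolutional raw-waveform front end and LSTMP layers with a single dense layer, and all dimensions, the auxiliary loss weight `lam`, and the input frame are assumed, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, label):
    return -np.log(probs[label])

# Hypothetical sizes: a 40-dim learned feature frame, a 64-dim shared layer,
# 2 gender classes, and 6 gender-dependent age groups.
D_IN, D_SHARED, N_GENDER, N_AGE = 40, 64, 2, 6
W_shared = rng.normal(0, 0.1, (D_IN, D_SHARED))
W_gender = rng.normal(0, 0.1, (D_SHARED, N_GENDER))  # primary output head
W_age    = rng.normal(0, 0.1, (D_SHARED, N_AGE))     # auxiliary output head

def forward(x):
    h = np.tanh(x @ W_shared)                 # shared representation
    return softmax(h @ W_gender), softmax(h @ W_age)

x = rng.normal(size=D_IN)                     # one synthetic input frame
p_gender, p_age = forward(x)

# Multi-task objective: primary gender loss plus a weighted auxiliary
# age-group loss; lam is an assumed auxiliary weight.
lam = 0.3
loss = cross_entropy(p_gender, 0) + lam * cross_entropy(p_age, 2)
```

The point of the shared layer is that gradients from both heads flow into `W_shared`, so the auxiliary age target regularizes the features used for the primary gender decision, which is the mechanism the abstract credits for the improvement.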
Sarma, M., Sarma, K.K. & Goel, N.K. Multi-task learning DNN to improve gender identification from speech leveraging age information of the speaker. Int J Speech Technol 23, 223–240 (2020). https://doi.org/10.1007/s10772-020-09680-4