Speaker Adversarial Neural Network (SANN) for Speaker-independent Speech Emotion Recognition

Published in Circuits, Systems, and Signal Processing (2022)

Abstract

Recently, domain adversarial neural networks (DANN) have delivered promising results on out-of-domain data. This paper exploits DANN for speaker-independent emotion recognition, where the domain corresponds to speakers, i.e., the training and testing sets contain different speakers. The result is a speaker adversarial neural network (SANN). The proposed SANN extracts speaker-invariant and emotion-discriminative features for the task of speech emotion recognition. To obtain speaker-invariant features, multi-task adversarial training of a deep neural network (DNN) is employed. The DNN framework consists of two sub-networks: one for emotion classification (the primary task) and the other for speaker classification (the auxiliary task). A gradient reversal layer (GRL) is inserted between (a) the layer shared by the primary and auxiliary classifiers and (b) the auxiliary classifier. The objective of the GRL is to reduce the variance among speakers by maximizing the speaker classification loss with respect to the shared layers. The framework jointly optimizes the two sub-networks so as to minimize the emotion classification loss while subjecting the speaker classification loss to min-max optimization. The proposed network was evaluated on the IEMOCAP and EMODB datasets. A total of 1582 features were extracted with the standard openSMILE toolkit, and a subset of these features was selected using a genetic algorithm. On the IEMOCAP dataset, the proposed SANN model achieved relative improvements of +6.025% (weighted accuracy) and +5.62% (unweighted accuracy) over the baseline system. Similar results were observed on the EMODB dataset. Furthermore, despite differences in models and features, the proposed approach also achieved significant accuracy improvements over state-of-the-art methods.
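The core of the SANN is a shared feature extractor whose output feeds both an emotion classifier and a speaker classifier, with a gradient reversal layer placed in front of the speaker branch. The sketch below illustrates this min-max setup in PyTorch. It is only a minimal illustration under assumed layer sizes, class names (SANN, GradReverse), and a fixed reversal weight lambda; it is not the authors' released code, which is available at the repository linked under Data Availability.

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; multiplies the gradient by -lambda
    # in the backward pass, as in standard DANN training.
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class SANN(nn.Module):
    def __init__(self, n_features, n_emotions, n_speakers, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        # Shared layers: common to the emotion and speaker classifiers.
        self.shared = nn.Sequential(
            nn.Linear(n_features, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        self.emotion_head = nn.Linear(128, n_emotions)   # primary task
        self.speaker_head = nn.Linear(128, n_speakers)   # auxiliary task

    def forward(self, x):
        h = self.shared(x)
        emotion_logits = self.emotion_head(h)
        # Gradients from the speaker loss are reversed before reaching the
        # shared layers, pushing them toward speaker-invariant features.
        speaker_logits = self.speaker_head(GradReverse.apply(h, self.lambd))
        return emotion_logits, speaker_logits

# Joint objective (sketch): the emotion loss is minimized throughout the
# network, while the speaker loss is minimized by the speaker head but
# maximized by the shared layers via the reversed gradient. For example:
#   model = SANN(n_features=1582, n_emotions=4, n_speakers=10)
#   e_logits, s_logits = model(batch_of_features)
#   loss = ce(e_logits, y_emotion) + ce(s_logits, y_speaker)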

Data Availability

The code and datasets generated and/or analysed during the current study are available at https://github.com/fahadmaster/SANN.

Author information

Corresponding author

Correspondence to Md Shah Fahad.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Fahad, M.S., Ranjan, A., Deepak, A. et al. Speaker Adversarial Neural Network (SANN) for Speaker-independent Speech Emotion Recognition. Circuits Syst Signal Process 41, 6113–6135 (2022). https://doi.org/10.1007/s00034-022-02068-6

  • DOI: https://doi.org/10.1007/s00034-022-02068-6
