Speaker Adversarial Neural Network (SANN) for Speaker-independent Speech Emotion Recognition

Published in Circuits, Systems, and Signal Processing (2022)

Abstract

Recently, domain adversarial neural networks (DANN) have delivered promising results on out-of-domain data. This paper exploits DANN for speaker-independent emotion recognition, where the domain corresponds to speakers, i.e., the training and testing sets contain different speakers. The result is a speaker adversarial neural network (SANN). The proposed SANN extracts speaker-invariant and emotion-discriminative features for the task of speech emotion recognition. To obtain speaker-invariant features, multi-task adversarial training of a deep neural network (DNN) is employed. The DNN framework consists of two sub-networks: one for emotion classification (the primary task) and the other for speaker classification (the auxiliary task). A gradient reversal layer (GRL) is inserted between (a) the layer shared by the primary and auxiliary classifiers and (b) the auxiliary classifier. The objective of the GRL is to reduce the variance among speakers by maximizing the speaker classification loss with respect to the shared layers. The framework jointly optimizes the two sub-networks so as to minimize the emotion classification loss while subjecting the speaker classification loss to min-max optimization. The proposed network was evaluated on the IEMOCAP and EMODB datasets. A total of 1582 features were extracted with the standard openSMILE toolkit, and a subset of these features was selected using a genetic algorithm. On the IEMOCAP dataset, the proposed SANN model achieved relative improvements of +6.025% (weighted accuracy) and +5.62% (unweighted accuracy) over the baseline system. Similar results were observed on the EMODB dataset. Furthermore, despite differences in models and features, the proposed approach also achieved significant accuracy improvements over state-of-the-art methods.
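The core of the SANN is a shared feature extractor whose output feeds both an emotion classifier and a speaker classifier, with a gradient reversal layer placed in front of the speaker branch. The sketch below illustrates this min-max setup in PyTorch. It is only a minimal illustration under assumed layer sizes, class names (SANN, GradReverse), and a fixed reversal weight lambda; it is not the authors' released code, which is available at the repository linked under Data Availability.

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; multiplies the gradient by -lambda
    # in the backward pass, as in standard DANN training.
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class SANN(nn.Module):
    def __init__(self, n_features, n_emotions, n_speakers, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        # Shared layers: common to the emotion and speaker classifiers.
        self.shared = nn.Sequential(
            nn.Linear(n_features, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        self.emotion_head = nn.Linear(128, n_emotions)   # primary task
        self.speaker_head = nn.Linear(128, n_speakers)   # auxiliary task

    def forward(self, x):
        h = self.shared(x)
        emotion_logits = self.emotion_head(h)
        # Gradients from the speaker loss are reversed before reaching the
        # shared layers, pushing them toward speaker-invariant features.
        speaker_logits = self.speaker_head(GradReverse.apply(h, self.lambd))
        return emotion_logits, speaker_logits

# Joint objective (sketch): the emotion loss is minimized throughout the
# network, while the speaker loss is minimized by the speaker head but
# maximized by the shared layers via the reversed gradient. For example:
#   model = SANN(n_features=1582, n_emotions=4, n_speakers=10)
#   e_logits, s_logits = model(batch_of_features)
#   loss = ce(e_logits, y_emotion) + ce(s_logits, y_speaker)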

Data Availability

The code and datasets generated and/or analysed during the current study are available at https://github.com/fahadmaster/SANN.

Author information

Corresponding author

Correspondence to Md Shah Fahad.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Fahad, M.S., Ranjan, A., Deepak, A. et al. Speaker Adversarial Neural Network (SANN) for Speaker-independent Speech Emotion Recognition. Circuits Syst Signal Process 41, 6113–6135 (2022). https://doi.org/10.1007/s00034-022-02068-6

  • DOI: https://doi.org/10.1007/s00034-022-02068-6
