Abstract
Speech Emotion Recognition (SER) is an active area of signal processing research that aims at identifying emotional states from audio speech signals. Applications of SER range from psychological diagnosis to human-computer interaction and as such, a robust framework is needed for accurate classification. To this end, we propose a two-stage hybrid deep feature selection (HDFS) framework that combines deep learning with automated feature engineering for emotion recognition from human speeches, which shines both in terms of accuracy and computational efficiency. Our pipeline extracts self-learned features using a customized Wide-ResNet-50-2 deep learning model from mel-spectrograms of raw audio signals, whose dimensionality is reduced using a hybrid deep feature selection algorithm that comprises a fuzzy entropy and similarity-based feature ranking method, followed by Whale optimization algorithm, which is a popular meta-heuristic optimization algorithm in literature. A k-nearest neighbor classifier is used to classify the optimized feature subset into the respective emotion classes. The proposed pipeline is evaluated on three publicly available SER datasets using a 5-fold cross-validation scheme, where it is found to outperform several state-of-the-art existing works in literature by significant margins thus, justifying the superiority and reliability of the proposed research. The source codes of the proposed method can be found at: https://github.com/soumitri2001/Wrapper-Filter-Speech-Emotion-Recognition.
Similar content being viewed by others
Data Availability
Data sharing is not applicable to this article as no datasets were generated or analysed during the current study.
References
Abbaschian BJ, Sierra-Sosa D, Elmaghraby A (2021) Deep learning techniques for speech emotion recognition, from databases to models. Sensors
Abualigah L, Diabat A, Mirjalili S, Abd Elaziz M, Gandomi AH (2021) The arithmetic optimization algorithm. Comput Methods Appl Mech Eng 376:113609
Agrawal P, Abutarboush HF, Ganesh T, Mohamed AW (2021) Metaheuristic algorithms on feature selection: a survey of one decade of research (2009-2019). IEEE Access 9:26766–26791
Ahmed S, Ghosh KK, Garcia-Hernandez L, Abraham A, Sarkar R (2021) Improved coral reefs optimization with adaptive β-hill climbing for feature selection. Neural Comput & Applic 33(12):6467–6486
Akçay MB, Oğuz K (2020) Speech emotion recognition: emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers. Speech Comm 116:56–76
Albornoz EM, Milone DH, Rufiner HL (2011) Spoken emotion recognition using hierarchical classifiers. Comput Speech & Lang 25(3):556–570
Alghowinem S, Goecke R, Wagner M, Epps J, Gedeon T, Breakspear M, Parker G (2013) A comparative study of different classifiers for detecting depression from spontaneous speech. In: ICASSP. IEEE
Altman NS (1992) An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician
Ancilin J, Milton A (2021) Improved speech emotion recognition with mel frequency magnitude coefficient. Appl Acoust 179:108046
Bhavan A, Chauhan P, Shah RR, et al. (2019) Bagged support vector machines for emotion recognition from speech. Knowl-Based Syst 184:104886
Burkhardt F, Paeschke A, Rolfes M, Sendlmeier WF, Weiss B, et al. (2005) A database of german emotional speech. In: Interspeech, vol 5, pp 1517–1520
Chattopadhyay S, Kundu R, Singh PK, Mirjalili S, Sarkar R (2021) Pneumonia detection from lung x-ray images using local search aided sine cosine algorithm based deep feature selection method. International Journal of Intelligent Systems, pp 1–38
Daneshfar F, Kabudian SJ (2020) Speech emotion recognition using discriminative dimension reduction by employing a modified quantum-behaved particle swarm optimization algorithm. Multimed Tools Appl 79(1):1261–1289
Danisman T, Alpkocak A (2008) Emotion classification of audio signals using ensemble of support vector machines. In: International tutorial and research workshop on perception and interactive technologies for speech-based systems. pp 205–216. Springer
Deb K, Pratap A, Agarwal S, Meyarivan T (2002) A fast and elitist multiobjective genetic algorithm: nsga-ii. IEEE Trans Evol Comput 6 (2):182–197
Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) Imagenet: a large-scale hierarchical image database. In: 2009 IEEE Conference on computer vision and pattern recognition. pp 248–255. IEEE
Dey A, Chattopadhyay S, Singh PK, Ahmadian A, Ferrara M, Sarkar R (2020) A hybrid meta-heuristic feature selection method using golden ratio and equilibrium optimization algorithms for speech emotion recognition. IEEE Access 8:200953–200970
Farooq M, Hussain F, Baloch NK, Raja FR, Yu H, Zikria YB (2020) Impact of feature selection algorithm on speech emotion recognition using deep convolutional neural network. Sensors 20(21):6008
Fragopanagos N, Taylor JG (2005) Emotion recognition in human–computer interaction. Neural Netw 18(4):389–405
Ghosh KK, Ahmed S, Singh PK, Geem ZW, Sarkar R (2020) Improved binary sailfish optimizer based on adaptive β-hill climbing for feature selection. IEEE Access 8:83548–83560
Ghosh S, Hassan S, Khan AH, Manna A, Bhowmik S, Sarkar R (2021) Application of texture-based features for text non-text classification in printed document images with novel feature selection algorithm. Soft Computing, pp 1–19
Guha R, Ghosh M, Chakrabarti A, Sarkar R, Mirjalili S (2020) Introducing clustering based population in binary gravitational search algorithm for feature selection. Appl Soft Comput 93:106341
Guha R, Khan AH, Singh PK, Sarkar R, Bhattacharjee D (2021) Cga: a new feature selection model for visual human action recognition. Neural Comput & Applic 33(10):5267–5286
Hajarolasvadi N (2019) Demirel, h.: 3d cnn-based speech emotion recognition using k-means clustering and spectrograms. Entropy 21(5):479
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 770–778
Ibrahim H, Loo CK, Alnajjar F (2021) Speech emotion recognition by late fusion for bidirectional reservoir computing with random projection. IEEE Access 9:122855–122871
Kanwal S, Asghar S (2021) Speech emotion recognition using clustering based ga-optimized feature set. IEEE Access 9:125830–125842
Kennedy J, Eberhart R (1995) Particle swarm optimization. In: Proceedings of ICNN’95-international conference on neural networks. vol 4, pp 1942–1948. IEEE
Khalil RA, Jones E, Babar MI, Jan T, Zafar MH, Alhussain T (2019) Speech emotion recognition using deep learning techniques: A review. IEEE Access 7:117327–117345
Kononenko I (1994) Estimating attributes: analysis and extensions of relief. In: European conference on machine learning. pp 171–182. Springer
Kwon S, et al. (2021) Att-net: enhanced emotion recognition system using lightweight self-attention module. Appl Soft Comput 102:107101
Latif S, Qadir J, Bilal M (2019) Unsupervised adversarial domain adaptation for cross-lingual speech emotion recognition. In: 2019 8Th international conference on affective computing and intelligent interaction (ACII). pp 732–737. IEEE
Latif S, Qayyum A, Usman M, Qadir J (2018) Cross lingual speech emotion recognition: urdu vs. western languages. In: 2018 International conference on frontiers of information technology (FIT). pp 88–93. IEEE
Liu ZT, Wu M, Cao WH, Mao JW, Xu JP, Tan GZ (2018) Speech emotion recognition based on feature selection and extreme learning machine decision tree. Neurocomputing 273:271–280
Liu ZT, Xie Q, Wu M, Cao WH, Mei Y, Mao JW (2018) Speech emotion recognition based on an improved brain emotion learning model. Neurocomputing 309:145–156
Livingstone SR, Russo FA (2018) The ryerson audio-visual database of emotional speech and song (ravdess): a dynamic, multimodal set of facial and vocal expressions in north american english. PloS one
Luukka P (2011) Feature selection using fuzzy entropy measures with similarity classifier. Expert Syst Appl 38(4):4600–4607
Luukka P, Saastamoinen K, Kononen V (2001) A classifier based on the maximal fuzzy similarity in the generalized lukasiewicz-structure. In: 10Th IEEE international conference on fuzzy systems. pp 195–198. IEEE
Machado PP, Beutler LE, Greenberg LS (1999) Emotion recognition in psychotherapy: impact of therapist level of experience and emotional awareness. Journal of Clinical Psychology
Mafarja MM, Mirjalili S (2017) Hybrid whale optimization algorithm with simulated annealing for feature selection. Neurocomputing
Mafarja M, Qasem A, Heidari AA, Aljarah I, Faris H, Mirjalili S (2020) Efficient hybrid nature-inspired binary optimizers for feature selection. Cognitive Computation
Maldonado S, López J. (2018) Dealing with high-dimensional class-imbalanced datasets: embedded feature selection for svm classification. Applied Soft Computing
Mansouri-Benssassi E, Ye J (2019) Speech emotion recognition with early visual cross-modal enhancement using spiking neural networks. In: 2019 International joint conference on neural networks (IJCNN). pp 1–8. IEEE
Mao Q, Dong M, Huang Z, Zhan Y (2014) Learning salient features for speech emotion recognition using convolutional neural networks. IEEE Trans Multimed 16(8):2203–2213
Meftah IT, Le Thanh N, Amar CB (2012) Detecting depression using multimodal approach of emotion recognition. In: 2012 IEEE International conference on complex systems (ICCS). IEEE
Mirjalili S (2016) Sca: a sine cosine algorithm for solving optimization problems. Knowledge-based systems 96:120–133
Mirjalili S, Lewis A (2016) The whale optimization algorithm. Adv Eng Softw 95:51–67
Mirjalili S, Mirjalili SM, Lewis A (2014) Grey wolf optimizer. Adv Eng Softw 69:46–61
Mirsamadi S, Barsoum E, Zhang C (2017) Automatic speech emotion recognition using recurrent neural networks with local attention. In: ICASSP. IEEE
Nguyen BH, Xue B, Zhang M (2020) A survey on swarm intelligence approaches to feature selection in data mining. Swarm Evol Comput 54:100663
Ooi CS, Seng KP, Ang LM, Chew LW (2014) A new approach of audio emotion recognition. Expert Syst Appl 41(13):5858–5869
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L et al (2019) Pytorch: an imperative style, high-performance deep learning library. Adv Neural Inf Process Syst 32:8026–8037
Peng H, Long F, Ding C (2005) Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions in Pattern Analyis and Machine Intelligence
Ramakrishnan S, El Emary IM (2013) Speech emotion recognition approaches in human computer interaction. Telecommun Syst 52(3):1467–1478
Rashedi E, Nezamabadi-Pour H, Saryazdi S (2009) Gsa: a gravitational search algorithm. Inf Sci 179(13):2232–2248
Sarkar SS, Sheikh KH, Mahanty A, Mali K, Ghosh A, Sarkar R (2021) A harmony search-based wrapper-filter feature selection approach for microstructural image classification. Integr Mater Manuf Innov 10(1):1–19
Schipor OA, Pentiuc SG, Schipor MD (2011) Towards a multimodal emotion recognition framework to be integrated in a computer based speech therapy system. In: 2011 6Th conference on speech technology and human-computer dialogue (sped). IEEE
Sen S, Saha S, Chatterjee S, Mirjalili S, Sarkar R (2021) A bi-stage feature selection approach for covid-19 prediction using chest ct images. Applied Intelligence, pp 1–16
Sheikh KH, Ahmed S, Mukhopadhyay K, Singh PK, Yoon JH, Geem ZW, Sarkar R (2020) Ehhm: electrical harmony based hybrid meta-heuristic for feature selection. IEEE Access 8:158125–158141
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556.
Song P, Zheng W (2018) Feature selection based transfer subspace learning for speech emotion recognition. IEEE Trans Affect Comput 11(3):373–382
Sutskever I, Martens J, Dahl G, Hinton G (2013) On the importance of initialization and momentum in deep learning. In: International conference on machine learning. pp 1139–1147. PMLR
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B Methodol
Tuncer T, Dogan S, Acharya UR (2021) Automated accurate speech emotion recognition system using twine shuffle pattern and iterative neighborhood component analysis techniques. Knowledge-Based Systems
Yang XS, Deb S (2009) Cuckoo search via lévy flights. In: 2009 World congress on nature & biologically inspired computing (naBIC). IEEE
Yildirim S, Kaya Y, Kılıç F (2021) A modified feature selection method based on metaheuristic algorithms for speech emotion recognition. Appl Acoust 173:107721
Zagoruyko S, Komodakis N (2016) Wide residual networks. arXiv:1605.07146.
Zehra W, Javed AR, Jalil Z, Khan HU, Gadekallu TR (2021) Cross corpus multi-lingual speech emotion recognition using ensemble learning. Complex & Intelligent Systems, pp 1–10
Zhang R, Nie F, Li X, Wei X (2019) Feature selection with multi-view data: a survey. Inf Fusion 50:158–167
Zhang H, Zhang R, Nie F, Li X (2018) A generalized uncorrelated ridge regression with nonnegative labels for unsupervised feature selection. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) pp 2781–2785. IEEE.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of Interests
All the authors declare that there are no conflicts of interest in the publication of this paper.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Marik, A., Chattopadhyay, S. & Singh, P.K. A hybrid deep feature selection framework for emotion recognition from human speeches. Multimed Tools Appl 82, 11461–11487 (2023). https://doi.org/10.1007/s11042-022-14052-y
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11042-022-14052-y