Abstract
In this paper, we propose a speech emotion classification framework that combines a deep learning model with a statistics-based feature representation. The main contributions of our work are as follows: first, we build a parametrization vector by computing the mean value of the Mel Frequency Cepstral Coefficients (MFCCs); second, by designing a simple convolutional neural network, we substantially reduce training time while maintaining satisfactory classification accuracy. The performance of the model is evaluated and discussed on the RAVDESS dataset. The results are promising in terms of reliability and accuracy, with an overall average accuracy of 79.59%.
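The mean-MFCC parametrization described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the frame-level MFCC matrix would in practice come from an audio library (e.g. `librosa.feature.mfcc` on a RAVDESS utterance), but here a synthetic matrix stands in for it, and the coefficient count (40) and frame count (120) are assumptions chosen for illustration.

```python
import numpy as np

def mean_mfcc_vector(mfcc: np.ndarray) -> np.ndarray:
    """Collapse a (n_mfcc, n_frames) MFCC matrix into a fixed-length
    feature vector by averaging each coefficient over time."""
    return mfcc.mean(axis=1)

# Synthetic stand-in for the frame-level MFCCs of one utterance:
# 40 coefficients x 120 frames (both counts are illustrative).
rng = np.random.default_rng(0)
mfcc = rng.normal(size=(40, 120))

features = mean_mfcc_vector(mfcc)
print(features.shape)  # -> (40,): one value per coefficient
```

Averaging over the time axis yields a fixed-length vector regardless of utterance duration, which is what allows a simple fixed-input CNN classifier to consume utterances of varying length.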
Acknowledgements
The research presented in this paper is funded through the AL-Khawarizmi project: Artificial Intelligence and its Applications.
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Sekkate, S., Khalil, M., Adib, A. (2021). A Statistical Based Modeling Approach for Deep Learning Based Speech Emotion Recognition. In: Abraham, A., Piuri, V., Gandhi, N., Siarry, P., Kaklauskas, A., Madureira, A. (eds) Intelligent Systems Design and Applications. ISDA 2020. Advances in Intelligent Systems and Computing, vol 1351. Springer, Cham. https://doi.org/10.1007/978-3-030-71187-0_114
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-71186-3
Online ISBN: 978-3-030-71187-0
eBook Packages: Intelligent Technologies and Robotics (R0)