Abstract
Speech emotion recognition (SER) is the process of identifying the emotional characteristics of speech. The quality of the emotional features extracted from speech is essential for SER performance. Neural networks based on various deep learning (DL) models have been investigated for SER, but existing studies mostly concentrate on a few resource-rich languages. SER for Bangla remains very limited, although Bangla is one of the world's major mother tongues. This study focuses on Bangla SER (BSER) using a 3D convolutional neural network (CNN) and a bidirectional long short-term memory network (Bi-LSTM). A key contribution of this work is the combined use of three speech signal transformations: the Short-time Fourier Transform (STFT), Chroma STFT, and Mel-frequency Cepstral Coefficients (MFCC). The features from the three transformations are stacked and fed into the 3D CNN block for deep feature extraction. The extracted features are flattened to form the input of a time-distributed layer, whose output is passed to a bidirectional LSTM layer for emotion classification. The proposed model outperformed comparable recent methods when evaluated on a corpus of emotional speech in Bangla.
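The core idea of the feature pipeline — transforming one utterance three ways and stacking the results into a single volume for the 3D CNN — can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the helper names `stft_mag` and `resize` are invented here, and because computing real chroma and MFCC features would need a library such as librosa, low-frequency slices of the STFT stand in for those two channels purely to show the stacking step.

```python
import numpy as np

def stft_mag(signal, n_fft=256, hop=128):
    """Magnitude short-time Fourier transform via plain NumPy framing."""
    win = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * win
              for i in range(0, len(signal) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1)).T  # (freq, time)

def resize(mat, shape):
    """Nearest-neighbour resample so all feature maps share one grid."""
    rows = np.linspace(0, mat.shape[0] - 1, shape[0]).astype(int)
    cols = np.linspace(0, mat.shape[1] - 1, shape[1]).astype(int)
    return mat[np.ix_(rows, cols)]

sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
speech = np.sin(2 * np.pi * 440 * t)    # stand-in for a Bangla utterance

stft = stft_mag(speech)                 # shape (129, 124): freq x time
# Chroma STFT (12 bins) and MFCC (typically 13-40 coefficients) would come
# from a signal-processing library; STFT slices substitute for them here.
chroma_like = stft[:12]
mfcc_like = stft[:40]

# Resample all three maps to one grid, then stack along a depth axis to
# build the 3D volume that the Conv3D feature extractor consumes.
grid = (64, stft.shape[1])
volume = np.stack([resize(m, grid) for m in (stft, chroma_like, mfcc_like)],
                  axis=-1)              # (64, 124, 3)
```

Downstream, each such volume would pass through the 3D CNN, be flattened per time step for the time-distributed layer, and finally feed the Bi-LSTM classifier.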
Copyright information
© 2023 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering
About this paper
Cite this paper
Islam, M.R., Akhand, M.A.H., Kamal, M.A.S. (2023). Bangla Speech Emotion Recognition Using 3D CNN Bi-LSTM Model. In: Satu, M.S., Moni, M.A., Kaiser, M.S., Arefin, M.S. (eds) Machine Intelligence and Emerging Technologies. MIET 2022. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 490. Springer, Cham. https://doi.org/10.1007/978-3-031-34619-4_42
Print ISBN: 978-3-031-34618-7
Online ISBN: 978-3-031-34619-4