Bangla Speech Emotion Recognition Using 3D CNN Bi-LSTM Model

  • Conference paper
  • First Online:
Machine Intelligence and Emerging Technologies (MIET 2022)

Abstract

Speech emotion recognition (SER) is the process of identifying the emotional characteristics of speech, and SER performance depends heavily on how effectively those characteristics are extracted. Neural networks based on various deep learning (DL) models have been investigated for SER, but existing studies concentrate mostly on a few resource-rich languages; SER for Bangla remains very limited, although it is one of the world's major mother languages. This study addresses Bangla SER (BSER) using a 3D convolutional neural network (CNN) and a bidirectional long short-term memory (Bi-LSTM) network. A central element of this work is the use of three speech signal transformations: the short-time Fourier transform (STFT), chroma STFT, and Mel-frequency cepstral coefficients (MFCC). The transformed features from the three methods are integrated and fed into the 3D CNN block for feature extraction. The resulting deep features are flattened to form the input of a time-distributed layer, whose output is passed to a bidirectional LSTM layer for emotion classification. When evaluated on a corpus of emotional Bangla speech, the proposed model outperformed comparable recent methods.
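The integration step described above — three time–frequency transforms stacked as channels to form the 3D CNN input — can be sketched in plain NumPy. This is a minimal illustration, not the paper's implementation: the window/hop sizes and bin count are assumptions, and the chroma and MFCC maps here are crude stand-ins (a real pipeline would typically use `librosa.feature.chroma_stft` and `librosa.feature.mfcc`).

```python
import numpy as np

sr = 16000
signal = np.random.default_rng(0).standard_normal(sr)  # 1 s of noise as a stand-in utterance

# Frame the signal: 512-sample Hann windows with a 256-sample hop
win, hop = 512, 256
n_frames = 1 + (len(signal) - win) // hop
frames = np.stack([signal[i * hop : i * hop + win] for i in range(n_frames)])

# STFT magnitude: per-frame FFT of the windowed frames -> (n_frames, 257)
stft = np.abs(np.fft.rfft(frames * np.hanning(win), axis=1))

def pool(feat, n_out):
    # Mean-pool along the frequency axis so every map shares one width
    edges = np.linspace(0, feat.shape[1], n_out + 1).astype(int)
    return np.stack(
        [feat[:, a:b].mean(axis=1) for a, b in zip(edges[:-1], edges[1:])],
        axis=1,
    )

n_bins = 64
stft_map = pool(stft, n_bins)                  # STFT channel
chroma_map = pool(stft, n_bins)                # stand-in for a chroma STFT map
mfcc_map = pool(np.log(stft + 1e-8), n_bins)   # stand-in for an MFCC map

# Stack the three transforms as channels: the "3D" feature the CNN block consumes
x = np.stack([stft_map, chroma_map, mfcc_map], axis=-1)
print(x.shape)  # (n_frames, n_bins, 3)
```

Stacking along a channel axis (rather than concatenating along frequency) is what lets a 3D convolution learn joint patterns across the three transforms while keeping each transform's time–frequency layout intact.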



Author information

Corresponding author

Correspondence to M. A. H. Akhand.


Copyright information

© 2023 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering

About this paper

Cite this paper

Islam, M.R., Akhand, M.A.H., Kamal, M.A.S. (2023). Bangla Speech Emotion Recognition Using 3D CNN Bi-LSTM Model. In: Satu, M.S., Moni, M.A., Kaiser, M.S., Arefin, M.S. (eds) Machine Intelligence and Emerging Technologies. MIET 2022. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 490. Springer, Cham. https://doi.org/10.1007/978-3-031-34619-4_42

  • DOI: https://doi.org/10.1007/978-3-031-34619-4_42

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-34618-7

  • Online ISBN: 978-3-031-34619-4

  • eBook Packages: Computer Science, Computer Science (R0)
