Abstract
Speech emotion recognition (SER) is the process of identifying the emotional characteristics of speech. The quality of the emotional features extracted from speech is essential for SER performance. Neural networks based on various deep learning (DL) models have been investigated for SER, but existing studies mostly concentrate on a few resource-rich languages. SER for Bangla remains very limited, although Bangla is one of the world's major mother tongues. This study focuses on Bangla SER (BSER) using a 3D convolutional neural network (CNN) and a bidirectional long short-term memory network (Bi-LSTM). A key contribution of this work is the combined use of three speech signal transformations: the Short-time Fourier Transform (STFT), Chroma STFT, and Mel-frequency Cepstral Coefficients (MFCC). The features from the three transformations are stacked and fed into the 3D CNN block for deep feature extraction. The extracted features are flattened to form the input of a time-distributed layer, whose output is passed to a bidirectional LSTM layer for emotion classification. The proposed model outperformed comparable recent methods when evaluated on a corpus of emotional speech in Bangla.
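The core idea of the feature pipeline — transforming one utterance three ways and stacking the results into a single volume for the 3D CNN — can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the helper names `stft_mag` and `resize` are invented here, and because computing real chroma and MFCC features would need a library such as librosa, low-frequency slices of the STFT stand in for those two channels purely to show the stacking step.

```python
import numpy as np

def stft_mag(signal, n_fft=256, hop=128):
    """Magnitude short-time Fourier transform via plain NumPy framing."""
    win = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * win
              for i in range(0, len(signal) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1)).T  # (freq, time)

def resize(mat, shape):
    """Nearest-neighbour resample so all feature maps share one grid."""
    rows = np.linspace(0, mat.shape[0] - 1, shape[0]).astype(int)
    cols = np.linspace(0, mat.shape[1] - 1, shape[1]).astype(int)
    return mat[np.ix_(rows, cols)]

sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
speech = np.sin(2 * np.pi * 440 * t)    # stand-in for a Bangla utterance

stft = stft_mag(speech)                 # shape (129, 124): freq x time
# Chroma STFT (12 bins) and MFCC (typically 13-40 coefficients) would come
# from a signal-processing library; STFT slices substitute for them here.
chroma_like = stft[:12]
mfcc_like = stft[:40]

# Resample all three maps to one grid, then stack along a depth axis to
# build the 3D volume that the Conv3D feature extractor consumes.
grid = (64, stft.shape[1])
volume = np.stack([resize(m, grid) for m in (stft, chroma_like, mfcc_like)],
                  axis=-1)              # (64, 124, 3)
```

Downstream, each such volume would pass through the 3D CNN, be flattened per time step for the time-distributed layer, and finally feed the Bi-LSTM classifier.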
Copyright information
© 2023 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering
About this paper
Cite this paper
Islam, M.R., Akhand, M.A.H., Kamal, M.A.S. (2023). Bangla Speech Emotion Recognition Using 3D CNN Bi-LSTM Model. In: Satu, M.S., Moni, M.A., Kaiser, M.S., Arefin, M.S. (eds) Machine Intelligence and Emerging Technologies. MIET 2022. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 490. Springer, Cham. https://doi.org/10.1007/978-3-031-34619-4_42
Print ISBN: 978-3-031-34618-7
Online ISBN: 978-3-031-34619-4