CCTG-NET: Contextualized Convolutional Transformer-GRU Network for speech emotion recognition

Tellai, Mohammed; Mao, Qirong

doi:10.1007/s10772-023-10080-7

Published: 27 December 2023

Volume 26, pages 1099–1116 (2023)

International Journal of Speech Technology

Mohammed Tellai¹ & Qirong Mao¹,²

Abstract

Speech is a crucial aspect of human-to-human interactions and plays a fundamental role in the advancement of human–computer interaction (HCI) systems. Developing an accurate speech emotion recognition (SER) system for human conversations is a critical yet challenging task. Existing state-of-the-art (SOTA) research in SER primarily focuses on modeling vocal information within individual conversational speech utterances, overlooking the significance of incorporating transactional information from the interaction context. In this paper, we present a novel Contextualized Convolutional Transformer-GRU Network (CCTG-Net) for recognizing speech emotions using Mel-spectrogram features, effectively integrating contextual information for emotion recognition. Our experiments are conducted on the widely used emotional benchmark dataset, IEMOCAP. In four-class emotion recognition, our proposed model achieves a weighted accuracy (WA) of 88.4% and an unweighted accuracy (UA) of 89.1%. Compared to SOTA methods, this marks a substantial 3.0% improvement in UA while maintaining an optimal balance between performance and complexity.
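The abstract reports both weighted and unweighted accuracy without defining them. The sketch below assumes the definitions commonly used in SER work: weighted accuracy as overall accuracy across all utterances, and unweighted accuracy as the mean of per-class recalls. The function names and the toy labels are hypothetical and are not taken from the paper or from IEMOCAP.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by total utterances."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally."""
    total = defaultdict(int)  # utterances per true class
    hits = defaultdict(int)   # correctly recognized utterances per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / total[c] for c in total) / len(total)


# Toy, imbalanced labels (hypothetical; not real IEMOCAP data).
y_true = ["neutral"] * 6 + ["angry"] * 2 + ["happy"] * 2
y_pred = ["neutral"] * 6 + ["angry", "neutral"] + ["happy", "neutral"]

wa = weighted_accuracy(y_true, y_pred)    # 8 of 10 utterances correct
ua = unweighted_accuracy(y_true, y_pred)  # mean of recalls 1.0, 0.5, 0.5
print(f"WA = {wa:.3f}, UA = {ua:.3f}")
```

On imbalanced data the two metrics diverge: here the majority class is recognized perfectly, which lifts WA, while UA is pulled down by the weaker minority classes. This is one reason SER papers often report both.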

Data availability

The data and code that support the results presented in this article are available upon request.

Code availability

The data and code that support the results presented in this article are available upon request.

Funding

This work is supported in part by the Key Projects of the National Natural Science Foundation of China under Grant U1836220, the National Natural Science Foundation of China under Grant 62176106, and the Jiangsu Province Key Research and Development Plan (BE2020036).

Author information

Authors and Affiliations

Department of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, 212013, Jiangsu Province, China
Mohammed Tellai & Qirong Mao
Jiangsu Engineering Research Center of Big Data Ubiquitous Perception and Intelligent Agriculture Applications, Zhenjiang, 212013, Jiangsu Province, China
Qirong Mao

Authors

Mohammed Tellai
Qirong Mao

Contributions

MT: Conceptualization, data curation, investigation, methodology, resources, software, validation, visualization, writing—original draft, writing—review and editing. QM: Supervision, formal analysis, methodology, validation, writing—review and editing.

Corresponding author

Correspondence to Qirong Mao.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Tellai, M., Mao, Q. CCTG-NET: Contextualized Convolutional Transformer-GRU Network for speech emotion recognition. Int J Speech Technol 26, 1099–1116 (2023). https://doi.org/10.1007/s10772-023-10080-7

Received: 26 July 2023
Accepted: 25 November 2023
Published: 27 December 2023
Issue Date: December 2023
DOI: https://doi.org/10.1007/s10772-023-10080-7
