CCTG-NET: Contextualized Convolutional Transformer-GRU Network for speech emotion recognition

International Journal of Speech Technology

Abstract

Speech is a crucial aspect of human-to-human interaction and plays a fundamental role in the advancement of human–computer interaction (HCI) systems. Developing an accurate speech emotion recognition (SER) system for human conversations is a critical yet challenging task. Existing state-of-the-art (SOTA) research in SER primarily focuses on modeling vocal information within individual conversational utterances, overlooking the transactional information carried by the interaction context. In this paper, we present a novel Contextualized Convolutional Transformer-GRU Network (CCTG-Net) that recognizes speech emotions from Mel-spectrogram features while effectively integrating contextual information. Our experiments are conducted on the widely used emotional benchmark dataset IEMOCAP. Compared to SOTA methods in four-class emotion recognition, the proposed model achieves a weighted accuracy (WA) of 88.4% and an unweighted accuracy (UA) of 89.1%. This marks a substantial 3.0% improvement in UA while maintaining a favorable balance between performance and complexity.
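
The abstract names the building blocks of CCTG-Net but not their wiring. The sketch below shows, in PyTorch, one plausible way a convolutional front-end, a Transformer encoder, and a GRU could be chained over Mel-spectrogram input for four-class recognition on IEMOCAP; all layer sizes, kernel sizes, and hyperparameters are illustrative assumptions, not the authors' reported configuration.

```python
# Minimal sketch only: a Conv -> Transformer -> GRU chain over Mel-spectrograms,
# in the spirit of the pipeline the abstract describes. Every size below is an
# assumption for illustration, not the published CCTG-Net configuration.
import torch
import torch.nn as nn


class CCTGSketch(nn.Module):
    def __init__(self, n_mels=64, d_model=128, n_classes=4):
        super().__init__()
        # Convolutional front-end: local time-frequency patterns from the
        # (batch, 1, n_mels, frames) Mel-spectrogram.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d((2, 2)),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d((2, 2)),
        )
        self.proj = nn.Linear(64 * (n_mels // 4), d_model)
        # Transformer encoder: long-range dependencies across frames.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # GRU: summarizes the contextualized frame sequence.
        self.gru = nn.GRU(d_model, d_model, batch_first=True)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, mel):  # mel: (batch, 1, n_mels, frames)
        x = self.conv(mel)                    # (batch, 64, n_mels/4, frames/4)
        x = x.permute(0, 3, 1, 2).flatten(2)  # (batch, frames/4, 64 * n_mels/4)
        x = self.proj(x)                      # (batch, frames/4, d_model)
        x = self.transformer(x)               # contextualized frame features
        _, h = self.gru(x)                    # h: (1, batch, d_model)
        return self.classifier(h[-1])         # logits over the 4 emotion classes


# Usage example: two hypothetical utterances, 64 Mel bands, 256 frames.
logits = CCTGSketch()(torch.randn(2, 1, 64, 256))
print(logits.shape)  # torch.Size([2, 4])
```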

Data and code availability

The data and code that support the results presented in this article are available upon request.

References

  • Afrillia, Y., Mawengkang, H., Ramli, M., & Fhonna, R. P. (2017). Performance measurement of Mel frequency cepstral coefficient (MFCC) method in learning system of Al-Qur’an based in nagham pattern recognition. Journal of Physics: Conference Series, 930, 012036.

  • Aftab, A., Morsali, A., Ghaemmaghami, S., & Champagne, B. (2022). Light-SERNet: A lightweight fully convolutional neural network for speech emotion recognition. In ICASSP 2022-2022 IEEE International conference on acoustics, speech and signal processing (ICASSP) (pp. 6912–6916). IEEE.

  • Anagnostopoulos, C.-N., Iliou, T., & Giannoukos, I. (2015). Features and classifiers for emotion recognition from speech: A survey from 2000 to 2011. Artificial Intelligence Review, 43(2), 155–177.

  • Araujo, A., Norris, W., & Sim, J. (2019). Computing receptive fields of convolutional neural networks. Distill, 4(11), 21.

  • Barsade, S. G. (2002). The ripple effect: Emotional contagion and its influence on group behavior. Administrative Science Quarterly, 47(4), 644–675.

  • Bingol, M. C., & Aydogmus, O. (2020). Performing predefined tasks using the human–robot interaction on speech recognition for an industrial robot. Engineering Applications of Artificial Intelligence, 95, 103903.

  • Bone, D., Lee, C.-C., Chaspari, T., Gibson, J., & Narayanan, S. (2017). Signal processing and machine learning for mental health research and clinical applications [perspectives]. IEEE Signal Processing Magazine, 34(5), 196–195.

  • Busso, C., Bulut, M., Lee, C.-C., Kazemzadeh, A., Mower, E., Kim, S., Chang, J. N., Lee, S., & Narayanan, S. S. (2008). IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42, 335–359.

  • Chen, M., He, X., Yang, J., & Zhang, H. (2018). 3-D convolutional recurrent neural networks with attention model for speech emotion recognition. IEEE Signal Processing Letters, 25(10), 1440–1444.

  • Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

  • Ciresan, D. C., Meier, U., Masci, J., Gambardella, L. M., & Schmidhuber, J. (2011). Flexible, high performance convolutional neural networks for image classification. In 22nd International joint conference on artificial intelligence (IJCAI).

  • Dong, G.-N., Pun, C.-M., & Zhang, Z. (2022). Temporal relation inference network for multimodal speech emotion recognition. IEEE Transactions on Circuits and Systems for Video Technology, 32(9), 6472–6485.

  • El Ayadi, M., Kamel, M. S., & Karray, F. (2011). Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognition, 44(3), 572–587.

  • Gomathy, M. (2021). Optimal feature selection for speech emotion recognition using enhanced cat swarm optimization algorithm. International Journal of Speech Technology, 24(1), 155–163.

  • Han, K., Yu, D., & Tashev, I. (2014). Speech emotion recognition using deep neural network and extreme learning machine. In Interspeech 2014.

  • Han, T., Zhang, Z., Ren, M., Dong, C., Jiang, X., & Zhuang, Q. (2023). Speech emotion recognition based on deep residual shrinkage network. Electronics, 12(11), 2512.

  • Hareli, S., David, S., & Hess, U. (2016). The role of emotion transition for the perception of social dominance and affiliation. Cognition and Emotion, 30(7), 1260–1270.

  • Hazarika, D., Poria, S., Mihalcea, R., Cambria, E., & Zimmermann, R. (2018). ICON: Interactive conversational memory network for multimodal emotion detection. In Proceedings of the 2018 conference on empirical methods in natural language processing (pp. 2594–2604).

  • Hazarika, D., Poria, S., Zadeh, A., Cambria, E., Morency, L.-P., & Zimmermann, R. (2018). Conversational memory network for emotion recognition in dyadic dialogue videos. In Proceedings of the conference of the association for computational linguistics. North American chapter meeting (Vol. 2018, p. 2122). NIH Public Access.

  • Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.

  • Huang, Z., Xue, W., & Mao, Q. (2015). Speech emotion recognition with unsupervised feature learning. Frontiers of Information Technology & Electronic Engineering, 16(5), 358–366.

  • Ismail, A., Idris, M. Y. I., Noor, N. M., Razak, Z., & Yusoff, Z. M. (2014). MFCC-VQ approach for qalqalahtajweed rule checking. Malaysian Journal of Computer Science, 27(4), 275–293.

  • Jalal, M. A., Milner, R., & Hain, T. (2020). Empirical interpretation of speech emotion perception with attention based model for speech emotion recognition. In Proceedings of Interspeech (pp. 4113–4117). International Speech Communication Association (ISCA).

  • Jokinen, K., & McTear, M. (2009). Spoken dialogue systems. Synthesis Lectures on Human Language Technologies, 2(1), 1–151.

  • Kim, E., & Shin, J. W. (2019). DNN-based emotion recognition based on bottleneck acoustic features and lexical features. In ICASSP 2019-2019 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 6720–6724). IEEE.

  • Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

  • Kumaran, U., Radha Rammohan, S., Nagarajan, S. M., & Prathik, A. (2021). Fusion of Mel and gammatone frequency cepstral coefficients for speech emotion recognition using deep C-RNN. International Journal of Speech Technology, 24(2), 303–314.

  • Lee, J., & Tashev, I. (2015). High-level feature representation using recurrent neural network for speech emotion recognition. In Interspeech 2015.

  • Li, R., Wu, Z., Jia, J., Zhao, S., & Meng, H. (2019). Dilated residual network with multi-head self-attention for speech emotion recognition. In ICASSP 2019-2019 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 6675–6679). IEEE.

  • Li, Y., Zhao, T., & Kawahara, T. (2019). Improved end-to-end speech emotion recognition using self attention mechanism and multitask learning. In Interspeech (pp. 2803–2807).

  • Lian, Z., Liu, B., & Tao, J. (2021). CTNet: Conversational transformer network for emotion recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 985–1000.

  • Liu, K., Wang, C., Chen, J., & Feng, J. (2022). Time-frequency attention for speech emotion recognition with squeeze-and-excitation blocks. In: Proceedings of multimedia modeling: 28th international conference (MMM 2022) (Part I, pp. 533–543), Phu Quoc, Vietnam, June 6–10, 2022. Springer.

  • Liu, M. (2022). English speech emotion recognition method based on speech recognition. International Journal of Speech Technology, 1–8.

  • Luo, W., Li, Y., Urtasun, R., & Zemel, R. (2016). Understanding the effective receptive field in deep convolutional neural networks. In Advances in neural information processing systems (NIPS) (Vol. 29).

  • Majumder, N., Poria, S., Hazarika, D., Mihalcea, R., Gelbukh, A., & Cambria, E. (2019). DialogueRNN: An attentive RNN for emotion detection in conversations. In Proceedings of the AAAI conference on artificial intelligence (Vol. 33, pp. 6818–6825).

  • Mao, Q., Dong, M., Huang, Z., & Zhan, Y. (2014). Learning salient features for speech emotion recognition using convolutional neural networks. IEEE Transactions on Multimedia, 16(8), 2203–2213.

  • Mao, Q., Xu, G., Xue, W., Gou, J., & Zhan, Y. (2017). Learning emotion-discriminative and domain-invariant features for domain adaptation in speech emotion recognition. Speech Communication, 93, 1–10.

  • McFee, B., Raffel, C., Liang, D., Ellis, D. P., McVicar, M., Battenberg, E., & Nieto, O. (2015). Librosa: Audio and music signal analysis in python. In Proceedings of the 14th Python in science conference (Vol. 8, pp. 18–25). Citeseer.

  • Meng, H., Yan, T., Yuan, F., & Wei, H. (2019). Speech emotion recognition from 3D Log-Mel spectrograms with deep learning network. IEEE Access, 7, 125868–125881.

  • Mirsamadi, S., Barsoum, E., & Zhang, C. (2017). Automatic speech emotion recognition using recurrent neural networks with local attention. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 2227–2231). IEEE.

  • Morrison, D., Wang, R., & De Silva, L. C. (2007). Ensemble methods for spoken emotion recognition in call-centres. Speech Communication, 49(2), 98–112.

  • Mustaqeem, & Kwon, S. (2019). A CNN-assisted enhanced audio signal processing for speech emotion recognition. Sensors, 20(1), 183.

  • Mustaqeem, & Kwon, S. (2021). Att-Net: Enhanced emotion recognition system using lightweight self-attention module. Applied Soft Computing, 102, 107101.

  • Mustaqeem, Sajjad, M., & Kwon, S. (2020). Clustering-based speech emotion recognition by incorporating learned features and deep BILSTM. IEEE Access, 8, 79861–79875.

  • Narayanan, S., & Georgiou, P. G. (2013). Behavioral signal processing: Deriving human behavioral informatics from speech and language. Proceedings of the IEEE, 101(5), 1203–1233.

  • Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. (2019). Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 8026–8037.

  • Rajamani, S. T., Rajamani, K. T., Mallol-Ragolta, A., Liu, S., & Schuller, B. (2021). A novel attention-based gated recurrent unit and its efficacy in speech emotion recognition. In ICASSP 2021-2021 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 6294–6298). IEEE.

  • Rozgić, V., Ananthakrishnan, S., Saleem, S., Kumar, R., & Prasad, R. (2012). Ensemble of SVM trees for multimodal emotion recognition. In Proceedings of the 2012 Asia Pacific signal and information processing association annual summit and conference (pp. 1–4). IEEE.

  • Sarma, M., Ghahremani, P., Povey, D., Goel, N. K., Sarma, K. K., & Dehak, N. (2018). Emotion identification from raw speech signals using DNNs. In Interspeech (pp. 3097–3101).

  • Satt, A., Rozenberg, S., & Hoory, R. (2017). Efficient emotion recognition from speech using deep learning on spectrograms. In Interspeech (pp. 1089–1093).

  • Schuller, B., Rigoll, G., & Lang, M. (2003). Hidden Markov model-based speech emotion recognition. In 2003 IEEE International conference on acoustics, speech, and signal processing, 2003. Proceedings (ICASSP’03) (Vol. 2, p. 1). IEEE.

  • Schuller, B., Vlasenko, B., Eyben, F., Wöllmer, M., Stuhlsatz, A., Wendemuth, A., & Rigoll, G. (2010). Cross-corpus acoustic emotion recognition: Variances and strategies. IEEE Transactions on Affective Computing, 1(2), 119–131.

  • Tellai, M., Gao, L., & Mao, Q. (2023). An efficient speech emotion recognition based on a dual-stream CNN-transformer fusion network. International Journal of Speech Technology, 26(2), 1–17.

  • Thornton, M. A., & Tamir, D. I. (2017). Mental models accurately predict emotion transitions. Proceedings of the National Academy of Sciences of the United States of America, 114(23), 5982–5987.

  • Trigeorgis, G., Ringeval, F., Brueckner, R., Marchi, E., Nicolaou, M. A., Schuller, B., & Zafeiriou, S. (2016). Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 5200–5204). IEEE.

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (NIPS) (Vol. 30).

  • Xu, M., Zhang, F., Cui, X., & Zhang, W. (2021). Speech emotion recognition with multiscale area attention and data augmentation. In ICASSP 2021-2021 IEEE International conference on acoustics, speech and signal processing (ICASSP) (pp. 6319–6323). IEEE.

  • Xu, X., Deng, J., Cummins, N., Zhang, Z., Wu, C., Zhao, L., & Schuller, B. (2017). A two-dimensional framework of multiple kernel subspace learning for recognizing emotion in speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(7), 1436–1449.

  • Yeh, S.-L., Lin, Y.-S., & Lee, C.-C. (2019). An interaction-aware attention network for speech emotion recognition in spoken dialogs. In ICASSP 2019-2019 IEEE International conference on acoustics, speech and signal processing (ICASSP) (pp. 6685–6689). IEEE.

  • Yeh, S.-L., Lin, Y.-S., & Lee, C.-C. (2020). A dialogical emotion decoder for speech emotion recognition in spoken dialog. In ICASSP 2020-2020 IEEE International conference on acoustics, speech and signal processing (ICASSP) (pp. 6479–6483). IEEE.

  • Yoon, S., Byun, S., Dey, S., & Jung, K. (2019). Speech emotion recognition using multi-hop attention mechanism. In ICASSP 2019-2019 IEEE International conference on acoustics, speech and signal processing (ICASSP) (pp. 2822–2826). IEEE.

  • Zayene, B., Jlassi, C., & Arous, N. (2020). 3D convolutional recurrent global neural network for speech emotion recognition. In 2020 5th International conference on advanced technologies for signal and image processing (ATSIP) (pp. 1–5). IEEE.

  • Zhang, S., Zhang, S., Huang, T., & Gao, W. (2017). Speech emotion recognition using deep convolutional neural network and discriminant temporal pyramid matching. IEEE Transactions on Multimedia, 20(6), 1576–1590.

  • Zhao, J., Mao, X., & Chen, L. (2019). Speech emotion recognition using deep 1D & 2D CNN LSTM networks. Biomedical Signal Processing and Control, 47, 312–323.

  • Zhou, S., Jia, J., Wang, Q., Dong, Y., Yin, Y., & Lei, K. (2018). Inferring emotion from conversational voice data: A semi-supervised multi-path generative neural network approach. In Proceedings of the AAAI conference on artificial intelligence (Vol. 32).

Funding

This work is supported in part by the Key Projects of the National Natural Science Foundation of China under Grant U1836220, the National Natural Science Foundation of China under Grant 62176106, and the Jiangsu Province Key Research and Development Plan under Grant BE2020036.

Author information

Contributions

MT: Conceptualization, data curation, investigation, methodology, resources, software, validation, visualization, writing—original draft, writing—review and editing. QM: Supervision, formal analysis, methodology, validation, writing—review and editing.

Corresponding author

Correspondence to Qirong Mao.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Tellai, M., Mao, Q. CCTG-NET: Contextualized Convolutional Transformer-GRU Network for speech emotion recognition. Int J Speech Technol 26, 1099–1116 (2023). https://doi.org/10.1007/s10772-023-10080-7
