
An efficient speech emotion recognition based on a dual-stream CNN-transformer fusion network

International Journal of Speech Technology

Abstract

Machine learning and artificial intelligence enable the creation of intelligent systems. A speech emotion recognition system analyzes a speaker’s speech to determine the speaker’s emotional state, which is a challenging pattern recognition task. This paper proposes a new robust and lightweight speech emotion recognition system based on a dual-stream CNN-Transformer fusion network that effectively captures both the spatial and the long-range temporal information in the raw features by processing MFCCs and Mel-spectrograms in parallel. Experiments are performed on widely used emotional benchmark datasets, where the approach proves highly efficient: it outperforms the best-known state-of-the-art models with accuracies of 97.64%, 99.42%, and 97.53% on the RAVDESS, TESS, and EMO-DB datasets, respectively. These results demonstrate the significant advantages of the proposed model and the ability of the architecture to learn discriminative emotional features accurately.
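As a concrete illustration of the dual-stream design described above, the following is a minimal PyTorch sketch: each feature type (MFCCs, Mel-spectrogram) feeds its own CNN-Transformer branch, and the two branch embeddings are fused for classification. The hyperparameters here (40 MFCCs, 128 Mel bands, two Transformer layers per stream, concatenation fusion, 8 emotion classes) are illustrative assumptions, not the authors' published configuration.

```python
# Minimal sketch of a dual-stream CNN-Transformer SER model.
# Assumed hyperparameters throughout; not the authors' exact architecture.
import torch
import torch.nn as nn
import librosa


def extract_features(path, sr=16000, n_mfcc=40, n_mels=128):
    """One common way to compute the two input representations with librosa."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)           # (40, T)
    mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels))   # (128, T)
    return mfcc, mel


class Stream(nn.Module):
    """One branch: a small 1-D CNN for local spectral-temporal patterns,
    followed by a Transformer encoder for long-range temporal context."""

    def __init__(self, n_features, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        # Convolutions run over time; input channels are spectral bins.
        self.cnn = nn.Sequential(
            nn.Conv1d(n_features, d_model, kernel_size=5, padding=2),
            nn.BatchNorm1d(d_model), nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
            nn.BatchNorm1d(d_model), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=256,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x):                        # x: (batch, n_features, time)
        h = self.cnn(x)                          # (batch, d_model, time)
        h = self.encoder(h.transpose(1, 2))      # (batch, time, d_model)
        return h.mean(dim=1)                     # average pool over time


class DualStreamSER(nn.Module):
    """Parallel MFCC and Mel-spectrogram streams fused by concatenation."""

    def __init__(self, n_mfcc=40, n_mels=128, d_model=128, n_classes=8):
        super().__init__()
        self.mfcc_stream = Stream(n_mfcc, d_model)
        self.mel_stream = Stream(n_mels, d_model)
        self.classifier = nn.Linear(2 * d_model, n_classes)

    def forward(self, mfcc, mel):
        fused = torch.cat([self.mfcc_stream(mfcc),
                           self.mel_stream(mel)], dim=1)
        return self.classifier(fused)            # (batch, n_classes) logits


# Smoke test on dummy features: batch of 2 clips, 300 frames each.
model = DualStreamSER()
logits = model(torch.randn(2, 40, 300), torch.randn(2, 128, 300))
print(logits.shape)  # torch.Size([2, 8]) -- one logit per emotion class
```

In practice, clips of different lengths would need padding or truncation before batching, and concatenation is only one of several reasonable fusion strategies; the published model's pooling and fusion details may differ.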




Data and code availability

The data and code that support the results presented in this article are available upon request.


Funding

This work is supported in part by the Key Projects of the National Natural Science Foundation of China under Grant U1836220, the National Natural Science Foundation of China under Grant 62176106, and the Jiangsu Province Key Research and Development Plan (BE2020036).

Author information

Authors and Affiliations

Authors

Contributions

MT: Conceptualization, Data curation, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing—original draft, Writing—review & editing. LG: Formal analysis, Validation, Writing—review & editing. QM: Supervision, Formal analysis, Methodology, Validation, Writing—review & editing.

Corresponding author

Correspondence to Qirong Mao.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Tellai, M., Gao, L. & Mao, Q. An efficient speech emotion recognition based on a dual-stream CNN-transformer fusion network. Int J Speech Technol 26, 541–557 (2023). https://doi.org/10.1007/s10772-023-10035-y

