DOI: 10.1145/3591569.3591610

SERVER: Multi-modal Speech Emotion Recognition using Transformer-based and Vision-based Embeddings

Published: 13 July 2023

Abstract

This paper proposes a multi-modal approach to speech emotion recognition (SER) that uses both text and audio inputs. The audio embedding is extracted with a vision-based architecture, VGGish, while the text embedding is extracted with a transformer-based architecture, BERT. These embeddings are then fused by concatenation to recognize emotional states. To evaluate the effectiveness of the proposed method, the benchmark IEMOCAP dataset is employed. Experimental results indicate that the proposed method is highly competitive, outperforming most recent state-of-the-art multi-modal SER methods. It achieves 63.00% unweighted accuracy (UA) and 63.10% weighted accuracy (WA) on the IEMOCAP dataset. In future work, multi-task learning and multi-lingual extensions will be investigated to improve the performance and robustness of multi-modal SER. For reproducibility, our code is publicly available.
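As a rough illustration of the concatenation-based fusion described in the abstract, the following Python (PyTorch) sketch combines a BERT utterance embedding with a VGGish audio embedding and feeds the result to a small classifier head. This is not the authors' released implementation: the embedding dimensions (768 for BERT, 128 for VGGish), the hidden size, the dropout rate, and the four-class IEMOCAP setup are assumptions made for illustration only.

import torch
import torch.nn as nn

class ConcatFusionSER(nn.Module):
    """Concatenation fusion of text (BERT) and audio (VGGish) embeddings for SER.
    Minimal sketch; dimensions and the classifier head are illustrative assumptions."""
    def __init__(self, text_dim=768, audio_dim=128, hidden_dim=256, num_classes=4):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + audio_dim, hidden_dim),  # fused embedding -> hidden layer
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, num_classes),           # hidden layer -> emotion logits
        )

    def forward(self, text_emb, audio_emb):
        # text_emb:  (batch, 768) utterance embedding from BERT (e.g. the [CLS] token)
        # audio_emb: (batch, 128) utterance embedding from VGGish (frame-wise outputs averaged)
        fused = torch.cat([text_emb, audio_emb], dim=-1)  # simple concatenation fusion
        return self.classifier(fused)

# Random tensors stand in for real BERT / VGGish outputs in this smoke test.
model = ConcatFusionSER()
logits = model(torch.randn(8, 768), torch.randn(8, 128))
print(logits.shape)  # torch.Size([8, 4])

Concatenation is the simplest fusion choice: it preserves both modality embeddings unchanged and leaves cross-modal interaction to be learned implicitly by the classifier layers.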




Information

    Published In

    ICIIT '23: Proceedings of the 2023 8th International Conference on Intelligent Information Technology
    February 2023
    310 pages
    ISBN:9781450399616
    DOI:10.1145/3591569
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 13 July 2023


    Author Tags

    1. BERT
    2. VGGish
    3. multi-modal emotion recognition
    4. speech emotion recognition

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ICIIT 2023


    Cited By

    • (2024) Dimensional Speech Emotion Recognition from Bimodal Features. Anais do XXIV Simpósio Brasileiro de Computação Aplicada à Saúde (SBCAS 2024), pages 579-590. https://doi.org/10.5753/sbcas.2024.2779. Online publication date: 25-Jun-2024.
    • (2024) Enhancing Speech Emotion Recognition Through Knowledge Distillation. 2024 15th International Conference on Information and Communication Technology Convergence (ICTC), pages 197-202. https://doi.org/10.1109/ICTC62082.2024.10826904. Online publication date: 16-Oct-2024.
    • (2024) MERSA: Multimodal Emotion Recognition with Self-Align Embedding. 2024 International Conference on Information Networking (ICOIN), pages 500-505. https://doi.org/10.1109/ICOIN59985.2024.10572116. Online publication date: 17-Jan-2024.
    • (2023) Comparative analysis of multi-loss functions for enhanced multi-modal speech emotion recognition. 2023 14th International Conference on Information and Communication Technology Convergence (ICTC), pages 425-429. https://doi.org/10.1109/ICTC58733.2023.10392928. Online publication date: 11-Oct-2023.
    • (2023) Multi-modal Speech Emotion Recognition: Improving Accuracy Through Fusion of VGGish and BERT Features with Multi-head Attention. Industrial Networks and Intelligent Systems, pages 148-158. https://doi.org/10.1007/978-3-031-47359-3_11. Online publication date: 31-Oct-2023.
