DOI: 10.1145/3591569.3591610

SERVER: Multi-modal Speech Emotion Recognition using Transformer-based and Vision-based Embeddings

Published: 13 July 2023

Abstract

This paper proposes a multi-modal approach to speech emotion recognition (SER) that uses both text and audio inputs. The audio embedding is extracted with a vision-based architecture, VGGish, while the text embedding is extracted with a transformer-based architecture, BERT. These embeddings are then fused by concatenation to recognize emotional states. To evaluate the effectiveness of the proposed method, the benchmark IEMOCAP dataset is employed. Experimental results indicate that the proposed method is highly competitive, outperforming most recent state-of-the-art multi-modal SER methods. It achieves 63.00% unweighted accuracy (UA) and 63.10% weighted accuracy (WA) on the IEMOCAP dataset. In future work, multi-task learning and multi-lingual extensions will be investigated to improve the performance and robustness of multi-modal SER. For reproducibility, our code is publicly available.
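As a rough illustration of the concatenation-based fusion described in the abstract, the following Python (PyTorch) sketch combines a BERT utterance embedding with a VGGish audio embedding and feeds the result to a small classifier head. This is not the authors' released implementation: the embedding dimensions (768 for BERT, 128 for VGGish), the hidden size, the dropout rate, and the four-class IEMOCAP setup are assumptions made for illustration only.

import torch
import torch.nn as nn

class ConcatFusionSER(nn.Module):
    """Concatenation fusion of text (BERT) and audio (VGGish) embeddings for SER.
    Minimal sketch; dimensions and the classifier head are illustrative assumptions."""
    def __init__(self, text_dim=768, audio_dim=128, hidden_dim=256, num_classes=4):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + audio_dim, hidden_dim),  # fused embedding -> hidden layer
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, num_classes),           # hidden layer -> emotion logits
        )

    def forward(self, text_emb, audio_emb):
        # text_emb:  (batch, 768) utterance embedding from BERT (e.g. the [CLS] token)
        # audio_emb: (batch, 128) utterance embedding from VGGish (frame-wise outputs averaged)
        fused = torch.cat([text_emb, audio_emb], dim=-1)  # simple concatenation fusion
        return self.classifier(fused)

# Random tensors stand in for real BERT / VGGish outputs in this smoke test.
model = ConcatFusionSER()
logits = model(torch.randn(8, 768), torch.randn(8, 128))
print(logits.shape)  # torch.Size([8, 4])

Concatenation is the simplest fusion choice: it preserves both modality embeddings unchanged and leaves cross-modal interaction to be learned implicitly by the classifier layers.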




Information

    Published In

    ICIIT '23: Proceedings of the 2023 8th International Conference on Intelligent Information Technology
    February 2023
    310 pages
    ISBN:9781450399616
    DOI:10.1145/3591569
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 13 July 2023


    Author Tags

    1. BERT
    2. VGGish
    3. multi-modal emotion recognition
    4. speech emotion recognition

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ICIIT 2023


    Cited By

    • (2024) Dimensional Speech Emotion Recognition from Bimodal Features. Anais do XXIV Simpósio Brasileiro de Computação Aplicada à Saúde (SBCAS 2024), pages 579-590. https://doi.org/10.5753/sbcas.2024.2779. Online publication date: 25-Jun-2024.
    • (2024) Enhancing Speech Emotion Recognition Through Knowledge Distillation. 2024 15th International Conference on Information and Communication Technology Convergence (ICTC), pages 197-202. https://doi.org/10.1109/ICTC62082.2024.10826904. Online publication date: 16-Oct-2024.
    • (2024) MERSA: Multimodal Emotion Recognition with Self-Align Embedding. 2024 International Conference on Information Networking (ICOIN), pages 500-505. https://doi.org/10.1109/ICOIN59985.2024.10572116. Online publication date: 17-Jan-2024.
    • (2023) Comparative analysis of multi-loss functions for enhanced multi-modal speech emotion recognition. 2023 14th International Conference on Information and Communication Technology Convergence (ICTC), pages 425-429. https://doi.org/10.1109/ICTC58733.2023.10392928. Online publication date: 11-Oct-2023.
    • (2023) Multi-modal Speech Emotion Recognition: Improving Accuracy Through Fusion of VGGish and BERT Features with Multi-head Attention. Industrial Networks and Intelligent Systems, pages 148-158. https://doi.org/10.1007/978-3-031-47359-3_11. Online publication date: 31-Oct-2023.
