research-article

Leveraging Multi-modal Interactions among the Intermediate Representations of Deep Transformers for Emotion Recognition

Authors:
Yang Wu, Harbin Institute of Technology, Harbin, China
Zhenyu Zhang, Harbin Institute of Technology, Harbin, China
Pai Peng, Harbin Institute of Technology, Harbin, China
Yanyan Zhao, Harbin Institute of Technology, Harbin, China
Bing Qin, Harbin Institute of Technology, Harbin, China

MuSe' 22: Proceedings of the 3rd International on Multimodal Sentiment Analysis Workshop and Challenge, October 2022, Pages 101–109. https://doi.org/10.1145/3551876.3554813

Published: 10 October 2022

ABSTRACT

Multi-modal emotion recognition aims to recognize emotional states from multi-modal inputs. Existing end-to-end models typically fuse the uni-modal representations only in the last layers, without leveraging the multi-modal interactions among the intermediate representations. In this paper, we propose the multi-modal Recurrent Intermediate-Layer Aggregation (RILA) model to explore the effectiveness of leveraging these interactions among the intermediate representations of deep pre-trained transformers for end-to-end emotion recognition. At the heart of our model is the Intermediate-Representation Fusion Module (IRFM), which consists of a multi-modal aggregation gating module and a multi-modal token attention module. Specifically, at each layer, we first use the multi-modal aggregation gating module to capture the utterance-level interactions across modalities and layers. We then use the multi-modal token attention module to leverage the token-level multi-modal interactions. Experimental results on IEMOCAP and CMU-MOSEI show that our model achieves state-of-the-art performance by fully exploiting the multi-modal interactions among the intermediate representations.
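
The abstract gives no equations for the two IRFM sub-modules, so the following is only a minimal sketch of the general idea of per-layer fusion: an utterance-level gated aggregation carried across layers, followed by token-level attention over all modalities. The functional forms (mean pooling, a scalar sigmoid gate, scaled dot-product attention, a residual update) and every name in the code (`gated_aggregate`, `token_attention`, `fuse_intermediate_layers`, `gate_weights`) are assumptions made for illustration, not the formulation used in RILA.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gated_aggregate(state, utterance_vecs, gate_weights):
    """Utterance-level step (assumed form): blend the running state with the
    mean of the per-modality utterance vectors via a scalar sigmoid gate."""
    fused = mean_pool(utterance_vecs)
    g = sigmoid(dot(gate_weights, state + fused))  # `+` concatenates the lists
    return [g * s + (1.0 - g) * f for s, f in zip(state, fused)]

def token_attention(query, tokens):
    """Token-level step (assumed form): scaled dot-product attention of the
    aggregated state over the tokens of all modalities."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, t) / scale for t in tokens])
    return [sum(w * t[i] for w, t in zip(weights, tokens))
            for i in range(len(query))]

def fuse_intermediate_layers(layers, gate_weights):
    """layers: one dict per transformer layer, mapping a modality name to its
    list of token vectors at that layer. Returns the final fused state."""
    dim = len(gate_weights) // 2
    state = [0.0] * dim
    for layer in layers:
        utterance_vecs = [mean_pool(tokens) for tokens in layer.values()]
        state = gated_aggregate(state, utterance_vecs, gate_weights)
        all_tokens = [t for tokens in layer.values() for t in tokens]
        context = token_attention(state, all_tokens)
        state = [s + c for s, c in zip(state, context)]  # residual update
    return state

# Toy input: two layers, two modalities, 2-dimensional hidden states.
layers = [
    {"text": [[1.0, 0.0], [0.0, 1.0]], "audio": [[0.5, 0.5]]},
    {"text": [[0.2, 0.8], [0.6, 0.4]], "audio": [[0.9, 0.1]]},
]
fused = fuse_intermediate_layers(layers, gate_weights=[0.1, -0.2, 0.3, 0.4])
```

In this sketch the state is updated once per layer, which is one simple way to let information from earlier layers reach the final representation instead of fusing only the last-layer outputs. A trained model would learn the gate parameters and would normally use a tensor library rather than plain lists.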

Index Terms

1. Computing methodologies
  1. Machine learning
    1. Machine learning approaches
      1. Neural networks
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Sentiment analysis
  2. Information systems applications
    1. Multimedia information systems


Published in
MuSe' 22: Proceedings of the 3rd International on Multimodal Sentiment Analysis Workshop and Challenge
October 2022
118 pages
ISBN:9781450394840
DOI:10.1145/3551876
General Chairs:
Shahin Amiriparian, University of Augsburg, GER
Lukas Christ, University of Augsburg, GER
Andreas König, University of Passau, GER
Alan Cowen, Hume AI, USA
Eva-Maria Meßner, University of Ulm, GER
Erik Cambria, Nanyang Technological University, SNG
Björn W. Schuller, Imperial College London, UK
Copyright © 2022 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
multi-modal emotion recognition
multi-modal fusion
Qualifiers
- research-article
Acceptance Rates
MuSe' 22 Paper Acceptance Rate: 14 of 17 submissions, 82%
Overall Acceptance Rate: 14 of 17 submissions, 82%