DOI: 10.1145/3551876.3554813

Leveraging Multi-modal Interactions among the Intermediate Representations of Deep Transformers for Emotion Recognition

Published: 10 October 2022

ABSTRACT

Multi-modal emotion recognition aims to recognize emotional states from multi-modal inputs. Existing end-to-end models typically fuse the uni-modal representations in the last layers without leveraging the multi-modal interactions among the intermediate representations. In this paper, we propose the multi-modal Recurrent Intermediate-Layer Aggregation (RILA) model to explore the effectiveness of leveraging the multi-modal interactions among the intermediate representations of deep pre-trained transformers for end-to-end emotion recognition. At the heart of our model is the Intermediate-Representation Fusion Module (IRFM), which consists of a multi-modal aggregation gating module and a multi-modal token attention module. Specifically, at each layer, we first use the multi-modal aggregation gating module to capture the utterance-level interactions across the modalities and layers. Then we utilize the multi-modal token attention module to leverage the token-level multi-modal interactions. The experimental results on IEMOCAP and CMU-MOSEI show that our model achieves state-of-the-art performance, benefiting from fully exploiting the multi-modal interactions among the intermediate representations.
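The minimal PyTorch-style sketch below illustrates how the two IRFM components described in the abstract could fit together at one intermediate layer: a gating module that fuses per-modality utterance-level summaries, followed by a token-level attention step. All class names, dimensions, pooling choices, and wiring here are assumptions made for illustration, not the paper's released implementation.

# Illustrative sketch only; module names and shapes are assumptions.
import torch
import torch.nn as nn

class AggregationGate(nn.Module):
    # Hypothetical utterance-level gating over per-modality summaries at one layer.
    def __init__(self, dim: int, num_modalities: int):
        super().__init__()
        self.gate = nn.Linear(num_modalities * dim, num_modalities)

    def forward(self, summaries):
        # summaries: list of (batch, dim) pooled vectors, one per modality
        weights = torch.softmax(self.gate(torch.cat(summaries, dim=-1)), dim=-1)
        stacked = torch.stack(summaries, dim=1)               # (batch, M, dim)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)   # (batch, dim)

class TokenAttention(nn.Module):
    # Hypothetical token-level step: the gated utterance vector queries token states.
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, fused, tokens):
        # fused: (batch, dim) utterance-level fusion; tokens: (batch, seq, dim) intermediate-layer states
        out, _ = self.attn(fused.unsqueeze(1), tokens, tokens)
        return out.squeeze(1)                                  # (batch, dim)

# Toy usage with random intermediate representations for two modalities:
batch, seq, dim = 2, 16, 256
text_h, audio_h = torch.randn(batch, seq, dim), torch.randn(batch, seq, dim)
fused = AggregationGate(dim, num_modalities=2)([text_h.mean(dim=1), audio_h.mean(dim=1)])
token_level = TokenAttention(dim)(fused, torch.cat([text_h, audio_h], dim=1))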


        • Published in

          MuSe' 22: Proceedings of the 3rd International on Multimodal Sentiment Analysis Workshop and Challenge
          October 2022
          118 pages
          ISBN:9781450394840
          DOI:10.1145/3551876

          Copyright © 2022 ACM


          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 10 October 2022


          Qualifiers

          • research-article

          Acceptance Rates

          MuSe' 22 paper acceptance rate: 14 of 17 submissions, 82%
          Overall acceptance rate: 14 of 17 submissions, 82%

        • Article Metrics

          • Downloads (last 12 months): 530
          • Downloads (last 6 weeks): 95
