DRDF: Determining the Importance of Different Multimodal Information with Dual-Router Dynamic Framework

ABSTRACT
In multimodal tasks, the relative importance of text and image information often varies from input to input. To model these differences in importance, we propose the Dual-Router Dynamic Framework (DRDF), a high-performance and highly general architecture consisting of a Dual-Router, an MWF-Layer, experts, and an expert fusion unit. The text router and image router in the Dual-Router take text-modal and image-modal information as input, respectively, and the MWF-Layer determines the importance of each modality. Based on this determination, the MWF-Layer generates fused weights for the subsequent expert fusion. The experts can adopt a variety of backbones matching the current multimodal or unimodal task. DRDF is highly general and modular: we test 12 backbones, such as Visual BERT, and their corresponding DRDF instances on the multimodal dataset Hateful Memes and on the unimodal datasets CIFAR-10, CIFAR-100, and Tiny ImageNet, and our DRDF instances outperform those backbones. We also validate the effectiveness of each DRDF component through ablation studies, and discuss the rationale behind the design of DRDF.
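The excerpt does not include the paper's implementation, so the following is a minimal PyTorch sketch of the architecture the abstract describes: two routers embed the text and image inputs, a modal-weight-fusion layer turns them into per-expert weights, and the outputs of interchangeable expert backbones are combined by those weights. All class names, dimensions, and the weighted-sum fusion rule are illustrative assumptions, not the paper's specification.

```python
# Illustrative sketch of DRDF-style routing and expert fusion.
# Names, layer choices, and the fusion rule are assumptions, not the paper's code.
import torch
import torch.nn as nn


class MWFLayer(nn.Module):
    """Hypothetical modal-weight-fusion layer: judges the relative importance
    of the two routed modalities and emits one normalized weight per expert."""

    def __init__(self, route_dim: int, num_experts: int):
        super().__init__()
        self.proj = nn.Linear(2 * route_dim, num_experts)

    def forward(self, text_route: torch.Tensor, image_route: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([text_route, image_route], dim=-1)
        return torch.softmax(self.proj(fused), dim=-1)  # (batch, num_experts)


class ToyExpert(nn.Module):
    """Stand-in for a real backbone (e.g. Visual BERT); any module with the
    same (text_feat, image_feat) -> logits interface could be plugged in."""

    def __init__(self, text_dim: int, image_dim: int, num_classes: int):
        super().__init__()
        self.head = nn.Linear(text_dim + image_dim, num_classes)

    def forward(self, text_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        return self.head(torch.cat([text_feat, image_feat], dim=-1))


class DRDFSketch(nn.Module):
    def __init__(self, text_dim: int, image_dim: int, route_dim: int, experts: list):
        super().__init__()
        self.text_router = nn.Linear(text_dim, route_dim)    # text router
        self.image_router = nn.Linear(image_dim, route_dim)  # image router
        self.mwf = MWFLayer(route_dim, len(experts))
        self.experts = nn.ModuleList(experts)

    def forward(self, text_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        # Per-input fusion weights from the two routers via the MWF-Layer.
        w = self.mwf(self.text_router(text_feat), self.image_router(image_feat))
        # Run every expert and stack predictions: (batch, num_experts, num_classes).
        outs = torch.stack([e(text_feat, image_feat) for e in self.experts], dim=1)
        # Expert fusion unit, modeled here as a weighted sum over expert outputs.
        return (w.unsqueeze(-1) * outs).sum(dim=1)


model = DRDFSketch(text_dim=768, image_dim=2048, route_dim=128,
                   experts=[ToyExpert(768, 2048, 2) for _ in range(3)])
logits = model(torch.randn(4, 768), torch.randn(4, 2048))  # -> shape (4, 2)
```

Under this reading, the per-input weights make the modality-importance judgment explicit: an input whose signal lies mostly in the text shifts weight toward the experts that the text routing favors, which is the behavior the abstract attributes to the MWF-Layer.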
REFERENCES
- Long Chen, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao, Wei Liu, and Tat-Seng Chua. 2017. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5659--5667.
- Yinpeng Chen, Xiyang Dai, Mengchen Liu, Dongdong Chen, Lu Yuan, and Zicheng Liu. 2020. Dynamic convolution: Attention over convolution kernels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11030--11039.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT (1).
- Hang Gao, Xizhou Zhu, Stephen Lin, and Jifeng Dai. 2019. Deformable Kernels: Adapting Effective Receptive Fields for Object Deformation. In International Conference on Learning Representations.
- Yizeng Han, Gao Huang, Shiji Song, Le Yang, Honghui Wang, and Yulin Wang. 2021. Dynamic neural networks: A survey. arXiv preprint arXiv:2102.04906 (2021).
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770--778.
- Chiori Hori, Takaaki Hori, Teng-Yok Lee, Ziming Zhang, Bret Harsham, John R. Hershey, Tim K. Marks, and Kazuhiko Sumi. 2017. Attention-based multimodal fusion for video description. In Proceedings of the IEEE International Conference on Computer Vision. 4193--4202.
- Jie Hu, Li Shen, and Gang Sun. 2018. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7132--7141.
- Douwe Kiela, Suvrat Bhooshan, Hamed Firooz, and Davide Testuggine. 2019. Supervised multimodal bitransformers for classifying images and text. arXiv preprint arXiv:1909.02950 (2019).
- Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Pratik Ringshia, and Davide Testuggine. 2020. The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes. Advances in Neural Information Processing Systems 33 (2020).
- Alex Krizhevsky, Geoffrey Hinton, et al. 2009. Learning multiple layers of features from tiny images. (2009).
- Ya Le and Xuan Yang. 2015. Tiny ImageNet visual recognition challenge. CS 231N 7 (2015).
- Gen Li, Nan Duan, Yuejian Fang, Ming Gong, and Daxin Jiang. 2020. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 11336--11344.
- Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557 (2019).
- Xiang Li, Wenhai Wang, Xiaolin Hu, and Jian Yang. 2019. Selective kernel networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 510--519.
- Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Proceedings of the 33rd International Conference on Neural Information Processing Systems. 13--23.
- John Edison Arevalo Ovalle, Thamar Solorio, Manuel Montes-y-Gómez, and Fabio A. González. 2017. Gated Multimodal Units for Information Fusion. In ICLR (Workshop).
- Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2556--2565.
- Saurabh Sharma, Ning Yu, Mario Fritz, and Bernt Schiele. 2021. Long-tailed recognition using class-balanced experts. In Pattern Recognition: 42nd DAGM German Conference, DAGM GCPR 2020, Tübingen, Germany, September 28-October 1, 2020, Proceedings 42. Springer, 86--100.
- Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations.
- Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2019. VL-BERT: Pre-training of Generic Visual-Linguistic Representations. In International Conference on Learning Representations.
- Hao Tan and Mohit Bansal. 2019. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 5100--5111.
- Alvin Wan, Lisa Dunlap, Daniel Ho, Jihan Yin, Scott Lee, Henry Jin, Suzanne Petryk, Sarah Adel Bargal, and Joseph E. Gonzalez. 2020. NBDT: Neural-Backed Decision Trees. arXiv preprint arXiv:2004.00221 (2020).
- Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. 2018. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV). 3--19.
- Brandon Yang, Gabriel Bender, Quoc V. Le, and Jiquan Ngiam. 2019. CondConv: Conditionally parameterized convolutions for efficient inference. In Advances in Neural Information Processing Systems. 1307--1318.
- Sergey Zagoruyko and Nikos Komodakis. 2016. Wide Residual Networks. In British Machine Vision Conference 2016. British Machine Vision Association.
- Zilong Zhong, Zhong Qiu Lin, Rene Bidart, Xiaodan Hu, Ibrahim Ben Daya, Zhifeng Li, Wei-Shi Zheng, Jonathan Li, and Alexander Wong. 2020. Squeeze-and-Attention Networks for Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13065--13074.