
K-armed Bandit based Multi-Modal Network Architecture Search for Visual Question Answering

Published: 12 October 2020

Abstract

In this paper, we propose a cross-modal network architecture search (NAS) algorithm for visual question answering (VQA), termed k-Armed Bandit based NAS (KAB-NAS). KAB-NAS treats the design of each layer as a k-armed bandit problem and updates the preference of each candidate through repeated sampling in a single-shot search framework. To establish an effective search space, we further propose a new architecture, termed Automatic Graph Attention Network (AGAN), which extends the popular self-attention layer with three graph structures, denoted dense-graph, co-graph, and separate-graph. These graph layers define the direction of information propagation in the graph network, and their optimal combination is searched by KAB-NAS. To evaluate KAB-NAS and AGAN, we conduct extensive experiments on two VQA benchmark datasets, i.e., VQA 2.0 and GQA, and also test AGAN with the popular BERT-style pre-training. The experimental results show that, with the help of KAB-NAS, AGAN achieves state-of-the-art performance on both benchmarks with far fewer parameters and computations.
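
To make the abstract's central idea concrete, the sketch below illustrates how a per-layer k-armed bandit search could look: each layer of a super-network keeps a preference value for the three candidate graph structures named above (dense-graph, co-graph, separate-graph), an architecture is sampled one candidate per layer, and the preferences are updated with the reward of the sampled sub-network, as in a single-shot (weight-sharing) setting. This is a minimal, hypothetical illustration under assumed epsilon-greedy sampling and running-mean updates, not the authors' KAB-NAS implementation; the candidate names come from the abstract, while the class, reward function, and hyperparameters are assumptions.

```python
import random

# Candidate graph-attention operations per layer (names taken from the abstract).
CANDIDATES = ["dense-graph", "co-graph", "separate-graph"]


class LayerBandit:
    """One k-armed bandit per layer: keeps a preference (running-mean reward)
    for each candidate operation. Hypothetical sketch, not the paper's code."""

    def __init__(self, candidates, epsilon=0.1):
        self.candidates = list(candidates)
        self.epsilon = epsilon                       # exploration rate (assumed)
        self.counts = {c: 0 for c in self.candidates}
        self.values = {c: 0.0 for c in self.candidates}

    def sample(self):
        # Epsilon-greedy: usually pick the currently preferred candidate.
        if random.random() < self.epsilon:
            return random.choice(self.candidates)
        return max(self.candidates, key=lambda c: self.values[c])

    def update(self, candidate, reward):
        # Incremental running-mean update of the candidate's preference.
        self.counts[candidate] += 1
        n = self.counts[candidate]
        self.values[candidate] += (reward - self.values[candidate]) / n


def evaluate(architecture):
    # Placeholder for the reward of the sampled sub-network inside a
    # single-shot (weight-sharing) super-network, e.g. validation accuracy.
    return random.random()


num_layers = 6                                       # assumed network depth
bandits = [LayerBandit(CANDIDATES) for _ in range(num_layers)]

for step in range(1000):                             # "numerous samplings"
    arch = [b.sample() for b in bandits]             # one candidate per layer
    reward = evaluate(arch)
    for bandit, choice in zip(bandits, arch):
        bandit.update(choice, reward)

# Final architecture: the most preferred candidate at every layer.
best = [max(b.candidates, key=lambda c: b.values[c]) for b in bandits]
print("Preferred operation per layer:", best)
```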

Supplementary Material

MP4 File (3394171.3413998.mp4)


    Published In

    MM '20: Proceedings of the 28th ACM International Conference on Multimedia
    October 2020
    4889 pages
    ISBN:9781450379885
    DOI:10.1145/3394171

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 12 October 2020


    Author Tags

    1. network architecture search
    2. visual question answering

    Qualifiers

    • Research-article

    Funding Sources

    • National Key R&D Program
    • National Natural Science Foundation of China
    • Key R&D Program of Jiangxi Province

    Conference

    MM '20

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
