
K-armed Bandit based Multi-Modal Network Architecture Search for Visual Question Answering

Published: 12 October 2020

Abstract

In this paper, we propose a cross-modal network architecture search (NAS) algorithm for visual question answering (VQA), termed k-Armed Bandit based NAS (KAB-NAS). KAB-NAS treats the design of each layer as a k-armed bandit problem and updates the preference of each candidate through repeated sampling in a single-shot search framework. To establish an effective search space, we further propose a new architecture, termed Automatic Graph Attention Network (AGAN), which extends the popular self-attention layer with three graph structures, denoted dense-graph, co-graph, and separate-graph. These graph layers define the direction of information propagation in the graph network, and their optimal combination is searched by KAB-NAS. To evaluate KAB-NAS and AGAN, we conduct extensive experiments on two VQA benchmark datasets, i.e., VQA 2.0 and GQA, and also test AGAN with the popular BERT-style pre-training. The experimental results show that, with the help of KAB-NAS, AGAN achieves state-of-the-art performance on both benchmarks with far fewer parameters and computations.
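
To make the abstract's central idea concrete, the sketch below illustrates how a per-layer k-armed bandit search could look: each layer of a super-network keeps a preference value for the three candidate graph structures named above (dense-graph, co-graph, separate-graph), an architecture is sampled one candidate per layer, and the preferences are updated with the reward of the sampled sub-network, as in a single-shot (weight-sharing) setting. This is a minimal, hypothetical illustration under assumed epsilon-greedy sampling and running-mean updates, not the authors' KAB-NAS implementation; the candidate names come from the abstract, while the class, reward function, and hyperparameters are assumptions.

```python
import random

# Candidate graph-attention operations per layer (names taken from the abstract).
CANDIDATES = ["dense-graph", "co-graph", "separate-graph"]


class LayerBandit:
    """One k-armed bandit per layer: keeps a preference (running-mean reward)
    for each candidate operation. Hypothetical sketch, not the paper's code."""

    def __init__(self, candidates, epsilon=0.1):
        self.candidates = list(candidates)
        self.epsilon = epsilon                       # exploration rate (assumed)
        self.counts = {c: 0 for c in self.candidates}
        self.values = {c: 0.0 for c in self.candidates}

    def sample(self):
        # Epsilon-greedy: usually pick the currently preferred candidate.
        if random.random() < self.epsilon:
            return random.choice(self.candidates)
        return max(self.candidates, key=lambda c: self.values[c])

    def update(self, candidate, reward):
        # Incremental running-mean update of the candidate's preference.
        self.counts[candidate] += 1
        n = self.counts[candidate]
        self.values[candidate] += (reward - self.values[candidate]) / n


def evaluate(architecture):
    # Placeholder for the reward of the sampled sub-network inside a
    # single-shot (weight-sharing) super-network, e.g. validation accuracy.
    return random.random()


num_layers = 6                                       # assumed network depth
bandits = [LayerBandit(CANDIDATES) for _ in range(num_layers)]

for step in range(1000):                             # "numerous samplings"
    arch = [b.sample() for b in bandits]             # one candidate per layer
    reward = evaluate(arch)
    for bandit, choice in zip(bandits, arch):
        bandit.update(choice, reward)

# Final architecture: the most preferred candidate at every layer.
best = [max(b.candidates, key=lambda c: b.values[c]) for b in bandits]
print("Preferred operation per layer:", best)
```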

Supplementary Material

MP4 File (3394171.3413998.mp4)


    Published In

    MM '20: Proceedings of the 28th ACM International Conference on Multimedia
    October 2020
    4889 pages
    ISBN:9781450379885
    DOI:10.1145/3394171

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 12 October 2020


    Author Tags

    1. network architecture search
    2. visual question answering

    Qualifiers

    • Research-article

    Funding Sources

    • National Key R&D Program
    • National Natural Science Foundation of China
    • Key R&D Program of Jiangxi Province

    Conference

    MM '20

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
