DOI: 10.1145/3444685.3446302

C3VQG: category consistent cyclic visual question generation

Published: 03 May 2021

Abstract

Visual Question Generation (VQG) is the task of generating natural questions based on an image. Past methods have explored image-to-sequence architectures trained with maximum likelihood, which generate meaningful questions given an image and its associated ground-truth answer. VQG becomes more challenging when the image contains rich contextual information describing its different semantic categories. In this paper, we exploit the different visual cues and concepts in an image to generate questions using a variational autoencoder (VAE) without ground-truth answers. Our approach addresses two major shortcomings of existing VQG systems: (i) minimizing the level of supervision and (ii) replacing generic questions with category-relevant generations. Most importantly, eliminating expensive answer annotations weakens the required supervision. Conditioning on answer categories enables us to exploit different concepts, as inference requires only the image and the category. Mutual information is maximized between the image, the question, and the answer category in the latent space of our VAE. A novel category-consistent cyclic loss is proposed to enable the model to generate predictions that are consistent with the answer category, reducing redundancies and irregularities. Additionally, we impose supplementary constraints on the latent space of our generative model to provide structure based on categories and to enhance generalization by encapsulating decorrelated features within each dimension. Through extensive experiments, we show that the proposed model, C3VQG, outperforms state-of-the-art VQG methods under weak supervision.
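
The cyclic mechanism described above lends itself to a short sketch. The following is a minimal PyTorch-style illustration of one training step with a category-consistent cyclic term, assuming hypothetical module names (image_encoder, question_decoder, question_encoder, category_classifier) and a weighting factor lambda_cyc; it sketches the idea as stated in the abstract, not the authors' implementation.

```python
# Hypothetical sketch of a category-consistent cyclic VQG training step.
# All module names and the loss weighting are illustrative assumptions,
# not the C3VQG authors' code.
import torch
import torch.nn.functional as F

def c3vqg_style_step(image_encoder, question_decoder, question_encoder,
                     category_classifier, image, category, question_tokens,
                     lambda_cyc=1.0):
    # Encode the image conditioned on the answer category into a Gaussian
    # posterior over the latent space (standard VAE encoder).
    mu, logvar = image_encoder(image, category)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize

    # Decode per-step vocabulary logits for the question from the latent code
    # (teacher forcing and sequence details omitted for brevity).
    logits = question_decoder(z)  # shape: (batch, seq_len, vocab)

    # Reconstruction loss against ground-truth question tokens; note that no
    # ground-truth *answers* are needed, only questions and answer categories.
    recon = F.cross_entropy(logits.flatten(0, 1), question_tokens.flatten())

    # KL divergence of the approximate posterior from a unit Gaussian prior.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

    # Cyclic consistency: re-encode the generated question and require a
    # classifier to recover the category it was conditioned on. (In practice
    # a differentiable relaxation such as Gumbel-softmax would replace the
    # argmax so that gradients can flow through the sampled tokens.)
    generated = logits.argmax(dim=-1)
    cyc = F.cross_entropy(category_classifier(question_encoder(generated)),
                          category)

    return recon + kl + lambda_cyc * cyc
```

At inference time, only the image and a sampled answer category would be fed through the encoder-decoder path, which matches the abstract's claim that generation requires no ground-truth answers.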

Supplementary Material

PDF File (a49-uppal-suppl.pdf)
Supplemental files.


Cited By

  • (2024) Learning by Asking Questions for Knowledge-Based Novel Object Recognition. International Journal of Computer Vision 132:6, 2290-2309. DOI: 10.1007/s11263-023-01976-7
  • (2023) Diversity Learning Based on Multi-Latent Space for Medical Image Visual Question Generation. Sensors 23:3, 1057. DOI: 10.3390/s23031057
  • (2023) K-VQG: Knowledge-aware Visual Question Generation for Common-sense Acquisition. 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 4390-4398. DOI: 10.1109/WACV56688.2023.00438
  • (2023) Visual Question Generation From Remote Sensing Images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 16, 3279-3293. DOI: 10.1109/JSTARS.2023.3261361
  • (2023) A Medical Domain Visual Question Generation Model via Large Language Model. 2023 International Conference on Consumer Electronics - Taiwan (ICCE-Taiwan), 163-164. DOI: 10.1109/ICCE-Taiwan58799.2023.10227045
  • (2023) Automatic question generation: a review of methodologies, datasets, evaluation metrics, and applications. Progress in Artificial Intelligence 12:1, 1-32. DOI: 10.1007/s13748-023-00295-9
  • (2022) Efficient and self-adaptive rationale knowledge base for visual commonsense reasoning. Multimedia Systems 29:5, 3017-3026. DOI: 10.1007/s00530-021-00867-6

        Published In

        MMAsia '20: Proceedings of the 2nd ACM International Conference on Multimedia in Asia
        March 2021
        512 pages
        ISBN:9781450383080
        DOI:10.1145/3444685

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Author Tags

        1. cycle consistency
        2. multimodal
        3. visual question generation

        Qualifiers

        • Research-article

        Conference

        MMAsia '20: ACM Multimedia Asia
        March 7, 2021
        Virtual Event, Singapore

        Acceptance Rates

        Overall acceptance rate: 59 of 204 submissions (29%)


