DOI: 10.1145/3444685.3446302

C3VQG: category consistent cyclic visual question generation

Published: 03 May 2021

Abstract

Visual Question Generation (VQG) is the task of generating natural questions based on an image. Past methods have explored image-to-sequence architectures trained with maximum likelihood, which generate meaningful questions given an image and its associated ground-truth answer. VQG becomes more challenging when the image contains rich contextual information describing its different semantic categories. In this paper, we exploit the different visual cues and concepts in an image to generate questions using a variational autoencoder (VAE) without ground-truth answers. Our approach addresses two major shortcomings of existing VQG systems: (i) minimizing the level of supervision and (ii) replacing generic questions with category-relevant generations. Most importantly, eliminating expensive answer annotations weakens the required supervision. Conditioning on answer categories enables us to exploit different concepts, as inference requires only the image and the category. Mutual information is maximized between the image, the question, and the answer category in the latent space of our VAE. A novel category-consistent cyclic loss is proposed to enable the model to generate predictions that are consistent with the answer category, reducing redundancies and irregularities. Additionally, we impose supplementary constraints on the latent space of our generative model to provide structure based on categories and to enhance generalization by encapsulating decorrelated features within each dimension. Through extensive experiments, we show that the proposed model, C3VQG, outperforms state-of-the-art VQG methods under weak supervision.
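
The cyclic mechanism described above lends itself to a short sketch. The following is a minimal PyTorch-style illustration of one training step with a category-consistent cyclic term, assuming hypothetical module names (image_encoder, question_decoder, question_encoder, category_classifier) and a weighting factor lambda_cyc; it sketches the idea as stated in the abstract, not the authors' implementation.

```python
# Hypothetical sketch of a category-consistent cyclic VQG training step.
# All module names and the loss weighting are illustrative assumptions,
# not the C3VQG authors' code.
import torch
import torch.nn.functional as F

def c3vqg_style_step(image_encoder, question_decoder, question_encoder,
                     category_classifier, image, category, question_tokens,
                     lambda_cyc=1.0):
    # Encode the image conditioned on the answer category into a Gaussian
    # posterior over the latent space (standard VAE encoder).
    mu, logvar = image_encoder(image, category)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize

    # Decode per-step vocabulary logits for the question from the latent code
    # (teacher forcing and sequence details omitted for brevity).
    logits = question_decoder(z)  # shape: (batch, seq_len, vocab)

    # Reconstruction loss against ground-truth question tokens; note that no
    # ground-truth *answers* are needed, only questions and answer categories.
    recon = F.cross_entropy(logits.flatten(0, 1), question_tokens.flatten())

    # KL divergence of the approximate posterior from a unit Gaussian prior.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

    # Cyclic consistency: re-encode the generated question and require a
    # classifier to recover the category it was conditioned on. (In practice
    # a differentiable relaxation such as Gumbel-softmax would replace the
    # argmax so that gradients can flow through the sampled tokens.)
    generated = logits.argmax(dim=-1)
    cyc = F.cross_entropy(category_classifier(question_encoder(generated)),
                          category)

    return recon + kl + lambda_cyc * cyc
```

At inference time, only the image and a sampled answer category would be fed through the encoder-decoder path, which matches the abstract's claim that generation requires no ground-truth answers.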

Supplementary Material

PDF File (a49-uppal-suppl.pdf)
Supplemental files.


Cited By

  • (2024) Learning by Asking Questions for Knowledge-Based Novel Object Recognition. International Journal of Computer Vision 132:6, 2290-2309. DOI: 10.1007/s11263-023-01976-7
  • (2023) Diversity Learning Based on Multi-Latent Space for Medical Image Visual Question Generation. Sensors 23:3, 1057. DOI: 10.3390/s23031057
  • (2023) K-VQG: Knowledge-aware Visual Question Generation for Common-sense Acquisition. 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 4390-4398. DOI: 10.1109/WACV56688.2023.00438
  • (2023) Visual Question Generation From Remote Sensing Images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 16, 3279-3293. DOI: 10.1109/JSTARS.2023.3261361
  • (2023) A Medical Domain Visual Question Generation Model via Large Language Model. 2023 International Conference on Consumer Electronics - Taiwan (ICCE-Taiwan), 163-164. DOI: 10.1109/ICCE-Taiwan58799.2023.10227045
  • (2023) Automatic question generation: a review of methodologies, datasets, evaluation metrics, and applications. Progress in Artificial Intelligence 12:1, 1-32. DOI: 10.1007/s13748-023-00295-9
  • (2022) Efficient and self-adaptive rationale knowledge base for visual commonsense reasoning. Multimedia Systems 29:5, 3017-3026. DOI: 10.1007/s00530-021-00867-6

        Published In

        MMAsia '20: Proceedings of the 2nd ACM International Conference on Multimedia in Asia
        March 2021
        512 pages
        ISBN:9781450383080
        DOI:10.1145/3444685

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Author Tags

        1. cycle consistency
        2. multimodal
        3. visual question generation

        Qualifiers

        • Research-article

        Conference

        MMAsia '20: ACM Multimedia Asia
        March 7, 2021
        Virtual Event, Singapore

        Acceptance Rates

        Overall acceptance rate: 59 of 204 submissions (29%)


