DOI: 10.1145/3394171.3413761

Medical Visual Question Answering via Conditional Reasoning

Published: 12 October 2020

Abstract

Medical visual question answering (Med-VQA) aims to accurately answer a clinical question presented with a medical image. Despite its enormous potential for the healthcare industry and services, the technology is still in its infancy and far from practical use. Med-VQA tasks are highly challenging due to the massive diversity of clinical questions and the disparate visual reasoning skills required by different question types. In this paper, we propose a novel conditional reasoning framework for Med-VQA, aiming to automatically learn effective reasoning skills for various Med-VQA tasks. In particular, we develop a question-conditioned reasoning module to guide importance selection over multimodal fusion features. Considering the different nature of closed-ended and open-ended Med-VQA tasks, we further propose a type-conditioned reasoning module to learn a distinct set of reasoning skills for each of the two task types. Our conditional reasoning framework can be easily applied to existing Med-VQA systems to bring performance gains. In the experiments, we build our system on top of a recent state-of-the-art Med-VQA model and evaluate it on the VQA-RAD benchmark [23]. Remarkably, our system achieves significantly higher accuracy on both closed-ended and open-ended questions, with a 10.8% absolute accuracy gain on open-ended questions. The source code can be downloaded from https://github.com/awenbocc/med-vqa.
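The two modules in the abstract can be sketched as a small gating scheme: a question embedding produces importance weights over the multimodal fusion feature, and a separate set of gate parameters is kept per question type. This is a minimal illustrative sketch only; the class name, shapes, sigmoid-gate formulation, and parameterization are assumptions of ours, not the authors' implementation (see the linked repository for that).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ConditionalReasoning:
    """Question-conditioned gating over a multimodal fusion feature,
    with separate gate parameters for closed- vs. open-ended questions
    (the type-conditioned reasoning idea)."""

    def __init__(self, q_dim, f_dim, seed=0):
        rng = np.random.default_rng(seed)
        # One gate projection per question type; in practice these
        # would be trained jointly with the rest of the model.
        self.W = {t: rng.standard_normal((f_dim, q_dim)) * 0.1
                  for t in ("closed", "open")}

    def __call__(self, q_emb, fusion_feat, q_type):
        # Importance weights conditioned on the question embedding.
        gate = sigmoid(self.W[q_type] @ q_emb)   # shape (f_dim,), in (0, 1)
        return gate * fusion_feat                # reweighted fusion feature

reasoner = ConditionalReasoning(q_dim=4, f_dim=6)
out_open = reasoner(np.ones(4), np.ones(6), q_type="open")
out_closed = reasoner(np.ones(4), np.ones(6), q_type="closed")
```

The same question and image features thus yield different reweighted features depending on the question type, which is the intuition behind learning separate reasoning skills for closed-ended and open-ended questions.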

Supplementary Material

MP4 File (3394171.3413761.mp4)
A brief introduction to our work.

References

[1]
Asma Ben Abacha, Soumya Gayen, Jason J. Lau, Sivaramakrishnan Rajaraman, and Dina Demner-Fushman. 2018. NLM at ImageCLEF 2018 Visual Question Answering in the Medical Domain. In Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum (CEUR Workshop Proceedings, Vol. 2125). CEUR-WS.org, Avignon, France.
[2]
Asma Ben Abacha, Sadid A. Hasan, Vivek V. Datla, Joey Liu, Dina Demner-Fushman, and Henning Müller. 2019. VQA-Med: Overview of the Medical Visual Question Answering Task at ImageCLEF 2019. In Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum (CEUR Workshop Proceedings, Vol. 2380). CEUR-WS.org, Lugano, Switzerland.
[3]
Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR. IEEE Computer Society, Salt Lake City, UT, USA, 6077--6086.
[4]
Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016. Neural Module Networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR. IEEE Computer Society, Las Vegas, NV, USA, 39--48.
[5]
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual Question Answering. In IEEE International Conference on Computer Vision, ICCV. IEEE Computer Society, Santiago, Chile, 2425--2433.
[6]
Kan Chen, Jiang Wang, Liang-Chieh Chen, Haoyuan Gao, Wei Xu, and Ram Nevatia. 2015. ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering. arXiv e-prints (Nov. 2015), arXiv:1511.05960.
[7]
Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. In Proceedings of SSST@EMNLP 2014, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. Association for Computational Linguistics, Doha, Qatar, 103--111.
[8]
Xuanyi Dong, Linchao Zhu, De Zhang, Yi Yang, and Fei Wu. 2018. Fast Parameter Adaptation for Few-shot Image Captioning and Visual Question Answering. In 2018 ACM Multimedia Conference on Multimedia Conference, MM. ACM, Seoul, Republic of Korea, 54--62.
[9]
Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In Proceedings of the 34th International Conference on Machine Learning, ICML (Proceedings of Machine Learning Research, Vol. 70). PMLR, Sydney, NSW, Australia, 1126--1135.
[10]
Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. 2016. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP. The Association for Computational Linguistics, Austin, Texas, USA, 457--468.
[11]
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR. IEEE Computer Society, Honolulu, HI, USA, 6325--6334.
[12]
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation, Vol. 9, 8 (1997), 1735--1780.
[13]
Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. 2017. Learning to Reason: End-to-End Module Networks for Visual Question Answering. In IEEE International Conference on Computer Vision, ICCV. IEEE Computer Society, Venice, Italy, 804--813.
[14]
Drew A. Hudson and Christopher D. Manning. 2018. Compositional Attention Networks for Machine Reasoning. In 6th International Conference on Learning Representations, ICLR. OpenReview.net, Vancouver, BC, Canada.
[15]
Drew A. Hudson and Christopher D. Manning. 2019. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR. Computer Vision Foundation / IEEE, Long Beach, CA, USA, 6700--6709.
[16]
Bogdan Ionescu, Henning Müller, Mauricio Villegas, Alba Garcia Seco de Herrera, Carsten Eickhoff, Vincent Andrearczyk, Yashin Dicente Cid, Vitali Liauchuk, Vassili Kovalev, Sadid A. Hasan, Yuan Ling, Oladimeji Farri, Joey Liu, Matthew P. Lungren, Duc-Tien Dang-Nguyen, Luca Piras, Michael Riegler, Liting Zhou, Mathias Lux, and Cathal Gurrin. 2018. Overview of ImageCLEF 2018: Challenges, Datasets and Evaluation. In Experimental IR Meets Multilinguality, Multimodality, and Interaction - 9th International Conference of the CLEF Association, CLEF (Lecture Notes in Computer Science, Vol. 11018). Springer, Avignon, France, 309--334.
[17]
Yu Jiang, Vivek Natarajan, Xinlei Chen, Marcus Rohrbach, Dhruv Batra, and Devi Parikh. 2018. Pythia v0.1: the Winning Entry to the VQA Challenge 2018. arXiv e-prints (July 2018), arXiv:1807.09956.
[18]
Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross B. Girshick. 2017. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR. IEEE Computer Society, Honolulu, HI, USA, 1988--1997.
[19]
Kushal Kafle and Christopher Kanan. 2017. Visual question answering: Datasets, algorithms, and future challenges. Comput. Vis. Image Underst., Vol. 163 (2017), 3--20.
[20]
Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. 2018. Bilinear Attention Networks. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems, NeurIPS. NeurIPS, Montréal, Canada, 1571--1581.
[21]
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR, Conference Track Proceedings. OpenReview.net, San Diego, CA, USA.
[22]
Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent Convolutional Neural Networks for Text Classification. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence. AAAI Press, Austin, Texas, USA, 2267--2273.
[23]
Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. 2018. A dataset of clinically generated visual questions and answers about radiology images. Scientific data, Vol. 5, 1 (2018), 1--10.
[24]
Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen A. W. M. van der Laak, Bram van Ginneken, and Clara I. Sánchez. 2017. A survey on deep learning in medical image analysis. Medical Image Anal., Vol. 42 (2017), 60--88.
[25]
Fei Liu, Jing Liu, Richang Hong, and Hanqing Lu. 2019. Erasing-based Attention Learning for Visual Question Answering. In Proceedings of the 27th ACM International Conference on Multimedia, MM. ACM, Nice, France, 1175--1183.
[26]
Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2016. Hierarchical Question-Image Co-Attention for Visual Question Answering. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems, NeurIPS. Barcelona, Spain, 289--297.
[27]
Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B. Tenenbaum, and Jiajun Wu. 2019. The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision. In 7th International Conference on Learning Representations, ICLR. OpenReview.net, New Orleans, LA, USA.
[28]
David Mascharka, Philip Tran, Ryan Soklaski, and Arjun Majumdar. 2018. Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR. IEEE Computer Society, Salt Lake City, UT, USA, 4942--4950.
[29]
Jonathan Masci, Ueli Meier, Dan C. Ciresan, and Jürgen Schmidhuber. 2011. Stacked Convolutional Auto-Encoders for Hierarchical Feature Extraction. In Artificial Neural Networks and Machine Learning - ICANN 2011 - 21st International Conference on Artificial Neural Networks, Proceedings, Part I (Lecture Notes in Computer Science, Vol. 6791). Springer, Espoo, Finland, 52--59.
[30]
Binh D. Nguyen, Thanh-Toan Do, Binh X. Nguyen, Tuong Do, Erman Tjiputra, and Quang D. Tran. 2019. Overcoming Data Limitation in Medical Visual Question Answering. In Medical Image Computing and Computer Assisted Intervention - MICCAI 2019 - 22nd International Conference, Part IV (Lecture Notes in Computer Science, Vol. 11767). Springer, Shenzhen, China, 522--530.
[31]
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems, NeurIPS. Vancouver, BC, Canada, 8024--8035.
[32]
Liang Peng, Yang Yang, Zheng Wang, Xiao Wu, and Zi Huang. 2019. CRA-Net: Composed Relation Attention Network for Visual Question Answering. In Proceedings of the 27th ACM International Conference on Multimedia, MM. ACM, Nice, France, 1202--1210.
[33]
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP, A meeting of SIGDAT, a Special Interest Group of the ACL. ACL, Doha, Qatar, 1532--1543.
[34]
Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron C. Courville. 2018. FiLM: Visual Reasoning with a General Conditioning Layer. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence. AAAI Press, New Orleans, Louisiana, USA, 3942--3951.
[35]
Maithra Raghu, Chiyuan Zhang, Jon M. Kleinberg, and Samy Bengio. 2019. Transfusion: Understanding Transfer Learning for Medical Imaging. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems, NeurIPS. Vancouver, BC, Canada, 3342--3352.
[36]
Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems, NeurIPS. Montreal, Quebec, Canada, 91--99.
[37]
Dinghan Shen, Guoyin Wang, Wenlin Wang, Martin Renqiang Min, Qinliang Su, Yizhe Zhang, Chunyuan Li, Ricardo Henao, and Lawrence Carin. 2018. Baseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL, Volume 1: Long Papers. Association for Computational Linguistics, Melbourne, Australia, 440--450.
[38]
Lei Shi, Feifan Liu, and Max P. Rosen. 2019. Deep Multimodal Learning for Medical Visual Question Answering. In Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum (CEUR Workshop Proceedings, Vol. 2380). CEUR-WS.org, Lugano, Switzerland.
[39]
Robik Shrestha, Kushal Kafle, and Christopher Kanan. 2019. Answer Them All! Toward Universal Visual Question Answering Models. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR. Computer Vision Foundation / IEEE, Long Beach, CA, USA, 10472--10481.
[40]
Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Highway Networks. arXiv e-prints (May 2015), arXiv:1505.00387.
[41]
Huijuan Xu and Kate Saenko. 2016. Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering. In Computer Vision - ECCV 2016 - 14th European Conference, Proceedings, Part VII (Lecture Notes in Computer Science, Vol. 9911). Springer, Amsterdam, The Netherlands, 451--466.
[42]
Xin Yan, Lin Li, Chulin Xie, Jun Xiao, and Lin Gu. 2019. Zhejiang University at ImageCLEF 2019 Visual Question Answering in the Medical Domain. In Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum (CEUR Workshop Proceedings, Vol. 2380). CEUR-WS.org, Lugano, Switzerland.
[43]
Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alexander J. Smola. 2016. Stacked Attention Networks for Image Question Answering. In 2016 IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, Las Vegas, NV, USA, 21--29.
[44]
Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B. Tenenbaum. 2020. CLEVRER: Collision Events for Video Representation and Reasoning. In 8th International Conference on Learning Representations, ICLR. OpenReview.net, Addis Ababa, Ethiopia.
[45]
Kexin Yi, Jiajun Wu, Chuang Gan, Antonio Torralba, Pushmeet Kohli, and Josh Tenenbaum. 2018. Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems, NeurIPS. Montréal, Canada, 1039--1050.
[46]
Zhou Yu, Jun Yu, Jianping Fan, and Dacheng Tao. 2017. Multi-modal Factorized Bilinear Pooling with Co-attention Learning for Visual Question Answering. In IEEE International Conference on Computer Vision, ICCV. IEEE Computer Society, Venice, Italy, 1839--1848.
[47]
Bolei Zhou, Yuandong Tian, Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus. 2015. Simple Baseline for Visual Question Answering. arXiv e-prints (Dec. 2015), arXiv:1512.02167.
[48]
Yangyang Zhou, Xin Kang, and Fuji Ren. 2018. Employing Inception-Resnet-v2 and Bi-LSTM for Medical Domain Visual Question Answering. In Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum (CEUR Workshop Proceedings, Vol. 2125). CEUR-WS.org, Avignon, France.



    Published In

    MM '20: Proceedings of the 28th ACM International Conference on Multimedia
    October 2020
    4889 pages
    ISBN:9781450379885
    DOI:10.1145/3394171
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. attention mechanism
    2. conditional reasoning
    3. medical visual question answering

    Qualifiers

    • Research-article

    Conference

    MM '20

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


    Cited By

    • Foundation Model for Advancing Healthcare: Challenges, Opportunities and Future Directions. IEEE Reviews in Biomedical Engineering 18 (2025), 172--191. DOI: 10.1109/RBME.2024.3496744
    • Consistency Conditioned Memory Augmented Dynamic Diagnosis Model for Medical Visual Question Answering. IEEE Journal of Biomedical and Health Informatics 29, 2 (Feb. 2025), 1357--1370. DOI: 10.1109/JBHI.2024.3492141
    • Answer Distillation Network With Bi-Text-Image Attention for Medical Visual Question Answering. IEEE Access 13 (2025), 16455--16465. DOI: 10.1109/ACCESS.2025.3532308
    • A Multimodal Biomedical Foundation Model Trained from Fifteen Million Image--Text Pairs. NEJM AI 2, 1 (Jan. 2025). DOI: 10.1056/AIoa2400640
    • VG-CALF: A vision-guided cross-attention and late-fusion network for radiology images in Medical Visual Question Answering. Neurocomputing 613 (Jan. 2025), 128730. DOI: 10.1016/j.neucom.2024.128730
    • UnICLAM: Contrastive representation learning with adversarial masking for unified and interpretable Medical Vision Question Answering. Medical Image Analysis 101 (Apr. 2025), 103464. DOI: 10.1016/j.media.2025.103464
    • Targeted Visual Prompting for Medical Visual Question Answering. In Applications of Medical Artificial Intelligence (Feb. 2025), 64--73. DOI: 10.1007/978-3-031-82007-6_7
    • Detecting any instruction-to-answer interaction relationship. In Proceedings of the 41st International Conference on Machine Learning (Jul. 2024), 53909--53927. DOI: 10.5555/3692070.3694281
    • Developing ChatGPT for biology and medicine: a complete review of biomedical question answering. Biophysics Reports (2024). DOI: 10.52601/bpr.2024.240004
    • Image to Label to Answer: An Efficient Framework for Enhanced Clinical Applications in Medical Visual Question Answering. Electronics 13, 12 (Jun. 2024), 2273. DOI: 10.3390/electronics13122273
