ABSTRACT
Medical visual question answering (MedVQA) aims to answer medical questions posed about a related medical image. Although the technology has vast potential in the medical domain, it remains difficult to apply in practice and has not been widely adopted, because accurate answer prediction requires a refined understanding of both the medical image and the question text. Existing methods fuse the whole image with the whole question to predict the answer. For a given question, however, the important information lies in only a small region of the image and a few critical words of the question, and the remaining information may interfere with answer prediction. To this end, we introduce an effective multi-modal co-attention network (MMCN) that learns the essential words in the question and the essential regions in the image. Each word and region is scored by an attention-weighting method, and the score indicates its importance during model reasoning. Experimental comparisons show that our MMCN outperforms state-of-the-art methods on the public VQA-RAD dataset.
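The attention-weighting idea described above can be sketched in a few lines: each word (or image region) feature is scored against a pooled query from the other modality, and a softmax turns the scores into importance weights. This is a minimal illustrative sketch, not the paper's actual MMCN architecture; the function names, dimensions, and scaled dot-product scoring are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_weights(features, query):
    """Score each feature (word or region) against a query vector.

    features: (n, d) array, one row per question word or image region.
    query:    (d,) array, a pooled representation of the other modality.
    Returns a length-n weight vector summing to 1; a higher weight marks
    a word or region as more important for answer prediction.
    """
    # Scaled dot-product scoring (an assumption; MMCN's exact scoring
    # function is defined in the paper, not here).
    scores = features @ query / np.sqrt(features.shape[1])
    return softmax(scores)

# Toy example: 4 question-word embeddings attended by an image-side query.
rng = np.random.default_rng(0)
words = rng.normal(size=(4, 3))
img_query = rng.normal(size=3)
w = attention_weights(words, img_query)
print(w, w.sum())  # 4 non-negative weights that sum to 1
```

The same routine applies symmetrically: region features scored against a question-side query yield per-region weights, which is the co-attention pattern the abstract describes.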