DOI: 10.1145/3562007.3562008

MMCN: Multi-Modal Co-attention Network for Medical Visual Question Answering

Published: 12 October 2022

ABSTRACT

Medical visual question answering (MedVQA) aims to answer medical questions about an associated medical image. Although it holds great promise for the medical domain, the technology remains difficult to apply in practice and has not been widely adopted, because accurate answer prediction requires a fine-grained understanding of both the medical image and the question text. Existing methods feed the whole image and the whole question directly into multi-modal fusion to predict the answer. For a given question, however, the important information resides in only a small part of the image and a few critical words of the question, and the remaining information can interfere with answer prediction. To this end, we introduce an effective multi-modal co-attention network (MMCN) that learns the essential words in the question and the essential regions in the image. Each word and each region is scored by an attention-weighting method, and the scores indicate how important that word or region is during model reasoning. Experimental comparisons show that our MMCN outperforms state-of-the-art methods on the public VQA-RAD dataset.
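As a rough illustration of the attention-weighting idea described above, the sketch below assigns each question word and each image region a softmax-normalized importance score, pools the features with those scores, and classifies the fused representation into an answer. It is a minimal sketch, not the authors' MMCN implementation: the PyTorch framing, the layer sizes, the answer-vocabulary size, and the concatenation-based fusion are all illustrative assumptions.

    # Minimal sketch (assumed PyTorch) of per-word / per-region attention scoring.
    # Not the authors' code: dimensions, fusion, and answer count are hypothetical.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CoAttentionScorer(nn.Module):
        def __init__(self, dim=768, num_answers=458):  # num_answers: hypothetical vocabulary size
            super().__init__()
            self.word_score = nn.Linear(dim, 1)    # one importance score per question word
            self.region_score = nn.Linear(dim, 1)  # one importance score per image region
            self.fuse = nn.Linear(2 * dim, dim)    # simple concatenation-based fusion (assumption)
            self.classifier = nn.Linear(dim, num_answers)

        def forward(self, word_feats, region_feats):
            # word_feats:   (batch, num_words, dim), e.g. from an LSTM/Transformer text encoder
            # region_feats: (batch, num_regions, dim), e.g. from a CNN or Faster R-CNN backbone
            w_att = F.softmax(self.word_score(word_feats), dim=1)      # (B, W, 1) word importances
            r_att = F.softmax(self.region_score(region_feats), dim=1)  # (B, R, 1) region importances
            q = (w_att * word_feats).sum(dim=1)    # attended question vector (B, dim)
            v = (r_att * region_feats).sum(dim=1)  # attended image vector    (B, dim)
            h = torch.tanh(self.fuse(torch.cat([q, v], dim=-1)))
            return self.classifier(h), w_att.squeeze(-1), r_att.squeeze(-1)

    # Toy usage with random features: 12 words, 36 regions, batch of 2.
    model = CoAttentionScorer()
    logits, word_w, region_w = model(torch.randn(2, 12, 768), torch.randn(2, 36, 768))
    print(logits.shape, word_w.shape, region_w.shape)  # (2, 458), (2, 12), (2, 36)

The returned word and region weights play the role of the per-word and per-region importance scores the abstract refers to; a full co-attention design would additionally condition each modality's scores on the other modality, which this skeleton omits for brevity.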


  • Published in

    CCRIS '22: Proceedings of the 2022 3rd International Conference on Control, Robotics and Intelligent System
    August 2022, 253 pages
    ISBN: 9781450396851
    DOI: 10.1145/3562007

    Copyright © 2022 ACM


    Publisher: Association for Computing Machinery, New York, NY, United States
