Abstract
Current Medical Image Visual Question Answering (Med-VQA) models often exploit language bias instead of learning multimodal features from both vision and language, and they consequently suffer from sparse data and poor performance. In this paper, we propose a new pre-trained multilevel fusion network based on vision-conditioned reasoning and bilinear attention for Med-VQA (VB-MVQA). To compensate for scarce vision data, we first incorporate Contrastive Language-Image Pre-training (CLIP) and attention mechanisms to extract medical image features effectively. The proposed VB-MVQA model then applies multiple stacked attention layers and a Bilinear Attention Network (BAN) to fuse the extracted image features with question features extracted by a Bidirectional Long Short-Term Memory (Bi-LSTM) network. On this basis, VB-MVQA introduces vision-conditioned reasoning to guide importance selection over the fused multimodal features and further enhance image semantic information, thereby reducing language bias. Extensive experiments on three public benchmark datasets (VQA-RAD, SLAKE, and VQA-Med-2019) show that the proposed model outperforms state-of-the-art models by average improvements of 11.08%, 5.28%, and 8.30%, respectively; it achieves notably higher accuracy than baseline models on open-ended questions and is more robust on language-biased Med-VQA datasets.
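The core fusion step the abstract describes (bilinear attention between image and question features) can be sketched as follows. This is a minimal single-glimpse illustration under our own simplified assumptions, not the paper's implementation: the learned matrices `W` and `P`, the feature dimensions, and the random inputs standing in for CLIP image features and Bi-LSTM token states are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bilinear_attention_fuse(V, Q, W, P):
    """One BAN-style glimpse: score every (image region, question
    token) pair with a bilinear form, normalize the scores into an
    attention map, then pool the paired features into one vector."""
    logits = V @ W @ Q.T                       # (n_v, n_q) pair scores
    A = softmax(logits.ravel()).reshape(logits.shape)
    # attention-weighted sum of elementwise products of paired features
    fused = sum(A[i, j] * (V[i] * Q[j])
                for i in range(V.shape[0])
                for j in range(Q.shape[0]))
    return fused @ P                           # project to joint space

n_v, n_q, d, d_out = 4, 3, 8, 5
V = rng.standard_normal((n_v, d))   # stand-in for CLIP image features
Q = rng.standard_normal((n_q, d))   # stand-in for Bi-LSTM token states
W = rng.standard_normal((d, d))     # bilinear interaction weights
P = rng.standard_normal((d, d_out)) # output projection
z = bilinear_attention_fuse(V, Q, W, P)
print(z.shape)  # (5,)
```

The actual BAN uses low-rank factorized bilinear forms and multiple glimpses for efficiency; the dense `W` here only conveys the idea of joint attention over region-token pairs.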
Data availability
All data generated or analyzed during this study are included in this published article.
Funding
This work is supported by the National Natural Science Foundation of China (62277008) and the Educational Informatization Project of Chongqing University of Posts and Telecommunications (xxhyf2022-08).
Author information
Authors and Affiliations
Contributions
LC helped in conceptualization, methodology, software, investigation, and writing and editing. HF helped in experiment, data processing, writing—original manuscript, visualization, and data curation. ZL helped in experiment, software, and validation. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and no professional or other personal interest of any nature in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, this manuscript.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Cai, L., Fang, H. & Li, Z. Pre-trained multilevel fuse network based on vision-conditioned reasoning and bilinear attentions for medical image visual question answering. J Supercomput 79, 13696–13723 (2023). https://doi.org/10.1007/s11227-023-05195-2