Multi-modal multi-head self-attention for medical VQA

Multimedia Tools and Applications

Abstract

Medical Visual Question Answering (MedVQA) systems answer questions about radiology images. Medical images are more complex than general images: they have low contrast, are often very similar to one another, and their differences can be understood only by medical practitioners, whereas general images are of high quality and their differences are easily spotted by anyone. Methods used for general-domain Visual Question Answering (VQA) systems therefore cannot be applied directly. The performance of MedVQA systems depends mainly on the method used to combine the features of the two input modalities: the medical image and the question. In this work, we propose an architecturally simple fusion strategy that uses multi-head self-attention to combine medical images and questions of the VQA-Med dataset of the ImageCLEF 2019 challenge. The model captures long-range dependencies between the input modalities using the attention mechanism of the Transformer. We show experimentally that the representational power of the model improves as the length of the embeddings used in the Transformer increases. We achieve an overall accuracy of 60.0%, an improvement of 1.35% over the existing model. We also perform an ablation study to elucidate the importance of each model component.
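As a rough illustration of how a single multi-head self-attention layer can fuse the two modalities, the sketch below projects image and question features into a shared embedding space, concatenates them into one token sequence, applies self-attention so that tokens of either modality can attend to tokens of the other, and pools the result for answer classification. This is a minimal PyTorch sketch, not the exact published model; the module name SelfAttentionFusion, the feature dimensions, the mean pooling, and the answer-vocabulary size are illustrative assumptions.

```python
# Minimal sketch of multi-head self-attention fusion for MedVQA.
# All names and dimensions are illustrative assumptions, not the authors' exact model.
import torch
import torch.nn as nn

class SelfAttentionFusion(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, embed_dim=512,
                 num_heads=8, num_answers=500):
        super().__init__()
        # Project both modalities into a shared embedding space.
        self.img_proj = nn.Linear(img_dim, embed_dim)
        self.txt_proj = nn.Linear(txt_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)
        self.classifier = nn.Linear(embed_dim, num_answers)

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, N_img, img_dim) regional image features from a CNN
        # txt_feats: (B, N_txt, txt_dim) question token embeddings from a language model
        tokens = torch.cat([self.img_proj(img_feats),
                            self.txt_proj(txt_feats)], dim=1)
        # Self-attention over the joint sequence lets every image token attend to
        # every question token and vice versa (long-range, cross-modal dependencies).
        fused, _ = self.attn(tokens, tokens, tokens)
        fused = self.norm(fused + tokens)   # residual connection + layer norm
        pooled = fused.mean(dim=1)          # simple mean pooling over all tokens
        return self.classifier(pooled)      # answer classification logits

# Usage with dummy tensors:
model = SelfAttentionFusion()
img = torch.randn(2, 49, 2048)   # e.g., a flattened 7x7 CNN feature map
txt = torch.randn(2, 20, 768)    # e.g., BERT-style token embeddings
logits = model(img, txt)         # shape: (2, num_answers)
```

In this sketch, increasing embed_dim corresponds roughly to lengthening the embeddings used in the attention layer, the factor the abstract reports as improving the model's representational power.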


Data Availability

We used the VQA-Med-2019 and VQA-Med-2020 datasets for this task. The complete VQA-Med-2019 dataset is publicly available at https://github.com/abachaa/VQA-Med-2019, and VQA-Med-2020 is available at https://github.com/abachaa/VQA-Med-2020. Only the validation and test sets of VQA-Med-2020 are publicly available, so we used only those for training.

Notes

  1. https://medpix.nlm.nih.gov/


Funding

The authors did not receive support from any organization for the submitted work.

Author information

Corresponding author

Correspondence to Vasudha Joshi.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Joshi, V., Mitra, P. & Bose, S. Multi-modal multi-head self-attention for medical VQA. Multimed Tools Appl 83, 42585–42608 (2024). https://doi.org/10.1007/s11042-023-17162-3


  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11042-023-17162-3
