Contextual Feature-Based Medical Visual Question Answering Aided by Learnable Matrix

  • Conference paper
Pattern Recognition and Computer Vision (PRCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15034)

Abstract

Medical visual question answering combines medical imaging and natural language processing to answer questions about medical images. Despite progress in the field, problems remain. Most current image encoders use a transformer structure to extract features and pass only the output of the final layer on for further processing. This approach ignores the complex semantic context between image and text, which limits the model's ability to capture cross-modal semantics. To address this limitation and further explore the semantic interactions between images and text, this paper designs a Contextual Interactive Attention Connection module. The module uses both deep and shallow feature representations from the encoder and applies a variant of the attention mechanism to enable deep interaction between image and text features. This improves semantic consistency and overall performance in medical visual question answering tasks. Accurate answers to specialised medical questions often depend on rich prior medical knowledge, but integrating this knowledge effectively into a question answering system is very costly in human and financial resources. To address this problem, this paper proposes a learnable matrix assistance module, which uses a learnable matrix to assist the model. Experiments on two datasets, VQA-RAD and SLAKE, show that the proposed model outperforms other state-of-the-art models.
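
The idea of letting text features attend over both shallow and deep image features can be illustrated with a minimal sketch. This is not the paper's Contextual Interactive Attention Connection module, whose details the abstract does not give. It is plain scaled dot-product cross-attention, and it assumes that shallow and deep features are simply concatenated along the token axis. All names and toy values (`cross_attention`, `shallow_image`, `deep_image`, `text_tokens`) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attention(queries, context):
    """Scaled dot-product attention: each query attends over all context vectors.

    Keys and values are both the raw context vectors (no learned
    projections), which is a simplification made for this sketch.
    Returns (outputs, weights).
    """
    dim = len(context[0])
    outputs, weights = [], []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(dim) for k in context]
        w = softmax(scores)
        # Weighted sum of the context vectors, one output per query.
        out = [sum(w_i * v[j] for w_i, v in zip(w, context)) for j in range(dim)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Hypothetical toy features (dimension 2); real encoders would produce these.
shallow_image = [[1.0, 0.0], [0.0, 1.0]]  # patch features from an early layer
deep_image = [[0.5, 0.5], [1.0, 1.0]]     # patch features from the final layer
text_tokens = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

# Assumption: shallow and deep features are concatenated along the token
# axis so that each text token can attend over both levels at once.
context = shallow_image + deep_image
fused, attn = cross_attention(text_tokens, context)
```

Each row of `attn` is a probability distribution over the four image tokens (two shallow, two deep), so a text token can draw on either level of the encoder rather than on the final layer alone.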

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant No. 62072135.

Author information

Corresponding author

Correspondence to Haiyan Lan.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Gong, C., Pan, H., Lan, H., Zhang, K., He, S., Jia, X. (2025). Contextual Feature-Based Medical Visual Question Answering Aided by Learnable Matrix. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15034. Springer, Singapore. https://doi.org/10.1007/978-981-97-8505-6_1

  • DOI: https://doi.org/10.1007/978-981-97-8505-6_1

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-8504-9

  • Online ISBN: 978-981-97-8505-6

  • eBook Packages: Computer Science, Computer Science (R0)
