Abstract
The medical visual question answering task combines medical imaging and natural language processing to answer questions about medical images. Despite the progress made in the field, problems remain. Most current image encoders use a Transformer to extract features and pass only the final layer's output to subsequent processing. This approach ignores the complex semantic context between image and text and limits the model's ability to capture cross-modal semantics. To address this limitation and further explore the semantic interactions between images and text, this paper designs a Contextual Interactive Attention Connection module. The module utilises both deep and shallow feature representations from the encoder and applies a variant of the attention mechanism to enable deep interaction between image and text features, greatly improving semantic consistency and overall performance on medical visual question answering tasks. Furthermore, accurate answers to specialised medical questions often depend on rich prior medical knowledge, yet integrating such knowledge into a question answering system is very costly in human and financial terms. To address this problem, this paper proposes a learnable matrix assistance module that uses a learnable matrix to supply this knowledge to the model. Experiments on two datasets, VQA-RAD and SLAKE, show that the proposed model outperforms other state-of-the-art models.
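The abstract describes two components: a cross-attention connection that lets text features interact with both shallow and deep image features, and a learnable matrix that stands in for external medical prior knowledge. The following is a minimal PyTorch sketch of how these two ideas could be realised; the class names, feature dimensions, and wiring are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch, assuming PyTorch and hypothetical shapes/names.
import torch
import torch.nn as nn

class ContextualInteractiveAttention(nn.Module):
    """Fuse shallow and deep image encoder features with text features
    via cross-attention (an assumed reading of the paper's module)."""
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Text queries attend separately over shallow and deep image features.
        self.shallow_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.deep_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text, img_shallow, img_deep):
        # text: (B, Lt, D); img_shallow, img_deep: (B, Lv, D)
        s, _ = self.shallow_attn(text, img_shallow, img_shallow)
        d, _ = self.deep_attn(text, img_deep, img_deep)
        # Residual fusion of the two attended views back into the text stream.
        return self.norm(self.fuse(torch.cat([s, d], dim=-1)) + text)

class LearnableMatrixAssist(nn.Module):
    """A learnable matrix standing in for costly external medical prior
    knowledge: fused features attend over K learnable 'knowledge' vectors."""
    def __init__(self, dim: int = 768, num_entries: int = 64):
        super().__init__()
        self.knowledge = nn.Parameter(torch.randn(num_entries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, fused):
        # fused: (B, L, D); broadcast the knowledge matrix across the batch.
        k = self.knowledge.unsqueeze(0).expand(fused.size(0), -1, -1)
        out, _ = self.attn(fused, k, k)
        return out + fused

# Example wiring (hypothetical sizes):
# text = torch.randn(2, 20, 768)
# shallow, deep = torch.randn(2, 49, 768), torch.randn(2, 49, 768)
# fused = ContextualInteractiveAttention()(text, shallow, deep)
# out = LearnableMatrixAssist()(fused)   # (2, 20, 768)
```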
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grant No. 62072135.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Gong, C., Pan, H., Lan, H., Zhang, K., He, S., Jia, X. (2025). Contextual Feature-Based Medical Visual Question Answering Aided by Learnable Matrix. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15034. Springer, Singapore. https://doi.org/10.1007/978-981-97-8505-6_1