Abstract
The medical visual question answering task combines medical imaging and natural language processing to answer questions about medical images. Despite the progress made in the field, problems remain. Most current image encoders use a Transformer to extract features and pass only the final layer's output to subsequent processing. This approach ignores the complex semantic context between image and text and limits the model's ability to capture cross-modal semantics. To address this limitation and further explore the semantic interactions between images and text, this paper designs a Contextual Interactive Attention Connection module. The module utilises both deep and shallow feature representations from the encoder and applies a variant of the attention mechanism to enable deep interaction between image and text features, greatly improving semantic consistency and overall performance on medical visual question answering tasks. Furthermore, accurate answers to specialised medical questions often depend on rich prior medical knowledge, yet integrating such knowledge into a question answering system is very costly in human and financial terms. To address this problem, this paper proposes a learnable matrix assistance module that uses a learnable matrix to supply this knowledge to the model. Experiments on two datasets, VQA-RAD and SLAKE, show that the proposed model outperforms other state-of-the-art models.
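The abstract describes two components: a cross-attention connection that lets text features interact with both shallow and deep image features, and a learnable matrix that stands in for external medical prior knowledge. The following is a minimal PyTorch sketch of how these two ideas could be realised; the class names, feature dimensions, and wiring are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch, assuming PyTorch and hypothetical shapes/names.
import torch
import torch.nn as nn

class ContextualInteractiveAttention(nn.Module):
    """Fuse shallow and deep image encoder features with text features
    via cross-attention (an assumed reading of the paper's module)."""
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Text queries attend separately over shallow and deep image features.
        self.shallow_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.deep_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text, img_shallow, img_deep):
        # text: (B, Lt, D); img_shallow, img_deep: (B, Lv, D)
        s, _ = self.shallow_attn(text, img_shallow, img_shallow)
        d, _ = self.deep_attn(text, img_deep, img_deep)
        # Residual fusion of the two attended views back into the text stream.
        return self.norm(self.fuse(torch.cat([s, d], dim=-1)) + text)

class LearnableMatrixAssist(nn.Module):
    """A learnable matrix standing in for costly external medical prior
    knowledge: fused features attend over K learnable 'knowledge' vectors."""
    def __init__(self, dim: int = 768, num_entries: int = 64):
        super().__init__()
        self.knowledge = nn.Parameter(torch.randn(num_entries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, fused):
        # fused: (B, L, D); broadcast the knowledge matrix across the batch.
        k = self.knowledge.unsqueeze(0).expand(fused.size(0), -1, -1)
        out, _ = self.attn(fused, k, k)
        return out + fused

# Example wiring (hypothetical sizes):
# text = torch.randn(2, 20, 768)
# shallow, deep = torch.randn(2, 49, 768), torch.randn(2, 49, 768)
# fused = ContextualInteractiveAttention()(text, shallow, deep)
# out = LearnableMatrixAssist()(fused)   # (2, 20, 768)
```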
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grant No. 62072135.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Gong, C., Pan, H., Lan, H., Zhang, K., He, S., Jia, X. (2025). Contextual Feature-Based Medical Visual Question Answering Aided by Learnable Matrix. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15034. Springer, Singapore. https://doi.org/10.1007/978-981-97-8505-6_1