
MAMF: A Multi-Level Attention-Based Multimodal Fusion Model for Medical Visual Question Answering

  • Conference paper
  • International Conference on Neural Computing for Advanced Applications (NCAA 2023)
  • Part of the book series: Communications in Computer and Information Science (CCIS, volume 1870)

Abstract

Medical Visual Question Answering (VQA) aims to accurately answer clinical questions about medical images. Existing medical VQA models show great potential, but most of them ignore word-level fine-grained features, which help filter out irrelevant regions in medical images more precisely. We present a Multi-level Attention-based Multimodal Fusion model named MAMF that learns a multi-level multimodal semantic representation for medical VQA. First, we develop a Word-to-Image attention and a Sentence-to-Image attention to capture the correlations of word embeddings and the question feature with the image feature. In addition, we propose an attention alignment loss that adjusts the weights of image regions obtained from word embeddings and the question feature, emphasizing relevant regions and improving the quality of predicted answers. Results on the VQA-RAD and PathVQA datasets show that MAMF significantly outperforms state-of-the-art baselines.
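The multi-level attention described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the scaled dot-product scoring, the mean pooling over words, and the mean-squared-error form of the alignment loss are all assumptions made here for concreteness.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def word_to_image_attention(words, regions):
    """Word-level (fine-grained) attention over image regions.

    words:   (T, d) word embeddings for the question
    regions: (R, d) image region features
    Returns a weight per region, pooled (here: averaged) over words.
    """
    scores = words @ regions.T / np.sqrt(regions.shape[1])  # (T, R)
    attn = softmax(scores, axis=-1)                          # per-word weights
    return attn.mean(axis=0)                                 # (R,)

def sentence_to_image_attention(question, regions):
    """Sentence-level attention over image regions.

    question: (d,) sentence-level question feature
    regions:  (R, d) image region features
    """
    scores = regions @ question / np.sqrt(regions.shape[1])  # (R,)
    return softmax(scores)

def attention_alignment_loss(w2i, s2i):
    """Hypothetical alignment loss: penalize disagreement between the
    word-level and sentence-level region-weight distributions."""
    return float(np.mean((w2i - s2i) ** 2))
```

Both attention functions return a distribution over regions (weights summing to 1), and the alignment loss is zero exactly when the word-level and sentence-level attentions agree, which matches the stated goal of making the two levels emphasize the same relevant regions.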



Acknowledgements

This work is supported by a grant from the Humanities and Social Sciences Research Foundation of the Ministry of Education, "Intelligent Analysis and Evaluation of Learning Effect Based on Multi-Modal Data" (No. 21YJAZH072).


Correspondence to Tianyong Hao.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Long, S., Yang, Z., Li, Y., Qian, X., Zeng, K., Hao, T. (2023). MAMF: A Multi-Level Attention-Based Multimodal Fusion Model for Medical Visual Question Answering. In: Zhang, H., et al. International Conference on Neural Computing for Advanced Applications. NCAA 2023. Communications in Computer and Information Science, vol 1870. Springer, Singapore. https://doi.org/10.1007/978-981-99-5847-4_15


  • DOI: https://doi.org/10.1007/978-981-99-5847-4_15


  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-5846-7

  • Online ISBN: 978-981-99-5847-4

  • eBook Packages: Computer Science (R0)
