Abstract
Medical Visual Question Answering (VQA) aims to accurately answer clinical questions about medical images. Existing medical VQA models show great potential, but most of them ignore word-level fine-grained features, which help filter out irrelevant regions in medical images more precisely. We present a Multi-level Attention-based Multimodal Fusion model, named MAMF, that learns a multi-level multimodal semantic representation for medical VQA. First, we develop a Word-to-Image attention and a Sentence-to-Image attention to capture the correlations of the word embeddings and the question feature with the image feature. In addition, we propose an attention alignment loss that adjusts the weights of image regions obtained from the word embeddings and the question feature, emphasizing relevant regions to improve the quality of predicted answers. Results on the VQA-RAD and PathVQA datasets suggest that MAMF significantly outperforms related state-of-the-art baselines.
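The two attention levels and the alignment idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the scoring function or the form of the loss, so the scaled dot-product scoring, the mean-pooled question vector, and the KL-divergence alignment term below are all assumptions made for illustration, as are the function names and tensor sizes.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def word_to_image_attention(words, regions):
    # words: (T, d) word embeddings; regions: (R, d) image region features.
    # Returns (T, R): one distribution over regions per word.
    # Scaled dot-product scoring is an assumption, not taken from the paper.
    scores = words @ regions.T / np.sqrt(words.shape[-1])
    return softmax(scores, axis=-1)

def sentence_to_image_attention(question, regions):
    # question: (d,) sentence-level feature; regions: (R, d).
    # Returns (R,): one distribution over regions for the whole question.
    scores = regions @ question / np.sqrt(question.shape[-1])
    return softmax(scores, axis=-1)

def alignment_loss(word_attn, sent_attn, eps=1e-8):
    # Assumed form: KL(sentence-level || mean word-level attention).
    # The paper's actual alignment loss may differ.
    word_mean = word_attn.mean(axis=0)
    return float(np.sum(sent_attn * (np.log(sent_attn + eps) - np.log(word_mean + eps))))

rng = np.random.default_rng(0)
words = rng.normal(size=(6, 16))    # 6 question words, 16-dim (hypothetical sizes)
regions = rng.normal(size=(9, 16))  # 9 image regions, 16-dim
question = words.mean(axis=0)       # stand-in for a sentence encoder output

w2i = word_to_image_attention(words, regions)
s2i = sentence_to_image_attention(question, regions)
loss = alignment_loss(w2i, s2i)
print(w2i.shape, s2i.shape, loss)
```

Under these assumptions, minimizing the loss pulls the averaged word-level region weights toward the sentence-level ones, so the two levels agree on which regions matter for the question.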
Acknowledgements
The work is supported by grants from the Humanities and Social Sciences Research Foundation of the Ministry of Education, “Intelligent Analysis and Evaluation of Learning Effection Based on Multi-Modal Data” (No. 21YJAZH072).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Long, S., Yang, Z., Li, Y., Qian, X., Zeng, K., Hao, T. (2023). MAMF: A Multi-Level Attention-Based Multimodal Fusion Model for Medical Visual Question Answering. In: Zhang, H., et al. International Conference on Neural Computing for Advanced Applications. NCAA 2023. Communications in Computer and Information Science, vol 1870. Springer, Singapore. https://doi.org/10.1007/978-981-99-5847-4_15
DOI: https://doi.org/10.1007/978-981-99-5847-4_15
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-5846-7
Online ISBN: 978-981-99-5847-4
eBook Packages: Computer Science (R0)