Abstract
Unlike Visual Question Answering (VQA) in the general domain, Medical VQA is more challenging due to the lack of large-scale labeled datasets. Medical VQA also demands high interpretability when answering clinical questions: it should be clear which visual elements within a medical image, such as organs or abnormalities, are essential to the answer. To overcome these challenges, we propose a novel method based on the Vision Transformer (ViT) that reformulates Medical VQA as a multi-task learning problem. We first construct soft pseudo labels (logits) for the selected essential visual elements from the limited annotations of an existing Medical VQA dataset. We then use these pseudo labels in our proposed model, which predicts the answer and the pseudo labels simultaneously; this not only improves performance but also yields better interpretability. Extensive experiments on two Medical VQA datasets demonstrate the effectiveness of the proposed method.
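As a rough illustration of the multi-task formulation described above (the abstract does not specify the exact architecture or loss), the sketch below pairs an answer-classification head with a pseudo-label head on a shared fused representation. The names `MultiTaskVQAHead` and `multi_task_loss`, the use of KL divergence for the soft pseudo-label term, and the weight `alpha` are all assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskVQAHead(nn.Module):
    """Hypothetical two-head module: one head classifies the answer,
    the other predicts soft pseudo labels for visual elements
    (e.g., organs or abnormalities)."""

    def __init__(self, hidden_dim: int, num_answers: int, num_visual_elements: int):
        super().__init__()
        self.answer_head = nn.Linear(hidden_dim, num_answers)
        self.element_head = nn.Linear(hidden_dim, num_visual_elements)

    def forward(self, fused_features: torch.Tensor):
        # fused_features: pooled multimodal representation, e.g. the
        # [CLS] token of a ViT-based vision-language encoder.
        return self.answer_head(fused_features), self.element_head(fused_features)

def multi_task_loss(answer_logits, element_logits, answer_target,
                    soft_pseudo_labels, alpha=0.5):
    # Cross-entropy for the answer, plus a soft-label term (KL divergence)
    # pushing element predictions toward the constructed pseudo labels.
    # alpha is an assumed task-balancing weight.
    ce = F.cross_entropy(answer_logits, answer_target)
    kl = F.kl_div(F.log_softmax(element_logits, dim=-1),
                  soft_pseudo_labels, reduction="batchmean")
    return ce + alpha * kl
```

Training both heads against one shared representation is what lets the pseudo-label supervision regularize the answer head under limited data; at inference time the element logits can also be inspected to see which visual elements the model considered relevant.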
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Yu, Z., Xie, Y., Xia, Y., Wu, Q. (2023). PLMVQA: Applying Pseudo Labels for Medical Visual Question Answering with Limited Data. In: Woo, J., et al. (eds.) Medical Image Computing and Computer Assisted Intervention – MICCAI 2023 Workshops. Lecture Notes in Computer Science, vol. 14394. Springer, Cham. https://doi.org/10.1007/978-3-031-47425-5_32
Print ISBN: 978-3-031-47424-8
Online ISBN: 978-3-031-47425-5