
PLMVQA: Applying Pseudo Labels for Medical Visual Question Answering with Limited Data

  • Conference paper
  • First Online:
Medical Image Computing and Computer Assisted Intervention – MICCAI 2023 Workshops (MICCAI 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14394)


Abstract

Unlike Visual Question Answering (VQA) in the general domain, Medical VQA is more challenging because large-scale labeled datasets are scarce. In addition, Medical VQA requires high interpretability when answering clinical questions: it should be clear which visual elements within the medical image, such as organs or abnormalities, are essential for answering a given question. To overcome these challenges, we propose a novel method based on the Vision Transformer (ViT) that reformulates Medical VQA as a multi-task learning problem. We first construct soft pseudo labels (logits) for selected essential visual elements from the limited annotations of an existing Medical VQA dataset. We then apply these pseudo labels in our proposed Medical VQA model, which predicts the answer and the pseudo labels simultaneously; this not only improves the model's performance but also provides better interpretability. Extensive experiments on two Medical VQA datasets demonstrate the effectiveness of the proposed method.
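The multi-task formulation described in the abstract can be illustrated with a minimal sketch. The snippet below assumes a multimodal encoder whose pooled output feeds two heads, one classifying the answer and one predicting the soft pseudo labels; the class and function names (`PseudoLabelVQA`, `multitask_loss`), the KL-divergence soft-label loss, the encoder interface, and the weight `lambda_pl` are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of a multi-task Medical VQA model that predicts the
# answer and soft pseudo labels at the same time. All names, the KL
# soft-label loss, and the loss weight are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PseudoLabelVQA(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_dim: int,
                 num_answers: int, num_visual_elements: int):
        super().__init__()
        self.encoder = encoder  # e.g., a ViT-based multimodal fusion encoder (assumed interface)
        self.answer_head = nn.Linear(hidden_dim, num_answers)
        self.pseudo_head = nn.Linear(hidden_dim, num_visual_elements)

    def forward(self, image: torch.Tensor, question: torch.Tensor):
        h = self.encoder(image, question)  # pooled multimodal feature, shape [B, hidden_dim]
        return self.answer_head(h), self.pseudo_head(h)

def multitask_loss(answer_logits, pseudo_logits, answer_ids,
                   soft_pseudo_labels, lambda_pl: float = 0.5):
    # Cross-entropy on the answer, plus a KL term pushing the pseudo-label
    # head toward the precomputed soft labels (stored as logits).
    ce = F.cross_entropy(answer_logits, answer_ids)
    kl = F.kl_div(F.log_softmax(pseudo_logits, dim=-1),
                  F.softmax(soft_pseudo_labels, dim=-1),
                  reduction="batchmean")
    return ce + lambda_pl * kl
```

In this reading, the pseudo-label head both regularizes training under limited data and exposes which visual elements the model considers essential, which matches the interpretability claim in the abstract.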



Author information

Correspondence to Qi Wu.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Yu, Z., Xie, Y., Xia, Y., Wu, Q. (2023). PLMVQA: Applying Pseudo Labels for Medical Visual Question Answering with Limited Data. In: Woo, J., et al. Medical Image Computing and Computer Assisted Intervention – MICCAI 2023 Workshops. MICCAI 2023. Lecture Notes in Computer Science, vol 14394. Springer, Cham. https://doi.org/10.1007/978-3-031-47425-5_32

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-47425-5_32

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-47424-8

  • Online ISBN: 978-3-031-47425-5

  • eBook Packages: Computer Science, Computer Science (R0)
