Abstract
Unlike Visual Question Answering (VQA) in the general domain, Medical VQA is more challenging due to the lack of large-scale labeled datasets. Medical VQA also demands high interpretability when answering clinical questions: it should be clear which visual elements within a medical image, such as organs or abnormalities, are essential to the answer. To overcome these challenges, we propose a novel method based on the Vision Transformer (ViT) that reformulates Medical VQA as a multi-task learning problem. We first construct soft pseudo labels (logits) for the selected essential visual elements from the limited annotations of an existing Medical VQA dataset. We then use these pseudo labels in our proposed model, which predicts the answer and the pseudo labels simultaneously; this not only improves performance but also yields better interpretability. Extensive experiments on two Medical VQA datasets demonstrate the effectiveness of the proposed method.
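As a rough illustration of the multi-task formulation described above (the abstract does not specify the exact architecture or loss), the sketch below pairs an answer-classification head with a pseudo-label head on a shared fused representation. The names `MultiTaskVQAHead` and `multi_task_loss`, the use of KL divergence for the soft pseudo-label term, and the weight `alpha` are all assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskVQAHead(nn.Module):
    """Hypothetical two-head module: one head classifies the answer,
    the other predicts soft pseudo labels for visual elements
    (e.g., organs or abnormalities)."""

    def __init__(self, hidden_dim: int, num_answers: int, num_visual_elements: int):
        super().__init__()
        self.answer_head = nn.Linear(hidden_dim, num_answers)
        self.element_head = nn.Linear(hidden_dim, num_visual_elements)

    def forward(self, fused_features: torch.Tensor):
        # fused_features: pooled multimodal representation, e.g. the
        # [CLS] token of a ViT-based vision-language encoder.
        return self.answer_head(fused_features), self.element_head(fused_features)

def multi_task_loss(answer_logits, element_logits, answer_target,
                    soft_pseudo_labels, alpha=0.5):
    # Cross-entropy for the answer, plus a soft-label term (KL divergence)
    # pushing element predictions toward the constructed pseudo labels.
    # alpha is an assumed task-balancing weight.
    ce = F.cross_entropy(answer_logits, answer_target)
    kl = F.kl_div(F.log_softmax(element_logits, dim=-1),
                  soft_pseudo_labels, reduction="batchmean")
    return ce + alpha * kl
```

Training both heads against one shared representation is what lets the pseudo-label supervision regularize the answer head under limited data; at inference time the element logits can also be inspected to see which visual elements the model considered relevant.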
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Yu, Z., Xie, Y., Xia, Y., Wu, Q. (2023). PLMVQA: Applying Pseudo Labels for Medical Visual Question Answering with Limited Data. In: Woo, J., et al. (eds.) Medical Image Computing and Computer Assisted Intervention – MICCAI 2023 Workshops. Lecture Notes in Computer Science, vol. 14394. Springer, Cham. https://doi.org/10.1007/978-3-031-47425-5_32
Print ISBN: 978-3-031-47424-8
Online ISBN: 978-3-031-47425-5