Medical Visual Question Answering via Targeted Choice Contrast and Multimodal Entity Matching

  • Conference paper
Neural Information Processing (ICONIP 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13624)


Abstract

Although current methods have advanced the medical visual question answering (Med-VQA) task, two aspects remain to be improved: extracting high-level medical visual features from small-scale data and exploiting external knowledge. To strengthen Med-VQA performance, we propose a pre-training model called Targeted Choice Contrast (TCC) and a Multimodal Entity Matching (MEM) module, and integrate them into an end-to-end framework. Specifically, the TCC model extracts deep visual features from the small-scale medical dataset by contrastive learning, and it improves model robustness through a targeted selection of negative samples. The MEM module embeds knowledge representations into the framework more accurately. In addition, we apply a mixup strategy for data augmentation during framework training to make full use of the small-scale image data. Experimental results demonstrate that our framework outperforms state-of-the-art methods.
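
To make the two generic ingredients named above concrete, the sketch below shows (a) an InfoNCE-style contrastive loss in which negative pairs are reweighted toward harder (more similar) examples, one plausible reading of "targeted selection of negative samples", and (b) a standard mixup augmentation step. This is a minimal PyTorch sketch under those assumptions, not the authors' TCC or MEM implementation; the function names, the beta weighting scheme, and all hyperparameter values are illustrative.

import torch
import torch.nn.functional as F


def targeted_contrastive_loss(z1, z2, temperature=0.1, beta=1.0):
    """InfoNCE-style loss over paired embeddings z1, z2 of shape (n, d).

    Off-diagonal pairs serve as negatives and are reweighted so that
    harder (more similar) negatives dominate the denominator;
    beta = 0 recovers the plain InfoNCE loss.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    n = z1.size(0)
    sim = z1 @ z2.t() / temperature      # (n, n) cosine similarities
    pos = sim.diag()                     # matching pairs on the diagonal
    mask = ~torch.eye(n, dtype=torch.bool, device=sim.device)
    neg = sim[mask].view(n, n - 1)       # off-diagonal negatives
    # Importance weights grow with similarity, so hard negatives count
    # more; the factor (n - 1) keeps the scale of the unweighted loss.
    w = ((n - 1) * torch.softmax(beta * neg, dim=1)).detach()
    denom = pos.exp() + (w * neg.exp()).sum(dim=1)
    return (denom.log() - pos).mean()


def mixup(images, labels, alpha=0.2):
    """Mixup: build convex combinations of random example pairs."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    idx = torch.randperm(images.size(0))
    mixed = lam * images + (1.0 - lam) * images[idx]
    return mixed, labels, labels[idx], lam

A training step would then score the mixed batch against both label sets, e.g. loss = lam * criterion(logits, y_a) + (1 - lam) * criterion(logits, y_b).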

L. Liu—Equal technical contribution.



Acknowledgements

This work was funded by the National Natural Science Foundation of China (Grant No. 61762069), the Key Technology Research Program of Inner Mongolia Autonomous Region (Grant No. 2021GG0165), the Key R&D and Achievement Transformation Program of Inner Mongolia Autonomous Region (Grant No. 2022YFHH0077), and the Big Data Lab of Inner Mongolia Discipline Inspection and Supervision Committee (Grant No. 21500–5206043).

Author information


Corresponding author

Correspondence to Xiangdong Su.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Guo, H., Liu, L., Su, X., Zhang, H. (2023). Medical Visual Question Answering via Targeted Choice Contrast and Multimodal Entity Matching. In: Tanveer, M., Agarwal, S., Ozawa, S., Ekbal, A., Jatowt, A. (eds) Neural Information Processing. ICONIP 2022. Lecture Notes in Computer Science, vol 13624. Springer, Cham. https://doi.org/10.1007/978-3-031-30108-7_29

  • DOI: https://doi.org/10.1007/978-3-031-30108-7_29

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-30107-0

  • Online ISBN: 978-3-031-30108-7

  • eBook Packages: Computer Science (R0)
