Conclusions
We introduce LLaVA-Endo, a large language-and-vision model designed for GI endoscopy. Specifically, we construct a high-quality dataset for GI endoscopic medical language-image instruction tuning and propose a progressive transfer learning technique to fine-tune LLaVA. Experimental results show that LLaVA-Endo exhibits strong domain expertise and conversational capability, outperforming previous state-of-the-art multimodal methods on GI endoscopy data. In the future, we intend to collect more data for training and evaluation, and to integrate additional functionalities such as report generation and polyp segmentation.
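The progressive fine-tuning described above belongs to the family of parameter-efficient adaptation methods for large vision-language models. As a minimal NumPy sketch of one widely used such method, low-rank adaptation (LoRA), the layer below keeps a pretrained weight frozen and trains only a low-rank update; the class, dimensions, and initialization here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

class LoRALinear:
    """Illustrative frozen linear layer with a trainable low-rank update.

    Forward pass: y = x @ (W + (alpha / r) * A @ B), where W (d_in x d_out)
    is frozen and only A (d_in x r) and B (r x d_out) would be trained.
    """

    def __init__(self, d_in, d_out, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_in, d_out))     # frozen pretrained weight
        self.A = rng.standard_normal((d_in, r)) * 0.01  # low-rank factor A
        self.B = np.zeros((r, d_out))                   # B starts at zero, so the
        self.scale = alpha / r                          # update is a no-op at init

    def forward(self, x):
        # Base projection plus the scaled low-rank correction
        return x @ (self.W + self.scale * (self.A @ self.B))

layer = LoRALinear(d_in=16, d_out=8)
x = np.ones((1, 16))
y0 = layer.forward(x)   # identical to the frozen base output at initialization
layer.B += 0.1          # stand-in for a gradient step on the trainable factor
y1 = layer.forward(x)   # now differs from the base output
```

Because B is zero-initialized, adaptation starts exactly at the pretrained model and only the small A and B matrices accumulate task-specific changes, which is what makes staged (progressive) fine-tuning of a large backbone tractable.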
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 62272468, 62003256, 62027813, U1801265, 62293543, 62322605, 62036005, 62202015, and U21B2048), the Key-Area Research and Development Program of Shaanxi Province (2023-ZDLSF-41), the Anhui Medical University (2022xkj105, 2023cy021), the Anhui Provincial Key R&D Program (2023s07020001), and the University Synergy Innovation Program of Anhui Province (GXXT-2022-52).
Ethics declarations
Competing interests
The authors declare that they have no competing interests or financial conflicts to disclose.
Yao, J., Li, X., Xie, Q. et al. LLaVA-Endo: a large language-and-vision assistant for gastrointestinal endoscopy. Front. Comput. Sci. 19, 194331 (2025). https://doi.org/10.1007/s11704-024-40319-8