Conclusions
We introduce LLaVA-Endo, a large language-and-vision model designed for GI endoscopy. Specifically, we construct a high-quality dataset for GI endoscopic medical language-image instruction tuning and propose a progressive transfer learning technique to fine-tune LLaVA. Experimental results show that LLaVA-Endo exhibits strong domain expertise and conversational capability, outperforming previous state-of-the-art multimodal methods on GI endoscopy data. In the future, we intend to collect more data for training and evaluation, and to integrate additional functionalities such as report generation and polyp segmentation.
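The progressive fine-tuning described above belongs to the family of parameter-efficient adaptation methods for large vision-language models. As a minimal NumPy sketch of one widely used such method, low-rank adaptation (LoRA), the layer below keeps a pretrained weight frozen and trains only a low-rank update; the class, dimensions, and initialization here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

class LoRALinear:
    """Illustrative frozen linear layer with a trainable low-rank update.

    Forward pass: y = x @ (W + (alpha / r) * A @ B), where W (d_in x d_out)
    is frozen and only A (d_in x r) and B (r x d_out) would be trained.
    """

    def __init__(self, d_in, d_out, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_in, d_out))     # frozen pretrained weight
        self.A = rng.standard_normal((d_in, r)) * 0.01  # low-rank factor A
        self.B = np.zeros((r, d_out))                   # B starts at zero, so the
        self.scale = alpha / r                          # update is a no-op at init

    def forward(self, x):
        # Base projection plus the scaled low-rank correction
        return x @ (self.W + self.scale * (self.A @ self.B))

layer = LoRALinear(d_in=16, d_out=8)
x = np.ones((1, 16))
y0 = layer.forward(x)   # identical to the frozen base output at initialization
layer.B += 0.1          # stand-in for a gradient step on the trainable factor
y1 = layer.forward(x)   # now differs from the base output
```

Because B is zero-initialized, adaptation starts exactly at the pretrained model and only the small A and B matrices accumulate task-specific changes, which is what makes staged (progressive) fine-tuning of a large backbone tractable.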
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 62272468, 62003256, 62027813, U1801265, 62293543, 62322605, 62036005, 62202015, and U21B2048), the Key-Area Research and Development Program of Shaanxi Province (2023-ZDLSF-41), the Anhui Medical University (2022xkj105, 2023cy021), the Anhui Provincial Key R&D Program (2023s07020001), and the University Synergy Innovation Program of Anhui Province (GXXT-2022-52).
Ethics declarations
Competing interests
The authors declare that they have no competing interests or financial conflicts to disclose.
Yao, J., Li, X., Xie, Q. et al. LLaVA-Endo: a large language-and-vision assistant for gastrointestinal endoscopy. Front. Comput. Sci. 19, 194331 (2025). https://doi.org/10.1007/s11704-024-40319-8