LLaVA-Endo: a large language-and-vision assistant for gastrointestinal endoscopy

  • Letter
  • Frontiers of Computer Science

Conclusions

We introduce LLaVA-Endo, a large language-and-vision model designed for the field of GI endoscopy. Specifically, we generate a high-quality dataset for GI endoscopic medical language-image instruction tuning and introduce an innovative progressive transfer learning technique to fine-tune LLaVA. Experimental results show that LLaVA-Endo demonstrates strong domain expertise and conversational capabilities, outperforming previous SoTA multimodal methods on GI endoscopy data. In the future, we intend to collect more data for training and evaluation and to integrate additional functionalities such as report generation and polyp segmentation.
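The letter summarizes the approach at a high level and gives no implementation details for the progressive transfer learning procedure. The snippet below is therefore only a minimal, hypothetical sketch of the general idea: staged fine-tuning in which a vision-to-language projector is aligned first, and low-rank (LoRA-style, reference 8) adapters inside the otherwise frozen language model are tuned afterwards on instruction data. All module names, dimensions, and the exact two-stage split are assumptions for illustration, not the authors' code.

```python
# Hypothetical sketch of two-stage ("progressive") fine-tuning for a
# LLaVA-style model; placeholder modules stand in for the real networks.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (W x + B A x)."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False               # keep pretrained weights frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + x @ self.A.t() @ self.B.t()

class ToyEndoVLM(nn.Module):
    """Stand-in for a LLaVA-style model: vision encoder -> projector -> language model."""
    def __init__(self, vis_dim=64, lm_dim=128, vocab=1000):
        super().__init__()
        self.vision_encoder = nn.Linear(vis_dim, vis_dim)  # placeholder for a ViT
        self.projector = nn.Linear(vis_dim, lm_dim)        # trained in stage 1
        self.lm_block = nn.Linear(lm_dim, lm_dim)          # placeholder for LM layers
        self.lm_head = nn.Linear(lm_dim, vocab)

    def forward(self, image_feats):
        h = self.projector(self.vision_encoder(image_feats))
        return self.lm_head(self.lm_block(h))

def train_stage(model, trainable_params, steps=3):
    """Optimise only the given parameters; everything else stays frozen."""
    opt = torch.optim.AdamW(list(trainable_params), lr=1e-4)
    for _ in range(steps):
        feats = torch.randn(4, 64)                 # dummy endoscopy image features
        labels = torch.randint(0, 1000, (4,))      # dummy next-token labels
        loss = nn.functional.cross_entropy(model(feats), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()

model = ToyEndoVLM()
for p in model.parameters():
    p.requires_grad = False                        # start from a fully frozen backbone

# Stage 1: align modalities on image-text pairs by training only the projector.
for p in model.projector.parameters():
    p.requires_grad = True
train_stage(model, model.projector.parameters())

# Stage 2: instruction tuning; wrap the LM block with a LoRA-style adapter and
# train only the low-rank matrices on conversational instruction data.
for p in model.projector.parameters():
    p.requires_grad = False
model.lm_block = LoRALinear(model.lm_block)
train_stage(model, [model.lm_block.A, model.lm_block.B])
```

Freezing the pretrained backbone and training only the projector, then only the low-rank adapters, keeps each stage cheap and limits the risk of catastrophic forgetting when adapting a general-purpose model to the narrow GI endoscopy domain.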

References

  1. Forrest J H, Finlayson N D C, Shearman D J C. Endoscopy in gastrointestinal bleeding. The Lancet, 1974, 304(7877): 394–397

  2. Sharma P, Pante A, Gross S A. Artificial intelligence in endoscopy. Gastrointestinal Endoscopy, 2020, 91(4): 925–931

  3. Liu H, Li C, Wu Q, Lee Y J. Visual instruction tuning. In: Proceedings of the 37th Conference on Neural Information Processing Systems. 2023, 36

  4. Li J, Li D, Xiong C, Hoi S. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Proceedings of the 39th International Conference on Machine Learning. 2022, 12888–12900

  5. Ye Q, Xu H, Ye J, Yan M, Hu A, Liu H, Qian Q, Zhang J, Huang F, Zhou J. mPLUG-Owl2: revolutionizing multi-modal large language model with modality collaboration. 2023, arXiv preprint arXiv: 2311.04257

  6. Wu C, Zhang X, Zhang Y, Wang Y, Xie W. Towards generalist foundation model for radiology by leveraging web-scale 2D&3D medical data. 2023, arXiv preprint arXiv: 2308.02463

  7. Li C, Wong C, Zhang S, Usuyama N, Liu H, Yang J, Naumann T, Poon H, Gao J. LLaVA-Med: training a large language-and-vision assistant for biomedicine in one day. In: Proceedings of the 37th Conference on Neural Information Processing Systems. 2023, 36

  8. Hu E J, Shen Y, Wallis P, Allen-Zhu Z, Li Y, Wang S, Wang L, Chen W. LoRA: low-rank adaptation of large language models. In: Proceedings of the 10th International Conference on Learning Representations. 2022

  9. Mu Y, Zhang Q, Hu M, Wang W, Ding M, Jin J, Wang B, Dai J, Qiao Y, Luo P. EmbodiedGPT: vision-language pre-training via embodied chain of thought. In: Proceedings of the 37th Conference on Neural Information Processing Systems. 2023, 36

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 62272468, 62003256, 62027813, U1801265, 62293543, 62322605, 62036005, 62202015, and U21B2048), the Key-Area Research and Development Program of Shaanxi Province (2023-ZDLSF-41), Anhui Medical University (2022xkj105, 2023cy021), the Anhui Provincial Key R&D Program (2023s07020001), and the University Synergy Innovation Program of Anhui Province (GXXT-2022-52).

Author information

Corresponding authors

Correspondence to Longfei Han, Yiwen Jia or Nian Liu.

Ethics declarations

Competing interests: The authors declare that they have no competing interests or financial conflicts to disclose.

About this article

Cite this article

Yao, J., Li, X., Xie, Q. et al. LLaVA-Endo: a large language-and-vision assistant for gastrointestinal endoscopy. Front. Comput. Sci. 19, 194331 (2025). https://doi.org/10.1007/s11704-024-40319-8
