DOI: 10.1145/3581783.3611898

Chain-of-Look Prompting for Verb-centric Surgical Triplet Recognition in Endoscopic Videos

Published: 27 October 2023

Abstract

Surgical triplet recognition aims to recognize surgical activities as triplets (i.e., <instrument, verb, target>), which provide the fine-grained information essential for surgical scene understanding. Existing approaches rely on compositional methods that recognize the instrument, verb, and target simultaneously. In contrast, our method, called chain-of-look prompting, casts surgical triplet recognition as visual prompt generation from large-scale vision-language (VL) models and explicitly decomposes the task into a series of video reasoning processes. Chain-of-look prompting is inspired by (1) chain-of-thought prompting in natural language processing, which divides a problem into a sequence of intermediate reasoning steps, and (2) the inter-dependency between motion and visual appearance in the human visual system. Since surgical activities are conveyed by the actions of physicians, we regard verbs as the carriers of semantics in surgical endoscopic videos. Additionally, we utilize the BioMedLM biomedical language model to calibrate the generated visual prompt features for surgical scenarios. Our approach captures the visual reasoning processes underlying surgical activities and outperforms state-of-the-art methods on CholecT50, the largest surgical triplet recognition dataset. The code is available at https://github.com/southnx/CoLSurgical.
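
To make the verb-centric decomposition concrete, the sketch below shows one plausible way such a chain-of-look pipeline could be staged in PyTorch: per-frame visual features are pooled into a clip-level feature, a verb is predicted first, and the (soft) verb prompt embedding is calibrated and used to condition the instrument and target heads. This is a minimal illustration of the staging idea only; every module name, shape, and the soft-prompt conditioning scheme are assumptions, not the authors' released implementation (see the linked repository for that). The class counts do match CholecT50's 6 instruments, 10 verbs, and 15 targets.

    import torch
    import torch.nn as nn

    class ChainOfLookSketch(nn.Module):
        """Hypothetical staging of verb-first surgical triplet recognition."""
        def __init__(self, feat_dim=512, n_verbs=10, n_instruments=6, n_targets=15):
            super().__init__()
            # Stand-in projection for a frozen VL (CLIP-style) visual encoder.
            self.visual_proj = nn.Linear(2048, feat_dim)
            # Temporal model over frames; motion cues drive the verb prediction.
            self.temporal = nn.GRU(feat_dim, feat_dim, batch_first=True)
            # Step 1: predict the verb first -- verbs carry the semantics.
            self.verb_head = nn.Linear(feat_dim, n_verbs)
            # Learnable verb prompt embeddings, one per verb class.
            self.verb_prompts = nn.Embedding(n_verbs, feat_dim)
            # Stand-in for calibrating prompt features with a biomedical LM.
            self.calibrate = nn.Linear(feat_dim, feat_dim)
            # Steps 2-3: instrument/target heads conditioned on the verb prompt.
            self.instrument_head = nn.Linear(2 * feat_dim, n_instruments)
            self.target_head = nn.Linear(2 * feat_dim, n_targets)

        def forward(self, frame_feats):              # (B, T, 2048) per-frame features
            x = self.visual_proj(frame_feats)        # (B, T, D)
            _, h = self.temporal(x)                  # h: (1, B, D)
            clip_feat = h[-1]                        # (B, D) clip-level feature
            verb_logits = self.verb_head(clip_feat)
            # Soft verb prompt: expected prompt embedding under the verb posterior.
            prompt = verb_logits.softmax(-1) @ self.verb_prompts.weight
            prompt = self.calibrate(prompt)          # biomedical calibration (sketch)
            cond = torch.cat([clip_feat, prompt], dim=-1)
            return verb_logits, self.instrument_head(cond), self.target_head(cond)

    model = ChainOfLookSketch()
    feats = torch.randn(2, 8, 2048)                  # 2 clips of 8 frames each
    verb, inst, tgt = model(feats)                   # (2, 10), (2, 6), (2, 15)

Staged this way, the verb posterior directly shapes the prompts seen by the instrument and target heads, which reflects the abstract's view of verbs as the carriers of semantics rather than a co-equal output.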





    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783


    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. chain-of-look prompting
    2. endoscopic videos
    3. surgical triplet recognition
    4. verb-centric

    Qualifiers

    • Research-article

    Funding Sources

    • National Science Foundation and the Institute of Education Sciences, U.S. Department of Education

    Conference

    MM '23
    Sponsor:
    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


    Bibliometrics & Citations

    Article Metrics

    • Downloads (last 12 months): 177
    • Downloads (last 6 weeks): 13
    Reflects downloads up to 20 Feb 2025


    Cited By

    • Surgical video workflow analysis via visual-language learning. npj Health Systems 2(1), 2025. https://doi.org/10.1038/s44401-024-00010-3
    • Deep learning for surgical workflow analysis: a survey of progresses, limitations, and trends. Artificial Intelligence Review 57(11), 2024. https://doi.org/10.1007/s10462-024-10929-6
    • Interaction-Centric Spatio-Temporal Context Reasoning for Multi-person Video HOI Recognition. Computer Vision – ECCV 2024, 419–435. https://doi.org/10.1007/978-3-031-73411-3_24
    • Tail-Enhanced Representation Learning for Surgical Triplet Recognition. Medical Image Computing and Computer Assisted Intervention – MICCAI 2024, 689–699. https://doi.org/10.1007/978-3-031-72120-5_64
