DOI: 10.1145/3581783.3611898

Chain-of-Look Prompting for Verb-centric Surgical Triplet Recognition in Endoscopic Videos

Published: 27 October 2023

Abstract

Surgical triplet recognition aims to recognize surgical activities as triplets (i.e., <instrument, verb, target>), which provide the fine-grained information essential for surgical scene understanding. Existing approaches rely on compositional methods that recognize the instrument, verb, and target simultaneously. In contrast, our method, called chain-of-look prompting, casts surgical triplet recognition as visual prompt generation from large-scale vision-language (VL) models and explicitly decomposes the task into a series of video reasoning processes. Chain-of-look prompting is inspired by (1) chain-of-thought prompting in natural language processing, which divides a problem into a sequence of intermediate reasoning steps, and (2) the inter-dependency between motion and visual appearance in the human visual system. Since surgical activities are conveyed by the actions of physicians, we regard verbs as the carriers of semantics in surgical endoscopic videos. Additionally, we utilize the BioMedLM biomedical language model to calibrate the generated visual prompt features for surgical scenarios. Our approach captures the visual reasoning processes underlying surgical activities and outperforms state-of-the-art methods on CholecT50, the largest surgical triplet recognition dataset. The code is available at https://github.com/southnx/CoLSurgical.
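
To make the verb-centric decomposition concrete, the sketch below shows one plausible way such a chain-of-look pipeline could be staged in PyTorch: per-frame visual features are pooled into a clip-level feature, a verb is predicted first, and the (soft) verb prompt embedding is calibrated and used to condition the instrument and target heads. This is a minimal illustration of the staging idea only; every module name, shape, and the soft-prompt conditioning scheme are assumptions, not the authors' released implementation (see the linked repository for that). The class counts do match CholecT50's 6 instruments, 10 verbs, and 15 targets.

    import torch
    import torch.nn as nn

    class ChainOfLookSketch(nn.Module):
        """Hypothetical staging of verb-first surgical triplet recognition."""
        def __init__(self, feat_dim=512, n_verbs=10, n_instruments=6, n_targets=15):
            super().__init__()
            # Stand-in projection for a frozen VL (CLIP-style) visual encoder.
            self.visual_proj = nn.Linear(2048, feat_dim)
            # Temporal model over frames; motion cues drive the verb prediction.
            self.temporal = nn.GRU(feat_dim, feat_dim, batch_first=True)
            # Step 1: predict the verb first -- verbs carry the semantics.
            self.verb_head = nn.Linear(feat_dim, n_verbs)
            # Learnable verb prompt embeddings, one per verb class.
            self.verb_prompts = nn.Embedding(n_verbs, feat_dim)
            # Stand-in for calibrating prompt features with a biomedical LM.
            self.calibrate = nn.Linear(feat_dim, feat_dim)
            # Steps 2-3: instrument/target heads conditioned on the verb prompt.
            self.instrument_head = nn.Linear(2 * feat_dim, n_instruments)
            self.target_head = nn.Linear(2 * feat_dim, n_targets)

        def forward(self, frame_feats):              # (B, T, 2048) per-frame features
            x = self.visual_proj(frame_feats)        # (B, T, D)
            _, h = self.temporal(x)                  # h: (1, B, D)
            clip_feat = h[-1]                        # (B, D) clip-level feature
            verb_logits = self.verb_head(clip_feat)
            # Soft verb prompt: expected prompt embedding under the verb posterior.
            prompt = verb_logits.softmax(-1) @ self.verb_prompts.weight
            prompt = self.calibrate(prompt)          # biomedical calibration (sketch)
            cond = torch.cat([clip_feat, prompt], dim=-1)
            return verb_logits, self.instrument_head(cond), self.target_head(cond)

    model = ChainOfLookSketch()
    feats = torch.randn(2, 8, 2048)                  # 2 clips of 8 frames each
    verb, inst, tgt = model(feats)                   # (2, 10), (2, 6), (2, 15)

Staged this way, the verb posterior directly shapes the prompts seen by the instrument and target heads, which reflects the abstract's view of verbs as the carriers of semantics rather than a co-equal output.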





    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783


    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. chain-of-look prompting
    2. endoscopic videos
    3. surgical triplet recognition
    4. verb-centric

    Qualifiers

    • Research-article

    Funding Sources

    • National Science Foundation and the Institute of Education Sciences, U.S. Department of Education

    Conference

    MM '23
    Sponsor:
    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


    Bibliometrics & Citations

    Article Metrics

    • Downloads (last 12 months): 177
    • Downloads (last 6 weeks): 13
    Reflects downloads up to 20 Feb 2025


    Cited By

    • Surgical video workflow analysis via visual-language learning. npj Health Systems 2(1), 2025. https://doi.org/10.1038/s44401-024-00010-3
    • Deep learning for surgical workflow analysis: a survey of progresses, limitations, and trends. Artificial Intelligence Review 57(11), 2024. https://doi.org/10.1007/s10462-024-10929-6
    • Interaction-Centric Spatio-Temporal Context Reasoning for Multi-person Video HOI Recognition. Computer Vision – ECCV 2024, 419–435. https://doi.org/10.1007/978-3-031-73411-3_24
    • Tail-Enhanced Representation Learning for Surgical Triplet Recognition. Medical Image Computing and Computer Assisted Intervention – MICCAI 2024, 689–699. https://doi.org/10.1007/978-3-031-72120-5_64
