Abstract
Purpose
Semantic segmentation and activity classification are key components in building intelligent surgical systems that can understand and assist clinical workflow. In the operating room (OR), semantic segmentation is at the core of creating robots aware of their clinical surroundings, whereas activity classification aims at understanding OR workflow at a higher level. State-of-the-art semantic segmentation and activity recognition approaches are fully supervised, which is not scalable. Self-supervision can decrease the amount of annotated data needed.
Methods
We propose a new 3D self-supervised task for OR scene understanding that utilizes OR scene images captured with time-of-flight (ToF) cameras. Unlike other self-supervised approaches, whose handcrafted pretext tasks focus on 2D image features, our proposed task consists of predicting the relative 3D distance between image patches by exploiting the depth maps. By learning 3D spatial context, the model generates discriminative features for our downstream tasks.
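As an illustration of the pretext task, the sketch below shows one way such 3D pseudo-labels could be derived from a ToF depth map: patch centers are back-projected to 3D with a pinhole camera model, and their pairwise Euclidean distances serve as regression targets. This is a minimal sketch under assumed intrinsics and a fixed patch grid; the function names, patch size, and intrinsics format are illustrative and do not reproduce the authors' implementation, which (per the title) predicts distances between clustered regions rather than a regular grid.

import numpy as np

def backproject_patch_centers(depth, intrinsics, patch_size=32):
    # Back-project the center pixel of each image patch to a 3D point
    # using the depth map and a pinhole camera model.
    fx, fy, cx, cy = intrinsics
    h, w = depth.shape
    centers = []
    for v in range(patch_size // 2, h, patch_size):
        for u in range(patch_size // 2, w, patch_size):
            z = depth[v, u]
            centers.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return np.asarray(centers)

def relative_distance_targets(depth, intrinsics, patch_size=32):
    # Pairwise 3D Euclidean distances between patch centers serve as
    # pseudo-labels for the self-supervised distance prediction task.
    pts = backproject_patch_centers(depth, intrinsics, patch_size)
    return np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)

# Example with a synthetic 480x640 depth map and illustrative intrinsics.
targets = relative_distance_targets(np.random.rand(480, 640) * 3.0,
                                    (525.0, 525.0, 320.0, 240.0))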
Results
We evaluate our approach on two tasks and datasets containing multiview data captured in clinical scenarios. We demonstrate a noteworthy performance improvement on both tasks, particularly in the low-data regime, where the utility of self-supervised learning is highest.
Conclusion
We propose a novel privacy-preserving self-supervised approach that utilizes depth maps. Our method performs on par with other self-supervised approaches and could be an interesting way to alleviate the burden of full supervision.
Acknowledgements
This work is supported by a PhD fellowship from Intuitive Surgical and by French state funds managed within the “Plan Investissements d’Avenir” by the ANR (reference ANR-10-IAHU-02).
Ethics declarations
Conflict of interest
Idris Hamoud is funded by a research scholarship from Intuitive Surgical. Nicolas Padoy is a scientific advisor to Caresyntax on topics unrelated to this study. The other authors declare that they have no conflict of interest.
Ethical approval
All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Declaration of Helsinki and its later amendments or comparable ethical standards.
Informed consent
Data have been collected within an Institutional Review Board (IRB)-approved study, and all participants’ informed consent has been obtained.
Cite this article
Hamoud, I., Karargyris, A., Sharghi, A. et al. Self-supervised learning via cluster distance prediction for operating room context awareness. Int J CARS 17, 1469–1476 (2022). https://doi.org/10.1007/s11548-022-02629-9