
Self-supervised learning via cluster distance prediction for operating room context awareness

  • Original Article
  • Published in: International Journal of Computer Assisted Radiology and Surgery

Abstract

Purpose

Semantic segmentation and activity classification are key components of intelligent surgical systems able to understand and assist clinical workflow. In the operating room (OR), semantic segmentation is at the core of creating robots aware of their clinical surroundings, whereas activity classification aims at understanding OR workflow at a higher level. State-of-the-art approaches to both tasks are fully supervised, which is not scalable. Self-supervision can decrease the amount of annotated data needed.

Methods

We propose a new 3D self-supervised task for OR scene understanding that uses OR scene images captured with time-of-flight (ToF) cameras. In contrast to other self-supervised approaches, whose handcrafted pretext tasks focus on 2D image features, our task consists of predicting the relative 3D distance between image patches by exploiting the depth maps. By learning 3D spatial context, the model generates discriminative features for our downstream tasks.
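The pretext task described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the regular grid of square patches, the `patch_size` value, and the `(fx, fy, cx, cy)` intrinsics layout are all assumptions made for illustration (the paper predicts distances between clusters rather than grid patches). The idea is the same: back-project depth-map regions to 3D centroids with a pinhole camera model, then use their pairwise Euclidean distances as regression targets.

```python
import numpy as np

def patch_centroids_3d(depth, intrinsics, patch_size=32):
    """Back-project each square patch of a depth map to a 3D centroid
    using a pinhole camera model with intrinsics (fx, fy, cx, cy)."""
    fx, fy, cx, cy = intrinsics
    h, w = depth.shape
    centroids = []
    for top in range(0, h - patch_size + 1, patch_size):
        for left in range(0, w - patch_size + 1, patch_size):
            z = depth[top:top + patch_size, left:left + patch_size]
            # Pixel coordinate grids for this patch (v = rows, u = cols).
            v, u = np.mgrid[top:top + patch_size, left:left + patch_size]
            # Pinhole back-projection of every pixel to camera coordinates.
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
            centroids.append(pts.mean(axis=0))
    return np.array(centroids)

def relative_distance_targets(centroids):
    """Pairwise Euclidean distances between patch centroids: the
    regression targets for the self-supervised pretext task."""
    diff = centroids[:, None, :] - centroids[None, :, :]
    return np.linalg.norm(diff, axis=-1)
```

Because the targets come directly from the depth maps, no manual annotation is needed; a network trained to regress these distances from image patches is forced to learn 3D spatial context.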

Results

Our approach is evaluated on two tasks and datasets containing multi-view data captured in clinical scenarios. We demonstrate a noteworthy improvement on both tasks, particularly in low-data regimes, where the utility of self-supervised learning is highest.

Conclusion

We propose a novel privacy-preserving self-supervised approach that exploits depth maps. Our method performs on par with other self-supervised approaches and is a promising way to alleviate the burden of full supervision.



Acknowledgements

This work is supported by a PhD fellowship from Intuitive Surgical and by French state funds managed within the “Plan Investissements d’Avenir” by the ANR (reference ANR-10-IAHU-02).

Author information

Corresponding author: Idris Hamoud.

Ethics declarations

Conflict of interest

Idris Hamoud is funded by a research scholarship from Intuitive Surgical. Nicolas Padoy is a scientific advisor to Caresyntax on topics unrelated to this study. The other authors declare that they have no conflict of interest.

Ethical approval

All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Declaration of Helsinki and its later amendments or comparable ethical standards.

Informed consent

Data have been collected within an Institutional Review Board (IRB)-approved study, and all participants’ informed consent has been obtained.



About this article


Cite this article

Hamoud, I., Karargyris, A., Sharghi, A. et al. Self-supervised learning via cluster distance prediction for operating room context awareness. Int J CARS 17, 1469–1476 (2022). https://doi.org/10.1007/s11548-022-02629-9

