Abstract
In this work, we explore whether it is possible to learn representations of endoscopic video frames, without supervision, to perform tasks such as identifying surgical tool presence. We use a maximum mean discrepancy (MMD) variational autoencoder (VAE) to learn low-dimensional latent representations of endoscopic video frames, and we apply three different methods of manipulating these latent representations to distinguish frames that contain surgical tools from those that do not. Our fully unsupervised methods identify tool presence in endoscopic video frames with average precision of 71.56, 73.93, and 76.18, respectively, comparable to supervised methods. Our code is available at https://github.com/zdavidli/tool-presence/.
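The MMD-VAE replaces the KL divergence term of the standard VAE objective with a maximum mean discrepancy penalty between latent samples and the prior. Below is a minimal PyTorch sketch of such a loss, assuming an N(0, I) prior, a Gaussian RBF kernel, and a mean-squared reconstruction term; the function names (`rbf_kernel`, `mmd_vae_loss`), kernel bandwidth, and MMD weight are illustrative assumptions rather than the authors' implementation (see the linked repository for that).

```python
import torch

def rbf_kernel(x, y, sigma_sqr=2.0):
    # x: (n, d), y: (m, d) -> (n, m) Gaussian RBF kernel matrix.
    sq_dist = ((x.unsqueeze(1) - y.unsqueeze(0)) ** 2).sum(dim=2)
    return torch.exp(-sq_dist / (2.0 * sigma_sqr))

def mmd(z_q, z_p, sigma_sqr=2.0):
    # Squared maximum mean discrepancy between samples from q(z) and p(z).
    return (rbf_kernel(z_q, z_q, sigma_sqr).mean()
            + rbf_kernel(z_p, z_p, sigma_sqr).mean()
            - 2.0 * rbf_kernel(z_q, z_p, sigma_sqr).mean())

def mmd_vae_loss(x, x_recon, z, mmd_weight=100.0):
    # Pixel-wise reconstruction error plus an MMD penalty that pulls the
    # aggregated posterior toward the N(0, I) prior, in place of the KL term.
    recon = ((x_recon - x) ** 2).mean()
    z_prior = torch.randn_like(z)  # samples from the prior
    return recon + mmd_weight * mmd(z, z_prior)
```

In this sketch, `z` would be a batch of latent codes from the encoder and `x_recon` the decoder's reconstructions; the per-frame latent codes learned this way are what the three tool-presence methods then operate on.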
Acknowledgements
This work was supported by a Johns Hopkins University Provost's Postdoctoral Fellowship, an NVIDIA GPU grant, and other Johns Hopkins University internal funds. We thank Daniel Malinsky and Robert DiPietro for their invaluable feedback, and the JHU Department of Computer Science for providing a research GPU cluster.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Li, D.Z., Ishii, M., Taylor, R.H., Hager, G.D., Sinha, A. (2020). Learning Representations of Endoscopic Videos to Detect Tool Presence Without Supervision. In: Syeda-Mahmood, T., et al. (eds.) Multimodal Learning for Clinical Decision Support and Clinical Image-Based Procedures. CLIP ML-CDS 2020. Lecture Notes in Computer Science, vol. 12445. Springer, Cham. https://doi.org/10.1007/978-3-030-60946-7_6
DOI: https://doi.org/10.1007/978-3-030-60946-7_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60945-0
Online ISBN: 978-3-030-60946-7
eBook Packages: Computer Science, Computer Science (R0)