Multi-view Surgical Video Action Detection via Mixed Global View Attention

Schmidt, Adam; Sharghi, Aidean; Haugerud, Helene; Oh, Daniel; Mohareri, Omid

doi:10.1007/978-3-030-87202-1_60

Conference paper
First Online: 21 September 2021

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12904)

Abstract

Automatic surgical activity detection in the operating room (OR) can enable intelligent systems that potentially lead to a more efficient surgical workflow. Real-world implementations of video activity detection in the OR will most likely rely on multiple video feeds observing the environment from different viewpoints to handle occlusion and clutter, yet research on this setting remains under-explored. This is perhaps due to the lack of a suitable dataset. As our first contribution, we therefore introduce the first large-scale multi-view surgical action detection dataset, which includes over 120 temporally annotated robotic surgery operations, each recorded from 4 different viewpoints, resulting in 480 full-length surgical videos. As our second contribution, we design a novel model architecture that uses multiple time-synchronized videos with a shared field of view to better detect the activity taking place at any given time. We explore early, hybrid, and late fusion methods for combining data from different views. We settle on a late fusion model that remains insensitive to sensor locations and feeding order, and that improves over single-view performance by mixing views in the style of attention. Our model learns how to dynamically weight and fuse information across all views. We demonstrate improvements in mean Average Precision across the board using our new model.
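
To make the idea of order-insensitive, attention-style late fusion concrete, the following is a minimal sketch and not the authors' architecture. It assumes each view has already been reduced to a fixed-length feature vector by some per-view backbone; the names `fuse_views` and `query`, the scaled dot-product scoring, and the toy numbers are all hypothetical choices made for illustration. Each view is scored against a shared query vector, the scores are normalized with a softmax across views, and the fused feature is the resulting weighted sum, so reordering the views does not change the output.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_views(view_features, query):
    """Attention-weighted late fusion of per-view feature vectors.

    view_features: list of equal-length feature vectors, one per camera view
                   (assumed to come from a per-view backbone, not shown here).
    query:         hypothetical learned vector used to score each view.
    Returns the fused feature vector and the per-view attention weights.
    """
    dim = len(query)
    scale = math.sqrt(dim)
    # One scalar relevance score per view (scaled dot product with the query).
    scores = [
        sum(f * q for f, q in zip(feat, query)) / scale
        for feat in view_features
    ]
    # Normalize scores across views so the weights sum to 1.
    weights = softmax(scores)
    # Weighted sum over views; the sum does not depend on view order.
    fused = [
        sum(w * feat[i] for w, feat in zip(weights, view_features))
        for i in range(dim)
    ]
    return fused, weights


# Toy example: 4 views, 3-dimensional features (made-up numbers).
views = [
    [0.2, 1.0, -0.5],
    [0.9, 0.1, 0.3],
    [-0.4, 0.7, 0.8],
    [0.0, 0.0, 0.1],
]
query = [0.5, -0.2, 1.0]

fused, weights = fuse_views(views, query)
fused_reversed, _ = fuse_views(views[::-1], query)
```

Because the weights are computed per view and combined by a sum, `fused` and `fused_reversed` agree up to floating-point rounding. This is one simple way to obtain the insensitivity to feeding order that the abstract describes.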

Acknowledgements

This work was completed while Adam Schmidt was an intern at Intuitive Surgical.

Author information

Authors and Affiliations

University of British Columbia, Vancouver, Canada
Adam Schmidt
Intuitive Surgical Inc., Sunnyvale, USA
Aidean Sharghi, Helene Haugerud, Daniel Oh & Omid Mohareri

Corresponding author

Correspondence to Adam Schmidt.

Editor information

Editors and Affiliations

Erasmus MC - University Medical Center Rotterdam, Rotterdam, The Netherlands
Marleen de Bruijne
University of Basel, Allschwil, Switzerland
Philippe C. Cattin
Inria Nancy Grand Est, Villers-lès-Nancy, France
Stéphane Cotin
ICube, Université de Strasbourg, CNRS, Strasbourg, France
Nicolas Padoy
National Center for Tumor Diseases (NCT/UCC), Dresden, Germany
Stefanie Speidel
Tencent Jarvis Lab, Shenzhen, China
Yefeng Zheng
ICube, Université de Strasbourg, CNRS, Strasbourg, France
Caroline Essert

Electronic supplementary material

Supplementary material 1 (pdf 2486 KB)

About this paper

Cite this paper

Schmidt, A., Sharghi, A., Haugerud, H., Oh, D., Mohareri, O. (2021). Multi-view Surgical Video Action Detection via Mixed Global View Attention. In: de Bruijne, M., et al. Medical Image Computing and Computer Assisted Intervention – MICCAI 2021. MICCAI 2021. Lecture Notes in Computer Science, vol 12904. Springer, Cham. https://doi.org/10.1007/978-3-030-87202-1_60

DOI: https://doi.org/10.1007/978-3-030-87202-1_60
Published: 21 September 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-87201-4
Online ISBN: 978-3-030-87202-1
eBook Packages: Computer Science, Computer Science (R0)
