
Multi-view Surgical Video Action Detection via Mixed Global View Attention

  • Conference paper
Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12904)

Abstract

Automatic surgical activity detection in the operating room can enable intelligent systems and a more efficient surgical workflow. Real-world video activity detection in the OR will most likely rely on multiple video feeds observing the environment from different viewpoints to handle occlusion and clutter, yet research on multi-view detection remains under-explored, perhaps due to the lack of a suitable dataset. As our first contribution, we therefore introduce the first large-scale multi-view surgical action detection dataset, comprising over 120 temporally annotated robotic surgery operations, each recorded from 4 different viewpoints, for a total of 480 full-length surgical videos. As our second contribution, we design a novel model architecture that detects surgical actions by exploiting multiple time-synchronized videos with a shared field of view to better detect the activity taking place at any time. We explore early, hybrid, and late fusion methods for combining data across views, and settle on a late fusion model that is insensitive to sensor locations and feed order, improving over single-view performance via attention-style mixing: the model learns to dynamically weight and fuse information across all views. We demonstrate improvements in mean Average Precision across the board with our new model.
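The core idea of the late fusion described above can be sketched as attention-style mixing over per-view features: each view's feature vector is scored, the scores are normalized with a softmax, and the views are combined as a weighted sum, which is invariant to the order in which the camera feeds arrive. The following is a minimal NumPy sketch of that mechanism, not the authors' implementation; the function `attention_fuse` and the fixed scoring vector `w` are hypothetical stand-ins for learned components.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_fuse(view_feats, w):
    """Late-fuse per-view features with attention-style mixing.

    view_feats: (V, D) array, one D-dim feature vector per camera view.
    w: (D,) scoring vector (learned in practice; fixed here for illustration).
    Returns a single (D,) fused feature. Because the softmax-weighted sum
    treats views as a set, the result does not depend on their feed order.
    """
    scores = view_feats @ w      # one scalar relevance score per view
    alpha = softmax(scores)      # normalized attention weights over views
    return alpha @ view_feats    # dynamically weighted sum across views

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8))   # 4 views, 8-dim features each
w = rng.standard_normal(8)

fused = attention_fuse(feats, w)
permuted = attention_fuse(feats[[2, 0, 3, 1]], w)
assert np.allclose(fused, permuted)   # insensitive to view ordering
```

The permutation check at the end mirrors the paper's design goal: because views are mixed by content-dependent weights rather than by position, the fusion stays insensitive to sensor locations and feeding order.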



Acknowledgements

This work was completed while Adam Schmidt was an intern at Intuitive Surgical.


Corresponding author

Correspondence to Adam Schmidt.


Electronic supplementary material

Supplementary material 1 (PDF 2486 KB)


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Schmidt, A., Sharghi, A., Haugerud, H., Oh, D., Mohareri, O. (2021). Multi-view Surgical Video Action Detection via Mixed Global View Attention. In: de Bruijne, M., et al. (eds.) Medical Image Computing and Computer Assisted Intervention – MICCAI 2021. Lecture Notes in Computer Science, vol. 12904. Springer, Cham. https://doi.org/10.1007/978-3-030-87202-1_60


  • DOI: https://doi.org/10.1007/978-3-030-87202-1_60

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-87201-4

  • Online ISBN: 978-3-030-87202-1

  • eBook Packages: Computer Science (R0)
