\(\textsf{GLSFormer}\): Gated - Long, Short Sequence Transformer for Step Recognition in Surgical Videos

  • Conference paper

In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2023 (MICCAI 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14228)

Abstract

Automated surgical step recognition is an important task that can significantly improve patient safety and decision-making during surgery. Existing state-of-the-art methods for surgical step recognition either rely on separate, multi-stage modeling of spatial and temporal information or operate at short-range temporal resolution when the two are learned jointly. However, the benefits of jointly modeling spatio-temporal features together with long-range information have not been taken into account. In this paper, we propose a vision transformer-based approach that jointly learns spatio-temporal features directly from a sequence of frame-level patches. Our method incorporates a gated-temporal attention mechanism that intelligently combines short-term and long-term spatio-temporal feature representations. We extensively evaluate our approach on two cataract surgery video datasets, Cataract-101 and D99, and demonstrate superior performance compared to various state-of-the-art methods. These results validate the suitability of our proposed approach for automated surgical step recognition. Our code is released at: https://github.com/nisargshah1999/GLSFormer.
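To make the gated-temporal attention described in the abstract concrete, the sketch below shows one plausible way to fuse a short-term and a long-term temporal attention branch with a learned sigmoid gate. It is a minimal PyTorch illustration under assumed names and shapes (GatedTemporalAttention, short_window, one token per frame rather than full patch sequences); it is not the authors' implementation, which is available at the repository linked above.

    import torch
    import torch.nn as nn

    class GatedTemporalAttention(nn.Module):
        """Illustrative gated fusion of short- and long-term temporal
        attention over per-frame tokens (assumed design, not the
        authors' code)."""

        def __init__(self, dim, num_heads=8, short_window=4):
            super().__init__()
            self.short_window = short_window  # frames visible to the short-term branch
            self.short_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.long_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            # The gate predicts a per-token weight for mixing the two branches.
            self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

        def forward(self, x):
            # x: (batch, frames, dim) -- one embedding per frame for brevity;
            # GLSFormer itself operates on sequences of frame-level patches.
            recent = x[:, -self.short_window:, :]              # short-range context
            short, _ = self.short_attn(x, recent, recent)      # attend to recent frames
            long_, _ = self.long_attn(x, x, x)                 # attend over the full sequence
            g = self.gate(torch.cat([short, long_], dim=-1))   # per-token mixing weight
            return g * short + (1.0 - g) * long_               # gated combination

    # Usage: a batch of 2 clips, 16 frames each, 768-dim frame embeddings.
    tokens = torch.randn(2, 16, 768)
    fused = GatedTemporalAttention(dim=768)(tokens)
    print(fused.shape)  # torch.Size([2, 16, 768])

The gate makes the mixing data-dependent: frames whose step identity is evident from recent motion can lean on the short-term branch, while ambiguous frames can draw on long-range context.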

Acknowledgements

This work was supported by a grant from the National Institutes of Health, USA (R01EY033065). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Author information

Correspondence to Nisarg A. Shah.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Shah, N.A., Sikder, S., Vedula, S.S., Patel, V.M. (2023). \(\textsf{GLSFormer}\): Gated - Long, Short Sequence Transformer for Step Recognition in Surgical Videos. In: Greenspan, H., et al. Medical Image Computing and Computer Assisted Intervention – MICCAI 2023. MICCAI 2023. Lecture Notes in Computer Science, vol 14228. Springer, Cham. https://doi.org/10.1007/978-3-031-43996-4_37

  • DOI: https://doi.org/10.1007/978-3-031-43996-4_37

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43995-7

  • Online ISBN: 978-3-031-43996-4

  • eBook Packages: Computer Science, Computer Science (R0)
