SGT: Scene Graph-Guided Transformer for Surgical Report Generation

Lin, Chen; Zheng, Shuai; Liu, Zhizhe; Li, Youru; Zhu, Zhenfeng; Zhao, Yao

doi:10.1007/978-3-031-16449-1_48

Conference paper
First Online: 17 September 2022

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13437)

Abstract

A robotic surgical report records the operations performed during surgery and informs subsequent treatment, so generating accurate surgical reports is especially important. Because a surgical scene contains numerous interactions between instruments and tissue, we propose a Scene Graph-guided Transformer (SGT) for surgical report generation. The model builds on the transformer architecture to capture the complex interactions between tissue and instruments from both global and local perspectives. On the global side, we propose a relation-driven attention that samples numerous interactive relationships to form a diverse and representative augmented memory, which facilitates a comprehensive description of the interactions in the generated report. On the local side, to characterize the specific interactions in each surgical image, we propose a simple yet ingenious approach for homogenizing the input heterogeneous scene graph, which models the local interactions by injecting graph-induced attention into the encoder. A dataset from clinical nephrectomy is used for performance evaluation, and the experimental results show that SGT significantly improves the quality of the generated surgical reports, far exceeding other state-of-the-art methods. The code is publicly available at: https://github.com/ccccchenllll/SGT_master.
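
To make the idea of graph-induced attention more concrete, the following is a minimal sketch and not the authors' implementation (see the linked repository for that). It assumes that homogenizing the scene graph means treating instruments, tissue, and their interactions all as nodes of one type, and that the graph is injected by masking self-attention with the adjacency matrix. The function name `graph_masked_attention`, the toy graph, and the omission of learned query/key/value projections are all illustrative assumptions.

```python
import numpy as np


def graph_masked_attention(x, adj):
    """Single-head self-attention restricted to graph neighbours.

    x:   (n, d) array of node features (hypothetical embeddings).
    adj: (n, n) 0/1 adjacency matrix; self-loops are added here.
    Learned projections are omitted for brevity, so Q = K = V = x.
    """
    n, d = x.shape
    # A node may attend to itself and to its neighbours only.
    mask = (adj + np.eye(n)) > 0
    scores = x @ x.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)
    # Numerically stable row-wise softmax; masked entries become 0.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x, weights


# Toy homogenized graph (assumed): instrument (0) - interaction (1) - tissue (2).
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]])
x = np.random.default_rng(0).normal(size=(3, 4))
out, w = graph_masked_attention(x, adj)
```

In this toy graph the instrument and tissue nodes are not directly connected, so their mutual attention weights are zero and information passes between them only through the interaction node.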

Acknowledgement

This work was supported in part by the Science and Technology Innovation 2030 – New Generation Artificial Intelligence Major Project under Grant 2018AAA0102100, the National Natural Science Foundation of China under Grant No. 61976018, and the Beijing Natural Science Foundation under Grant No. 7222313.

Author information

Authors and Affiliations

Institute of Information Science, Beijing Jiaotong University, Beijing, China
Chen Lin, Shuai Zheng, Zhizhe Liu, Youru Li, Zhenfeng Zhu & Yao Zhao
Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing, China
Chen Lin, Shuai Zheng, Zhizhe Liu, Youru Li, Zhenfeng Zhu & Yao Zhao

Corresponding author

Correspondence to Zhenfeng Zhu.

Editor information

Editors and Affiliations

Rochester Institute of Technology, Rochester, NY, USA
Linwei Wang
Chinese University of Hong Kong, Hong Kong, Hong Kong
Qi Dou
University of Virginia, Charlottesville, VA, USA
P. Thomas Fletcher
National Center for Tumor Diseases (NCT/UCC), Dresden, Germany
Stefanie Speidel
Case Western Reserve University, Cleveland, OH, USA
Shuo Li

Cite this paper

Lin, C., Zheng, S., Liu, Z., Li, Y., Zhu, Z., Zhao, Y. (2022). SGT: Scene Graph-Guided Transformer for Surgical Report Generation. In: Wang, L., Dou, Q., Fletcher, P.T., Speidel, S., Li, S. (eds) Medical Image Computing and Computer Assisted Intervention – MICCAI 2022. MICCAI 2022. Lecture Notes in Computer Science, vol 13437. Springer, Cham. https://doi.org/10.1007/978-3-031-16449-1_48

DOI: https://doi.org/10.1007/978-3-031-16449-1_48
Published: 17 September 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-16448-4
Online ISBN: 978-3-031-16449-1
eBook Packages: Computer Science, Computer Science (R0)
