Abstract
The robotic surgical report reflects the operations during surgery and relates to the subsequent treatment. Therefore, it is especially important to generate accurate surgical reports. Given that there are numerous interactions between instruments and tissue in the surgical scene, we propose a Scene Graph-guided Transformer (SGT) to solve the issue of surgical report generation. The model is based on the structure of transformer to understand the complex interactions between tissue and the instruments from both global and local perspectives. On the one hand, we propose a relation driven attention to facilitate the comprehensive description of the interaction in a generated report via sampling of numerous interactive relationships to form a diverse and representative augmented memory. On the other hand, to characterize the specific interactions in each surgical image, a simple yet ingenious approach is proposed for homogenizing the input heterogeneous scene graph, which plays an effective role in modeling the local interactions by injecting the graph-induced attention into the encoder. The dataset from clinical nephrectomy is utilized for performance evaluation and the experimental results show that our SGT model can significantly improve the quality of the generated surgical medical report, far exceeding the other state-of-the-art methods. The code is public available at: https://github.com/ccccchenllll/SGT_master.
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsReferences
Aker, A., Gaizauskas, R.: Generating image descriptions using dependency relational patterns. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 1250–1258 (2010)
Allan, M., et al.: 2018 robotic scene segmentation challenge. arXiv preprint arXiv:2001.11190 (2020)
Banerjee, S., Lavie, A.: Meteor: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72 (2005)
Cornia, M., Stefanini, M., Baraldi, L., Cucchiara, R.: Meshed-memory transformer for image captioning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10578–10587 (2020)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Farhadi, A., et al.: Every picture tells a story: generating sentences from images. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6314, pp. 15–29. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15561-1_2
Hou, B., Kaissis, G., Summers, R.M., Kainz, B.: RATCHET: medical transformer for chest X-ray diagnosis and reporting. In: de Bruijne, M., et al. (eds.) MICCAI 2021. LNCS, vol. 12907, pp. 293–303. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-87234-2_28
Islam, M., Seenivasan, L., Ming, L.C., Ren, H.: Learning and reasoning with the graph structure representation in robotic surgery. In: Martel, A.L., et al. (eds.) MICCAI 2020. LNCS, vol. 12263, pp. 627–636. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59716-0_60
Jing, B., Xie, P., Xing, E.: On the automatic generation of medical imaging reports. arXiv preprint arXiv:1711.08195 (2017)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Krishna, R., et al.: Visual genome: connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vision 123(1), 32–73 (2017)
Kulesza, A., Taskar, B.: k-DPPs: fixed-size determinantal point processes. In: Proceedings of the 28th International Conference on Machine Learning (ICML) (2011)
Lin, C.Y.: Rouge: a package for automatic evaluation of summaries. In: Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), pp. 74–81 (2004)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., et al. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Liu, Z., Zhu, Z., Zheng, S., Liu, Y., Zhou, J., Zhao, Y.: Margin preserving self-paced contrastive learning towards domain adaptation for medical image segmentation. IEEE J. Biomed. Health Inform. 26(2), 638–647 (2022). https://doi.org/10.1109/JBHI.2022.3140853
Macchi, O.: The coincidence approach to stochastic point processes. Adv. Appl. Probab. 7(1), 83–122 (1975)
Pan, J.Y., Yang, H.J., Duygulu, P., Faloutsos, C.: Automatic image captioning. In: 2004 IEEE International Conference on Multimedia and Expo (ICME)(IEEE Cat. No. 04TH8763), vol. 3, pp. 1987–1990. IEEE (2004)
Pan, Y., Yao, T., Li, Y., Mei, T.: X-linear attention networks for image captioning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10971–10980 (2020)
Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp. 311–318 (2002)
Vaswani, A., et al.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Vedantam, R., Lawrence Zitnick, C., Parikh, D.: CIDEr: consensus-based image description evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566–4575 (2015)
Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164 (2015)
Wang, X., Peng, Y., Lu, L., Lu, Z., Summers, R.M.: TieNet: text-image embedding network for common thorax disease classification and reporting in chest X-rays. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9049–9058 (2018)
Xiong, Y., Du, B., Yan, P.: Reinforced transformer for medical image captioning. In: Suk, H.I., Liu, M., Yan, P., Lian, C. (eds.) MLMI 2019. LNCS, vol. 11861, pp. 673–680. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32692-0_77
Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: International Conference on Machine Learning, pp. 2048–2057. PMLR (2015)
Xu, M., Islam, M., Lim, C.M., Ren, H.: Class-incremental domain adaptation with smoothing and calibration for surgical report generation. In: de Bruijne, M., et al. (eds.) MICCAI 2021. LNCS, vol. 12904, pp. 269–278. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-87202-1_26
Yao, B.Z., Yang, X., Lin, L., Lee, M.W., Zhu, S.C.: I2T: image parsing to text description. Proc. IEEE 98(8), 1485–1508 (2010)
Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Linguist. 2, 67–78 (2014)
Zhang, W., et al.: Deep learning based torsional nystagmus detection for dizziness and vertigo diagnosis. Biomed. Signal Process. Control 68, 102616 (2021)
Zheng, S., et al.: Multi-modal graph learning for disease prediction. IEEE Trans. Med. Imaging 41(9), 2207–2216 (2022). https://doi.org/10.1109/TMI.2022.3159264
Acknowledgement
This work was supported in part by Science and Technology Innovation 2030 – New Generation Artificial Intelligence Major Project under Grant 2018AAA0102100, National Natural Science Foundation of China under Grant No. 61976018, Beijing Natural Science Foundation under Grant No. 7222313.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
1 Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Lin, C., Zheng, S., Liu, Z., Li, Y., Zhu, Z., Zhao, Y. (2022). SGT: Scene Graph-Guided Transformer for Surgical Report Generation. In: Wang, L., Dou, Q., Fletcher, P.T., Speidel, S., Li, S. (eds) Medical Image Computing and Computer Assisted Intervention – MICCAI 2022. MICCAI 2022. Lecture Notes in Computer Science, vol 13437. Springer, Cham. https://doi.org/10.1007/978-3-031-16449-1_48
Download citation
DOI: https://doi.org/10.1007/978-3-031-16449-1_48
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-16448-4
Online ISBN: 978-3-031-16449-1
eBook Packages: Computer ScienceComputer Science (R0)