Abstract
Visual Question Answering (VQA) in the surgical domain, powered by Large Language Models (LLMs), offers a distinct opportunity to improve intra-operative decision-making and facilitate intuitive surgeon-AI interaction. However, the development of LLMs for surgical VQA is hindered by the scarcity of diverse and extensive datasets with complex reasoning tasks. Moreover, contextual fusion of the image and text modalities remains an open research challenge due to the inherent differences between these two types of information and the complexity involved in aligning them. This paper introduces PitVQA, a novel dataset specifically designed for VQA in endonasal pituitary surgery, and PitVQA-Net, an adaptation of GPT-2 with a novel image-grounded text embedding for surgical VQA. PitVQA comprises 25 procedural videos and a rich collection of question-answer pairs spanning crucial surgical aspects such as phase and step recognition, context understanding, tool detection and localization, and tool-tissue interactions. PitVQA-Net consists of a novel image-grounded text embedding that projects image and text features into a shared embedding space, and a GPT-2 backbone with an excitation block classification head that generates contextually relevant answers within the complex domain of endonasal pituitary surgery. Our image-grounded text embedding leverages joint embedding, cross-attention, and contextual representation to understand the contextual relationship between questions and surgical images. We demonstrate the effectiveness of PitVQA-Net on both PitVQA and the publicly available EndoVis18-VQA dataset, achieving improvements in balanced accuracy of 8% and 9% over the most recent baselines, respectively. Our code and dataset are available at https://github.com/mobarakol/PitVQA.
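For concreteness, the sketch below illustrates the two components the abstract names: an image-grounded text embedding in which question tokens cross-attend to image patch features in a shared embedding space, and a squeeze-and-excitation-style classification head over the pooled backbone state. This is a minimal PyTorch sketch under our own assumptions; all class names, layer sizes, and the pooling choice are illustrative, not the authors' released implementation (see the linked repository for that).

```python
# Minimal sketch, assuming ViT-style patch features and GPT-2-sized
# hidden dimensions; names and sizes are hypothetical.
import torch
import torch.nn as nn

class ImageGroundedTextEmbedding(nn.Module):
    """Question tokens attend to image patches via cross-attention,
    yielding text embeddings grounded in the surgical scene."""
    def __init__(self, dim: int = 768, n_heads: int = 8):
        super().__init__()
        self.img_proj = nn.Linear(dim, dim)  # project patches into the text space
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_emb: torch.Tensor, img_feats: torch.Tensor) -> torch.Tensor:
        # text_emb:  (B, T, dim) question token embeddings
        # img_feats: (B, P, dim) patch features from a vision encoder
        img = self.img_proj(img_feats)
        attended, _ = self.cross_attn(query=text_emb, key=img, value=img)
        return self.norm(text_emb + attended)  # residual keeps the textual context

class ExcitationHead(nn.Module):
    """Squeeze-and-excitation-style channel gating over the pooled
    backbone state before answer classification."""
    def __init__(self, dim: int = 768, n_classes: int = 18, reduction: int = 16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(),
            nn.Linear(dim // reduction, dim), nn.Sigmoid(),
        )
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        # pooled: (B, dim), e.g. the final-token hidden state of the GPT-2 backbone
        return self.classifier(pooled * self.gate(pooled))
```

In this reading, the grounded embeddings produced by `ImageGroundedTextEmbedding` are fed to the GPT-2 backbone, and its pooled output is classified by `ExcitationHead` into a fixed answer vocabulary.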
Acknowledgments
This work was supported in whole, or in part, by the Wellcome/EPSRC Centre for Interventional and Surgical Sciences (WEISS) [203145/Z/16/Z] and the Engineering and Physical Sciences Research Council (EPSRC) [EP/W00805X/1, EP/Y01958X/1]; Horizon 2020 FET Open [863146]; the UCLH/UCL NIHR Biomedical Research Centre; the Department of Science, Innovation and Technology (DSIT); and the Royal Academy of Engineering Chair in Emerging Technologies Scheme. AD is supported by EPSRC [EP/S021612/1]. DZK is supported by an NIHR Academic Clinical Fellowship. With thanks to Digital Surgery Ltd, a Medtronic company, for access to Touch Surgery™ Enterprise for both video recording and storage.
Ethics declarations
Disclosure of Interests
The authors have no competing interests to declare.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
He, R. et al. (2024). PitVQA: Image-Grounded Text Embedding LLM for Visual Question Answering in Pituitary Surgery. In: Linguraru, M.G., et al. Medical Image Computing and Computer Assisted Intervention – MICCAI 2024. MICCAI 2024. Lecture Notes in Computer Science, vol 15006. Springer, Cham. https://doi.org/10.1007/978-3-031-72089-5_46
DOI: https://doi.org/10.1007/978-3-031-72089-5_46
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72088-8
Online ISBN: 978-3-031-72089-5