
Recursive Visual Programming

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15101)


Abstract

Visual Programming (VP) has emerged as a powerful framework for Visual Question Answering (VQA). By generating and executing bespoke code for each question, these methods have advanced the use of Large Language Models (LLMs) for complex problem-solving. Despite their potential, existing VP methods generate all code in a single function, which does not fully utilize the reasoning capacity of LLMs or the modular adaptability of code. This results in code that is suboptimal in terms of both accuracy and interpretability. Inspired by human coding practices, we propose Recursive Visual Programming (RVP), which better harnesses the reasoning capacity of LLMs, provides a modular code structure between code pieces, and elegantly assigns different return types to sub-problems. RVP approaches VQA tasks with a top-down recursive code generation approach, allowing decomposition of complicated problems into smaller parts. We show RVP’s efficacy through extensive experiments on benchmarks including VSR, COVR, GQA, and NextQA, underscoring the value of adopting human-like recursive and modular programming techniques for solving VQA tasks. Our code is available at https://github.com/para-lost/RVP.
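To make the idea of top-down recursive code generation concrete, below is a minimal, self-contained Python sketch. Everything in it is illustrative: the helpers query_llm, simple_query, and is_simple are hypothetical stubs standing in for a code-generation LLM, a pretrained VQA perception call, and a decomposition heuristic, and they do not reflect RVP's actual API (see the linked repository for the real implementation).

```python
"""Minimal sketch of top-down recursive code generation for VQA.

All names here (query_llm, simple_query, is_simple, recursive_query) are
hypothetical stand-ins, not RVP's actual API: query_llm would wrap a
code-generation LLM and simple_query a pretrained VQA model; both are
stubbed so the control flow runs end to end.
"""


def query_llm(prompt: str) -> str:
    # Stub: a real system would prompt a code LLM here. The returned code
    # is free to call recursive_query() on any sub-question it poses.
    return (
        "def execute_command(image):\n"
        "    left = recursive_query(image, 'animal on the left?')\n"
        "    right = recursive_query(image, 'animal on the right?')\n"
        "    return 'yes' if left == right else 'no'\n"
    )


def simple_query(image, question: str) -> str:
    # Stub: a real system would run a VQA model on the image here.
    return "cat"


def is_simple(question: str) -> bool:
    # Toy heuristic: treat short questions as atomic perception queries.
    return len(question.split()) <= 4


def recursive_query(image, question: str, depth: int = 0, max_depth: int = 3):
    """Answer `question` about `image`, recursing on generated sub-questions."""
    if depth >= max_depth or is_simple(question):
        return simple_query(image, question)  # base case: direct perception

    # Ask the LLM for a bespoke function that decomposes this question.
    code = query_llm(f"Write execute_command(image) answering: {question}")

    # Run the generated code with the recursive entry point in scope, so
    # each sub-question is answered by its own, smaller generated program.
    scope = {
        "recursive_query": lambda img, q: recursive_query(img, q, depth + 1, max_depth),
        "simple_query": simple_query,
    }
    exec(code, scope)  # defines execute_command inside `scope`
    return scope["execute_command"](image)


if __name__ == "__main__":
    print(recursive_query(None, "are the two animals in the image the same species?"))
    # -> 'yes' with these stubs
```

The point the sketch tries to capture is that generated code may itself call recursive_query on sub-questions, so each sub-problem lives in its own small function and can return whatever type suits it (a string, a count, a boolean, and so on), rather than everything being forced into one monolithic function.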


Notes

  1. www.platform.openai.com.


Acknowledgements

We would like to thank Ben Bogin for helping us test on the private COVR test set, and Guangyuan Jiang for proofreading the paper and providing valuable feedback.

Author information


Corresponding author

Correspondence to Jiaxin Ge.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 1710 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Ge, J., Subramanian, S., Shi, B., Herzig, R., Darrell, T. (2025). Recursive Visual Programming. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15101. Springer, Cham. https://doi.org/10.1007/978-3-031-72775-7_1

  • DOI: https://doi.org/10.1007/978-3-031-72775-7_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72774-0

  • Online ISBN: 978-3-031-72775-7

  • eBook Packages: Computer Science, Computer Science (R0)
