Abstract
Visual Programming (VP) has emerged as a powerful framework for Visual Question Answering (VQA). By generating and executing bespoke code for each question, these methods demonstrate progress in leveraging Large Language Models (LLMs) for complex problem solving. Despite their potential, existing VP methods generate all code in a single function, which fails to fully exploit the reasoning capacity of LLMs and the modular adaptability of code. This results in code that is suboptimal in both accuracy and interpretability. Inspired by human coding practices, we propose Recursive Visual Programming (RVP), which better harnesses the reasoning capacity of LLMs, provides a modular structure across code pieces, and elegantly assigns different return types to sub-problems. RVP approaches VQA tasks with a top-down recursive code generation strategy, decomposing complicated problems into smaller, more manageable parts. We show RVP's efficacy through extensive experiments on benchmarks including VSR, COVR, GQA, and NextQA, underscoring the value of adopting human-like recursive and modular programming techniques for solving VQA tasks. Our code is available at https://github.com/para-lost/RVP.
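To make the top-down decomposition described above concrete, the sketch below shows the kind of program a recursive approach might generate. It is a minimal illustration, not the released implementation: the entry-point name `execute_command`, the helper `recursive_query`, and the example question are assumptions for illustration, and the model calls are stubbed so the snippet stays self-contained.

```python
# A minimal sketch of recursive, top-down code generation for VQA.
# NOT the authors' released code: `execute_command` follows a ViperGPT-style
# convention and `recursive_query` is a hypothetical helper; LLM/vision
# calls are stubbed.

from typing import Any


def recursive_query(image: Any, question: str) -> Any:
    """Hypothetical recursive primitive: prompt the code LLM with `question`,
    execute the program it generates on `image`, and return the result.
    The return type (str, bool, int, ...) can differ per sub-problem."""
    raise NotImplementedError("stub: would invoke the code LLM and executor here")


# The kind of program a recursive approach might generate for the question
# "Is the animal left of the chair the same color as the sofa?":
# each sub-question is delegated to its own generated program.
def execute_command(image: Any) -> str:
    sofa_color = recursive_query(image, "What color is the sofa?")
    animal_color = recursive_query(
        image, "What color is the animal to the left of the chair?"
    )
    return "yes" if sofa_color == animal_color else "no"
```

By contrast, a single-function approach would have to inline the detection, attribute extraction, and comparison logic for both objects in one monolithic block, which is the limitation the abstract highlights.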
Acknowledgements
We would like to thank Ben Bogin for helping us test on the private COVR test set, and Guangyuan Jiang for helping us proofread the paper and providing valuable feedback.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ge, J., Subramanian, S., Shi, B., Herzig, R., Darrell, T. (2025). Recursive Visual Programming. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15101. Springer, Cham. https://doi.org/10.1007/978-3-031-72775-7_1
DOI: https://doi.org/10.1007/978-3-031-72775-7_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72774-0
Online ISBN: 978-3-031-72775-7
eBook Packages: Computer Science, Computer Science (R0)