Abstract
As a prominent research area, visual reasoning plays a crucial role in AI by facilitating concept formation and interaction with the world. However, existing work is usually carried out separately on small datasets and thus lacks generalization ability. Through rigorous evaluation of diverse benchmarks, we demonstrate the shortcomings of existing ad-hoc methods in achieving cross-domain reasoning and their tendency to fit data biases. In this paper, we revisit visual reasoning from a two-stage perspective: (1) symbolization and (2) logical reasoning given symbols or their representations. We find that the reasoning stage generalizes better than the symbolization stage. It is therefore more efficient to implement symbolization via separate encoders for different data domains while using a shared reasoner. Given these findings, we establish design principles for visual reasoning frameworks that follow this separated-symbolization, shared-reasoning paradigm. The proposed two-stage framework achieves impressive generalization on a variety of visual reasoning tasks, including puzzles, physical prediction, and visual question answering (VQA), across both 2D and 3D modalities. We believe our insights will pave the way for generalizable visual reasoning. Our code is publicly available at https://mybearyzhang.github.io/projects/TwoStageReason.
M. Zhang and J. Cai—Equal contribution.
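To make the separated-symbolization, shared-reasoning idea concrete, below is a minimal PyTorch sketch. It is an illustration only: the toy MLP encoders, the Transformer reasoner, all dimensions, and the task head are our assumptions, not the authors' released architecture.

# Minimal sketch of the two-stage design from the abstract: one symbolization
# encoder per data domain, one reasoner shared across all domains.
# All module choices and sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class TwoStageReasoner(nn.Module):
    def __init__(self, symbol_dim=256, num_domains=3, num_answers=10):
        super().__init__()
        # Stage 1: separate symbolization encoders, e.g. for puzzles,
        # physics videos, and VQA images; each maps raw per-token features
        # to symbol representations.
        self.encoders = nn.ModuleDict({
            f"domain_{i}": nn.Sequential(
                nn.LazyLinear(symbol_dim),
                nn.ReLU(),
                nn.Linear(symbol_dim, symbol_dim),
            )
            for i in range(num_domains)
        })
        # Stage 2: a single reasoner shared by every domain.
        layer = nn.TransformerEncoderLayer(
            d_model=symbol_dim, nhead=8, batch_first=True
        )
        self.reasoner = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(symbol_dim, num_answers)

    def forward(self, x, domain):
        symbols = self.encoders[domain](x)      # (batch, tokens, symbol_dim)
        reasoned = self.reasoner(symbols)       # shared logical reasoning
        return self.head(reasoned.mean(dim=1))  # pooled task prediction

# Usage: the same reasoner serves every domain; only the encoder is swapped.
model = TwoStageReasoner()
puzzle_batch = torch.randn(4, 9, 128)  # e.g. 9 puzzle panels as features
logits = model(puzzle_batch, domain="domain_0")

Under this scheme, supporting a new data domain would only require training a new encoder, while the reasoner's weights are reused, which is the efficiency argument the abstract makes.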
Acknowledgments
This work is supported in part by the National Natural Science Foundation of China under Grant No. 62306175.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, M., Cai, J., Liu, M., Xu, Y., Lu, C., Li, Y.L. (2025). Take a Step Back: Rethinking the Two Stages in Visual Reasoning. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. LNCS, vol. 15101. Springer, Cham. https://doi.org/10.1007/978-3-031-72775-7_8
DOI: https://doi.org/10.1007/978-3-031-72775-7_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72774-0
Online ISBN: 978-3-031-72775-7
eBook Packages: Computer Science; Computer Science (R0)