
Take a Step Back: Rethinking the Two Stages in Visual Reasoning

Conference paper in Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

As a prominent research area, visual reasoning plays a crucial role in AI by facilitating concept formation and interaction with the world. However, current works are usually carried out separately on small datasets and thus lack generalization ability. Through rigorous evaluation of diverse benchmarks, we demonstrate the shortcomings of existing ad-hoc methods in achieving cross-domain reasoning and their tendency to fit data biases. In this paper, we revisit visual reasoning from a two-stage perspective: (1) symbolization and (2) logical reasoning given symbols or their representations. We find that the reasoning stage generalizes better than the symbolization stage. Thus, it is more efficient to implement symbolization via separate encoders for different data domains while using a shared reasoner. Given our findings, we establish design principles for visual reasoning frameworks following this separated-symbolization, shared-reasoning scheme. The proposed two-stage framework achieves impressive generalization ability on various visual reasoning tasks, including puzzles, physical prediction, and visual question answering (VQA), across both 2D and 3D modalities. We believe our insights will pave the way for generalizable visual reasoning. Our code is publicly available at https://mybearyzhang.github.io/projects/TwoStageReason.

M. Zhang and J. Cai—Equal contribution.
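
To make the separated-symbolization, shared-reasoning design concrete, below is a minimal PyTorch sketch of the two-stage idea from the abstract: one symbolization encoder per data domain feeding a single shared reasoner. All module names, dimensions, domain keys, and the choice of an MLP reasoner are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the two-stage design: domain-specific symbolization
# encoders feeding one shared reasoner. All names, sizes, and the MLP
# reasoner are illustrative assumptions, not the paper's actual code.
import torch
import torch.nn as nn

class TwoStageReasoner(nn.Module):
    def __init__(self, symbol_dim=256, num_answers=10):
        super().__init__()
        # Stage 1: one symbolization encoder per data domain (separated).
        self.encoders = nn.ModuleDict({
            "puzzle": nn.Sequential(nn.Flatten(), nn.LazyLinear(symbol_dim)),
            "physics": nn.Sequential(nn.Flatten(), nn.LazyLinear(symbol_dim)),
            "vqa": nn.Sequential(nn.Flatten(), nn.LazyLinear(symbol_dim)),
        })
        # Stage 2: a single reasoner whose weights are shared by all domains.
        self.reasoner = nn.Sequential(
            nn.Linear(symbol_dim, symbol_dim),
            nn.ReLU(),
            nn.Linear(symbol_dim, num_answers),
        )

    def forward(self, x, domain):
        symbols = self.encoders[domain](x)   # domain-specific symbolization
        return self.reasoner(symbols)        # shared logical reasoning

# Usage: the same reasoner weights serve every domain.
model = TwoStageReasoner()
logits = model(torch.randn(4, 3, 32, 32), domain="puzzle")
```

In this scheme only the encoders are domain-specific; the reasoner is reused across puzzles, physical prediction, and VQA, which is the property the abstract credits for cross-domain generalization.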



Acknowledgments

This work is supported in part by the National Natural Science Foundation of China under Grant No. 62306175.

Author information


Corresponding author

Correspondence to Yong-Lu Li.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 1251 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Zhang, M., Cai, J., Liu, M., Xu, Y., Lu, C., Li, YL. (2025). Take a Step Back: Rethinking the Two Stages in Visual Reasoning. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15101. Springer, Cham. https://doi.org/10.1007/978-3-031-72775-7_8


  • DOI: https://doi.org/10.1007/978-3-031-72775-7_8


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72774-0

  • Online ISBN: 978-3-031-72775-7

  • eBook Packages: Computer Science, Computer Science (R0)
