Abstract
As a prominent research area, visual reasoning plays a crucial role in AI by facilitating concept formation and interaction with the world. However, existing work is usually carried out separately on small datasets and thus lacks generalization ability. Through rigorous evaluation of diverse benchmarks, we demonstrate the shortcomings of existing ad-hoc methods in achieving cross-domain reasoning and their tendency to fit data biases. In this paper, we revisit visual reasoning from a two-stage perspective: (1) symbolization and (2) logical reasoning given symbols or their representations. We find that the reasoning stage generalizes better than the symbolization stage. It is therefore more efficient to implement symbolization via separate encoders for different data domains while using a shared reasoner. Given these findings, we establish design principles for visual reasoning frameworks that follow this separated-symbolization, shared-reasoning paradigm. The proposed two-stage framework achieves impressive generalization on a variety of visual reasoning tasks, including puzzles, physical prediction, and visual question answering (VQA), across both 2D and 3D modalities. We believe our insights will pave the way for generalizable visual reasoning. Our code is publicly available at https://mybearyzhang.github.io/projects/TwoStageReason.
M. Zhang and J. Cai—Equal contribution.
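To make the separated-symbolization, shared-reasoning idea concrete, below is a minimal PyTorch sketch. It is an illustration only: the toy MLP encoders, the Transformer reasoner, all dimensions, and the task head are our assumptions, not the authors' released architecture.

# Minimal sketch of the two-stage design from the abstract: one symbolization
# encoder per data domain, one reasoner shared across all domains.
# All module choices and sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class TwoStageReasoner(nn.Module):
    def __init__(self, symbol_dim=256, num_domains=3, num_answers=10):
        super().__init__()
        # Stage 1: separate symbolization encoders, e.g. for puzzles,
        # physics videos, and VQA images; each maps raw per-token features
        # to symbol representations.
        self.encoders = nn.ModuleDict({
            f"domain_{i}": nn.Sequential(
                nn.LazyLinear(symbol_dim),
                nn.ReLU(),
                nn.Linear(symbol_dim, symbol_dim),
            )
            for i in range(num_domains)
        })
        # Stage 2: a single reasoner shared by every domain.
        layer = nn.TransformerEncoderLayer(
            d_model=symbol_dim, nhead=8, batch_first=True
        )
        self.reasoner = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(symbol_dim, num_answers)

    def forward(self, x, domain):
        symbols = self.encoders[domain](x)      # (batch, tokens, symbol_dim)
        reasoned = self.reasoner(symbols)       # shared logical reasoning
        return self.head(reasoned.mean(dim=1))  # pooled task prediction

# Usage: the same reasoner serves every domain; only the encoder is swapped.
model = TwoStageReasoner()
puzzle_batch = torch.randn(4, 9, 128)  # e.g. 9 puzzle panels as features
logits = model(puzzle_batch, domain="domain_0")

Under this scheme, supporting a new data domain would only require training a new encoder, while the reasoner's weights are reused, which is the efficiency argument the abstract makes.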
Acknowledgments
This work is supported in part by the National Natural Science Foundation of China under Grant No. 62306175.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, M., Cai, J., Liu, M., Xu, Y., Lu, C., Li, Y.L. (2025). Take a Step Back: Rethinking the Two Stages in Visual Reasoning. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. LNCS, vol. 15101. Springer, Cham. https://doi.org/10.1007/978-3-031-72775-7_8
DOI: https://doi.org/10.1007/978-3-031-72775-7_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72774-0
Online ISBN: 978-3-031-72775-7
eBook Packages: Computer Science; Computer Science (R0)