DOI: 10.1145/3664647.3681246

Hierarchical Perceptual and Predictive Analogy-Inference Network for Abstract Visual Reasoning

Published: 28 October 2024

Abstract

Advances in computer vision research enable human-like high-dimensional perceptual induction over analogical visual reasoning problems, such as Raven's Progressive Matrices (RPMs). In this paper, we propose a Hierarchical Perception and Predictive Analogy-Inference network (HP^2AI), consisting of three major components that tackle key challenges of RPM problems. First, because the shallow networks in most existing RPM solvers have limited receptive fields, we propose a perceptual encoder built from a series of hierarchically coupled Patch Attention and Local Context (PALC) blocks, which capture local attributes at early stages and the global panel layout at deep stages. Second, most methods seek object-level similarities to map the context images directly to the answer image, and thus fail to extract the underlying analogies. The proposed reasoning module, Predictive Analogy-Inference (PredAI), consists of a set of Analogy-Inference Blocks (AIBs) that model and exploit the inherent analogical reasoning rules instead of object similarity. Lastly, the Squeeze-and-Excitation Channel-wise Attention (SECA) in PredAI discriminates essential attributes and analogies from irrelevant ones. Extensive experiments on four benchmark RPM datasets show that HP^2AI consistently achieves significant performance gains over all state-of-the-art methods on all four datasets.
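The abstract does not specify how SECA is wired inside PredAI. As a rough illustration of the general idea it names, the sketch below implements a generic squeeze-and-excitation channel gate in plain Python: each channel is averaged to a scalar, a small two-layer bottleneck turns those scalars into gates in (0, 1), and every channel is rescaled by its gate. All names here (`squeeze`, `excite`, `se_channel_attention`, the reduction ratio `r`) and the random weights are illustrative assumptions, not the authors' implementation.

```python
import math
import random

def squeeze(feature_map):
    """Global average pooling: reduce each H x W channel to one scalar."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in feature_map]

def dense(vec, weights, bias):
    """Fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def excite(z, w1, b1, w2, b2):
    """Bottleneck MLP: ReLU on the hidden layer, sigmoid on the output gates."""
    hidden = [max(0.0, h) for h in dense(z, w1, b1)]
    return [1.0 / (1.0 + math.exp(-g)) for g in dense(hidden, w2, b2)]

def se_channel_attention(feature_map, w1, b1, w2, b2):
    """Rescale every channel by its learned gate; returns (scaled map, gates)."""
    gates = excite(squeeze(feature_map), w1, b1, w2, b2)
    scaled = [[[g * x for x in row] for row in ch]
              for ch, g in zip(feature_map, gates)]
    return scaled, gates

def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for trained weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Toy input: C channels of size H x W, reduction ratio r (all assumed values).
rng = random.Random(0)
C, H, W, r = 4, 3, 3, 2
x = [[[rng.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
w1, b1 = random_matrix(C // r, C, rng), [0.0] * (C // r)
w2, b2 = random_matrix(C, C // r, rng), [0.0] * C

y, gates = se_channel_attention(x, w1, b1, w2, b2)
print([round(g, 3) for g in gates])
```

In a trained network the gates for informative channels would move toward 1 and those for irrelevant channels toward 0, which is one plausible reading of how channel-wise attention could separate essential attributes from irrelevant ones.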

Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024
11719 pages
ISBN:9798400706868
DOI:10.1145/3664647

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. analogical visual reasoning
  2. intelligence quotient test
  3. predicting-and-verifying
  4. raven's progressive matrix
  5. transformer

Qualifiers

  • Research-article

Conference

MM '24
MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne VIC, Australia

Acceptance Rates

MM '24 Paper Acceptance Rate 1,150 of 4,385 submissions, 26%;
Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
