research-article

Hierarchical Perceptual and Predictive Analogy-Inference Network for Abstract Visual Reasoning

Authors:

Xudong Jiang

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia

Pages 4841 - 4850

https://doi.org/10.1145/3664647.3681246

Published: 28 October 2024

Abstract

Advances in computer vision research enable human-like high-dimensional perceptual induction over analogical visual reasoning problems, such as Raven's Progressive Matrices (RPMs). In this paper, we propose a Hierarchical Perception and Predictive Analogy-Inference network (HP^2AI), consisting of three major components that tackle key challenges of RPM problems. First, because the shallow networks used in most existing RPM solvers have limited receptive fields, we propose a perceptual encoder built from a series of hierarchically coupled Patch Attention and Local Context (PALC) blocks, which capture local attributes at early stages and the global panel layout at deep stages. Second, most methods seek object-level similarities to map the context images directly to the answer image, and thus fail to extract the underlying analogies. The proposed reasoning module, Predictive Analogy-Inference (PredAI), consists of a set of Analogy-Inference Blocks (AIBs) that model and exploit the inherent analogical reasoning rules instead of object similarity. Lastly, the Squeeze-and-Excitation Channel-wise Attention (SECA) in PredAI discriminates essential attributes and analogies from irrelevant ones. Extensive experiments on four benchmark RPM datasets show that HP^2AI consistently achieves significant performance gains over state-of-the-art methods on all four datasets.
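The abstract does not specify how SECA is implemented. As a rough illustration of the channel-wise gating idea it names, the sketch below follows the generic Squeeze-and-Excitation formulation (global average pooling, a two-layer bottleneck, and a sigmoid gate per channel). All function names, shapes, the reduction ratio, and the random weights are assumptions made for this example; they are not taken from HP^2AI.

```python
import math
import random


def squeeze(features):
    """Global average pooling: one scalar per channel of a C x H x W map."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in features
    ]


def linear(vec, weights, bias):
    """Dense layer; `weights` has shape out_dim x in_dim."""
    return [
        sum(w * v for w, v in zip(row, vec)) + b
        for row, b in zip(weights, bias)
    ]


def se_channel_attention(features, w1, b1, w2, b2):
    """Rescale each channel by a sigmoid gate computed from all channels."""
    z = squeeze(features)                                  # squeeze: C values
    hidden = [max(0.0, h) for h in linear(z, w1, b1)]      # ReLU bottleneck
    gates = [1.0 / (1.0 + math.exp(-g)) for g in linear(hidden, w2, b2)]
    scaled = [
        [[gate * x for x in row] for row in channel]       # excite: reweight
        for channel, gate in zip(features, gates)
    ]
    return scaled, gates


# Toy input (hypothetical sizes): 4 channels of 3 x 3 features,
# bottleneck of 2 units, i.e. a reduction ratio of 2.
random.seed(0)
C, H, W, R = 4, 3, 3, 2
features = [[[random.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
w1 = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(C // R)]
b1 = [0.0] * (C // R)
w2 = [[random.uniform(-1, 1) for _ in range(C // R)] for _ in range(C)]
b2 = [0.0] * C

scaled, gates = se_channel_attention(features, w1, b1, w2, b2)
print([round(g, 3) for g in gates])  # one gate per channel
```

Channels whose gate is close to 0 are suppressed and channels whose gate is close to 1 pass through almost unchanged, which is one plausible way to separate essential attributes from irrelevant ones. In a trained network the weights would be learned rather than sampled at random.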

Cited By

He W, Ren J, Bai R, and Jiang X. 2025. Two-stage Rule-induction visual reasoning on RPMs with an application to video prediction. Pattern Recognition, Vol. 160 (Apr. 2025), 111151.
https://doi.org/10.1016/j.patcog.2024.111151

Index Terms

Hierarchical Perceptual and Predictive Analogy-Inference Network for Abstract Visual Reasoning
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision representations
        Image representations
      2. Computer vision tasks
        Scene understanding
    2. Philosophical/theoretical foundations of artificial intelligence
      1. Cognitive science

Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia

October 2024

11719 pages

ISBN: 9798400706868

DOI: 10.1145/3664647

General Chairs:
Jianfei Cai (Monash University, Australia)
Mohan Kankanhalli (NUS, Singapore)
Balakrishnan Prabhakaran (UT Dallas, USA)
Susanne Boll (University of Oldenburg, Germany)

Program Chairs:
Ramanathan Subramanian (University of Canberra & IIT Ropar, Australia)
Liang Zheng (Australian National University, Australia)
Vivek K. Singh (Rutgers University, USA)
Pablo Cesar (Centrum Wiskunde & Informatica, Netherlands)
Lexing Xie (Australian National University, Australia)
Dong Xu (University of Hong Kong, Hong Kong)

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

National Natural Science Foundation of China
Ningbo Municipal Bureau of Science and Technology

Conference

MM '24

MM '24: The 32nd ACM International Conference on Multimedia

October 28 - November 1, 2024

Melbourne VIC, Australia

Acceptance Rates

MM '24 Paper Acceptance Rate 1,150 of 4,385 submissions, 26%;

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
