DOI: 10.1145/3664647.3681547

Modeling Event-level Causal Representation for Video Classification

Published: 28 October 2024

Abstract

Classifying videos differs from classifying images in the need to capture what has happened, rather than what is present in the frames. Conventional methods typically follow a data-driven approach, using transformer-based attention models to extract and aggregate frame features as the representation of the entire video. However, this approach tends to capture object information within individual frames and may struggle with classes that describe events, such as "fixing bicycle". To address this issue, this paper presents an Event-level Causal Representation Learning (ECRL) model for the spatio-temporal modeling of both in-frame object interactions and their cross-frame temporal correlations. Specifically, ECRL first employs a Frame-to-Video Causal Modeling (F2VCM) module, which simultaneously builds the in-frame causal graph from background and foreground information and models their cross-frame correlations to construct a video-level causal graph. Subsequently, a Causality-aware Event-level Representation Inference (CERI) module is introduced to eliminate spurious correlations in contexts and objects via back-door and front-door interventions, respectively. The former performs visual context de-biasing to filter out background confounders, while the latter employs global-local causal attention to capture event-level visual information. Experimental results on two benchmark datasets verify that ECRL better captures cross-frame correlations and describes videos with event-level features. The source code has been released at https://github.com/wyqcrystal/ECRL.
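To make the two causal operations in the abstract concrete, the sketch below is a minimal, illustrative take on (1) back-door de-biasing of per-frame features against a dictionary of background/context confounders and (2) a front-door-style global-local attention that pools frames into an event-level video representation. It is not the authors' released implementation: the module names, dimensions, and the learned confounder dictionary are assumptions made for this sketch; consult the linked repository for the actual ECRL code.

```python
# Illustrative sketch only (assumed design, not the ECRL release).
import torch
import torch.nn as nn
import torch.nn.functional as F


class BackDoorDebias(nn.Module):
    """Approximate P(y | do(x)) = sum_z P(y | x, z) P(z) with a learned
    dictionary of K context (background) prototypes z_k, combined via
    attention weights as a stand-in for the expectation over z."""

    def __init__(self, dim: int, num_confounders: int = 32):
        super().__init__()
        self.confounders = nn.Parameter(torch.randn(num_confounders, dim) * 0.02)
        self.query = nn.Linear(dim, dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, D) per-frame features from any visual backbone
        q = self.query(frames)                                    # (B, T, D)
        attn = torch.einsum("btd,kd->btk", q, self.confounders)   # similarity to each z_k
        attn = F.softmax(attn / frames.size(-1) ** 0.5, dim=-1)
        context = torch.einsum("btk,kd->btd", attn, self.confounders)
        # Fuse each frame feature with its expected context, i.e. de-bias it.
        return self.fuse(torch.cat([frames, context], dim=-1))


class GlobalLocalCausalAttention(nn.Module):
    """Front-door-style aggregation: local (per-frame) features attend to a
    global video token, and the result is pooled into an event-level feature."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, D) -> event-level representation (B, D)
        global_token = frames.mean(dim=1, keepdim=True)           # (B, 1, D)
        mediator, _ = self.attn(query=frames, key=global_token, value=global_token)
        fused = self.norm(frames + mediator)                      # local + global evidence
        return fused.mean(dim=1)


if __name__ == "__main__":
    B, T, D = 2, 16, 256
    frames = torch.randn(B, T, D)                       # stand-in backbone features
    debiased = BackDoorDebias(D)(frames)                # back-door intervention on contexts
    video_feat = GlobalLocalCausalAttention(D)(debiased)  # event-level pooling
    print(video_feat.shape)                             # torch.Size([2, 256])
```

In this reading, the confounder dictionary plays the role of the background contexts filtered out by the back-door intervention, while the global token acts as the mediator queried by the front-door-style attention; the real F2VCM and CERI modules operate on an explicit video-level causal graph rather than these simplified stand-ins.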



Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024
11719 pages
ISBN: 9798400706868
DOI: 10.1145/3664647

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. causal intervention
    2. event-level modeling
    3. video classification

    Qualifiers

    • Research-article

    Funding Sources

    • The Oversea Innovation Team Project of the 20 Regulations for New Universities funding program of Jinan
    • Shandong Province Excellent Young Scientists Fund Program (Overseas)

    Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne VIC, Australia

    Acceptance Rates

MM '24 Paper Acceptance Rate: 1,150 of 4,385 submissions, 26%
Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%

