DOI: 10.1145/3664647.3681547

Modeling Event-level Causal Representation for Video Classification

Published: 28 October 2024

Abstract

Classifying videos differs from classifying images in the need to capture what has happened, rather than what is present in the frames. Conventional methods typically follow a data-driven approach, using transformer-based attention models to extract and aggregate frame features as the representation of the entire video. However, this approach tends to capture object information within individual frames and may struggle with classes that describe events, such as "fixing bicycle". To address this issue, this paper presents an Event-level Causal Representation Learning (ECRL) model for the spatio-temporal modeling of both in-frame object interactions and their cross-frame temporal correlations. Specifically, ECRL first employs a Frame-to-Video Causal Modeling (F2VCM) module, which simultaneously builds the in-frame causal graph from background and foreground information and models their cross-frame correlations to construct a video-level causal graph. Subsequently, a Causality-aware Event-level Representation Inference (CERI) module is introduced to eliminate spurious correlations in contexts and objects via back-door and front-door interventions, respectively. The former performs visual context de-biasing to filter out background confounders, while the latter employs global-local causal attention to capture event-level visual information. Experimental results on two benchmark datasets verify that ECRL better captures cross-frame correlations and describes videos with event-level features. The source code has been released at https://github.com/wyqcrystal/ECRL.
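To make the two causal operations in the abstract concrete, the sketch below is a minimal, illustrative take on (1) back-door de-biasing of per-frame features against a dictionary of background/context confounders and (2) a front-door-style global-local attention that pools frames into an event-level video representation. It is not the authors' released implementation: the module names, dimensions, and the learned confounder dictionary are assumptions made for this sketch; consult the linked repository for the actual ECRL code.

```python
# Illustrative sketch only (assumed design, not the ECRL release).
import torch
import torch.nn as nn
import torch.nn.functional as F


class BackDoorDebias(nn.Module):
    """Approximate P(y | do(x)) = sum_z P(y | x, z) P(z) with a learned
    dictionary of K context (background) prototypes z_k, combined via
    attention weights as a stand-in for the expectation over z."""

    def __init__(self, dim: int, num_confounders: int = 32):
        super().__init__()
        self.confounders = nn.Parameter(torch.randn(num_confounders, dim) * 0.02)
        self.query = nn.Linear(dim, dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, D) per-frame features from any visual backbone
        q = self.query(frames)                                    # (B, T, D)
        attn = torch.einsum("btd,kd->btk", q, self.confounders)   # similarity to each z_k
        attn = F.softmax(attn / frames.size(-1) ** 0.5, dim=-1)
        context = torch.einsum("btk,kd->btd", attn, self.confounders)
        # Fuse each frame feature with its expected context, i.e. de-bias it.
        return self.fuse(torch.cat([frames, context], dim=-1))


class GlobalLocalCausalAttention(nn.Module):
    """Front-door-style aggregation: local (per-frame) features attend to a
    global video token, and the result is pooled into an event-level feature."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, D) -> event-level representation (B, D)
        global_token = frames.mean(dim=1, keepdim=True)           # (B, 1, D)
        mediator, _ = self.attn(query=frames, key=global_token, value=global_token)
        fused = self.norm(frames + mediator)                      # local + global evidence
        return fused.mean(dim=1)


if __name__ == "__main__":
    B, T, D = 2, 16, 256
    frames = torch.randn(B, T, D)                       # stand-in backbone features
    debiased = BackDoorDebias(D)(frames)                # back-door intervention on contexts
    video_feat = GlobalLocalCausalAttention(D)(debiased)  # event-level pooling
    print(video_feat.shape)                             # torch.Size([2, 256])
```

In this reading, the confounder dictionary plays the role of the background contexts filtered out by the back-door intervention, while the global token acts as the mediator queried by the front-door-style attention; the real F2VCM and CERI modules operate on an explicit video-level causal graph rather than these simplified stand-ins.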



Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024
11719 pages
ISBN: 9798400706868
DOI: 10.1145/3664647

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. causal intervention
    2. event-level modeling
    3. video classification

    Qualifiers

    • Research-article

    Funding Sources

    • The Oversea Innovation Team Project of the 20 Regulations for New Universities funding program of Jinan
    • Shandong Province Excellent Young Scientists Fund Program (Overseas)

    Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne VIC, Australia

    Acceptance Rates

MM '24 Paper Acceptance Rate: 1,150 of 4,385 submissions, 26%
Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%

