
Exploring Motion Cues for Video Test-Time Adaptation

Published: 27 October 2023

Abstract

Test-time adaptation (TTA) aims to boost the generalization capability of a trained model by conducting self-/un-supervised learning during testing in real-world applications. Though TTA on image-based tasks has seen significant progress, TTA techniques for video remain scarce. Naively applying image-based TTA methods to video tasks achieves limited performance, since these methods ignore the special nature of video, e.g., motion information. In this paper, we propose leveraging motion cues in videos to design a new test-time learning scheme for video classification. We extract spatial appearance and dynamic motion clip features at two sampling rates (i.e., slow and fast) and propose a fast-to-slow unidirectional alignment scheme that aligns fast motion features to slow appearance features, thereby enhancing the model's motion-encoding ability. Additionally, we propose a slow-fast dual contrastive learning strategy that learns a joint feature space for fast- and slow-sampled clips, guiding the model to extract discriminative video features. Lastly, we introduce a stochastic pseudo-negative sampling scheme that provides better adaptation supervision by selecting a pseudo-negative label, which is more reliable than the pseudo-positive label used in prior TTA methods. This reduces the adaptation difficulty caused by the model's poor performance on out-of-distribution test data before adaptation. Our approach significantly improves performance on various video classification backbones, as demonstrated through extensive experiments on two benchmark datasets.
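To make the pseudo-negative idea concrete, here is a minimal, hypothetical NumPy sketch; the function name, the candidate-pool size `k`, and the "least-likely half" heuristic are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def pseudo_negative_loss(logits, k=None, rng=None):
    """Sample a pseudo-negative class among the model's least-confident
    predictions and penalize its probability: L = -log(1 - p_neg).

    On out-of-distribution test clips the argmax pseudo-label is often
    wrong, but a class drawn from the low-probability tail is almost
    always a true negative, so this supervision is more reliable.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    logits = np.asarray(logits, dtype=float)
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    if k is None:
        k = max(1, len(probs) // 2)
    candidates = np.argsort(probs)[:k]      # indices of the k least-likely classes
    neg = rng.choice(candidates)            # stochastic pseudo-negative label
    return -np.log(1.0 - probs[neg])

# Uniform predictions over 10 classes: every class has p = 0.1,
# so the loss is -log(0.9) no matter which negative is drawn.
loss = pseudo_negative_loss(np.zeros(10))
```

Minimizing this loss pushes down the probability of a class the model already considers unlikely, which is a much safer gradient signal than reinforcing a possibly wrong top-1 pseudo-label.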


Cited By

  • Test-time model adaptation with only forward passes. Proceedings of the 41st International Conference on Machine Learning (2024), 38298--38315. DOI: 10.5555/3692070.3693623
  • Video Unsupervised Domain Adaptation with Deep Learning: A Comprehensive Survey. ACM Computing Surveys 56, 12 (2024), 1--36. DOI: 10.1145/3679010
  • A Comprehensive Survey on Test-Time Adaptation Under Distribution Shifts. International Journal of Computer Vision 133, 1 (2024), 31--64. DOI: 10.1007/s11263-024-02181-w


Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN:9798400701085
DOI:10.1145/3581783

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. motion encoding
  2. test-time adaptation
  3. video classification

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China
  • Excellent Science and Technology Creative Talent Training Program of Shenzhen Municipality
  • Shenzhen Natural Science Foundation (the Stable Support Plan Program)
  • Guangdong Basic and Applied Basic Research Foundation

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

