
Exploring Motion Cues for Video Test-Time Adaptation

Published: 27 October 2023

Abstract

Test-time adaptation (TTA) aims to boost the generalization capability of a trained model by conducting self-/un-supervised learning during testing in real-world applications. Though TTA on image-based tasks has seen significant progress, TTA techniques for video remain scarce. Naively applying image-based TTA methods to video tasks achieves limited performance, since these methods ignore the special nature of video, e.g., motion information. In this paper, we propose leveraging motion cues in videos to design a new test-time learning scheme for video classification. We extract spatial appearance and dynamic motion clip features at two sampling rates (i.e., slow and fast) and propose a fast-to-slow unidirectional alignment scheme that aligns fast motion features to slow appearance features, thereby enhancing the model's motion-encoding ability. Additionally, we propose a slow-fast dual contrastive learning strategy that learns a joint feature space for fast- and slow-sampled clips, guiding the model to extract discriminative video features. Lastly, we introduce a stochastic pseudo-negative sampling scheme that provides better adaptation supervision by selecting a pseudo-negative label, which is more reliable than the pseudo-positive label used in prior TTA methods. This reduces the adaptation difficulty caused by the model's poor performance on out-of-distribution test data before adaptation. Our approach significantly improves performance on various video classification backbones, as demonstrated through extensive experiments on two benchmark datasets.
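To make the pseudo-negative idea concrete, here is a minimal, hypothetical NumPy sketch; the function name, the candidate-pool size `k`, and the "least-likely half" heuristic are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def pseudo_negative_loss(logits, k=None, rng=None):
    """Sample a pseudo-negative class among the model's least-confident
    predictions and penalize its probability: L = -log(1 - p_neg).

    On out-of-distribution test clips the argmax pseudo-label is often
    wrong, but a class drawn from the low-probability tail is almost
    always a true negative, so this supervision is more reliable.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    logits = np.asarray(logits, dtype=float)
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    if k is None:
        k = max(1, len(probs) // 2)
    candidates = np.argsort(probs)[:k]      # indices of the k least-likely classes
    neg = rng.choice(candidates)            # stochastic pseudo-negative label
    return -np.log(1.0 - probs[neg])

# Uniform predictions over 10 classes: every class has p = 0.1,
# so the loss is -log(0.9) no matter which negative is drawn.
loss = pseudo_negative_loss(np.zeros(10))
```

Minimizing this loss pushes down the probability of a class the model already considers unlikely, which is a much safer gradient signal than reinforcing a possibly wrong top-1 pseudo-label.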


Cited By

  • Test-time model adaptation with only forward passes. Proceedings of the 41st International Conference on Machine Learning (2024), 38298--38315. DOI: 10.5555/3692070.3693623
  • Video Unsupervised Domain Adaptation with Deep Learning: A Comprehensive Survey. ACM Computing Surveys 56, 12 (2024), 1--36. DOI: 10.1145/3679010
  • A Comprehensive Survey on Test-Time Adaptation Under Distribution Shifts. International Journal of Computer Vision 133, 1 (2024), 31--64. DOI: 10.1007/s11263-024-02181-w


Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN:9798400701085
DOI:10.1145/3581783

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. motion encoding
  2. test-time adaptation
  3. video classification

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China
  • Excellent Science and Technology Creative Talent Training Program of Shenzhen Municipality
  • Shenzhen Natural Science Foundation (the Stable Support Plan Program)
  • Guangdong Basic and Applied Basic Research Foundation

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

