DOI: 10.1145/3474085.3475510
research-article

Video Representation Learning with Graph Contrastive Augmentation

Published: 17 October 2021

Abstract

Contrastive self-supervised learning for image representations has significantly narrowed the gap with supervised learning. A natural way to extend image-based contrastive learning methods to the video domain is to fully exploit the temporal structure present in videos. We propose a novel contrastive self-supervised video representation learning framework, termed Graph Contrastive Augmentation (GCA), which constructs a video temporal graph and applies a graph augmentation designed to strengthen the correlation across video frames and to provide a new view for exploring temporal structure in videos. Specifically, we construct the temporal graph of a video by leveraging the relational knowledge in the sequence of correlated video features. We then apply the proposed graph augmentation to generate another graph view by randomly corrupting the original graph, which increases the diversity of the intrinsic structure of the temporal graph. We provide two kinds of contrastive learning methods to train our framework, using the temporal relationships concealed in videos as self-supervised signals. We evaluate the learned video representation on two downstream tasks, action recognition and video retrieval; the results demonstrate that, with the graph view of temporal structure, GCA outperforms or is on par with recent methods.
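The pipeline described in the abstract (temporal graph, corrupted graph view, contrastive objective) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify how edges are defined, how the graph is corrupted, or which two contrastive objectives are used. The cosine-similarity threshold, the random edge dropping, the mean-aggregation readout, the InfoNCE-style loss, and every function name and parameter below are assumptions chosen only for illustration.

```python
import numpy as np


def build_temporal_graph(features, threshold=0.5):
    """Build an adjacency matrix over per-clip features.

    Assumption: two nodes are connected when their cosine similarity
    exceeds `threshold`; the abstract does not state the actual rule.
    """
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = normed @ normed.T
    adj = (sim > threshold).astype(np.float64)
    np.fill_diagonal(adj, 1.0)  # always keep self-loops
    return adj


def corrupt_graph(adj, drop_prob, rng):
    """Create a second graph view by randomly dropping undirected edges.

    Assumption: "random corruption" is modelled here as edge dropping.
    """
    n = adj.shape[0]
    keep = rng.random((n, n)) >= drop_prob
    keep = np.triu(keep, k=1)       # decide each undirected edge once
    keep = keep | keep.T            # mirror the decision to stay symmetric
    np.fill_diagonal(keep, True)    # self-loops are never dropped
    return adj * keep


def graph_embedding(features, adj):
    """One mean-aggregation step over neighbours, then average over nodes."""
    degree = adj.sum(axis=1, keepdims=True)  # >= 1 because of self-loops
    propagated = (adj @ features) / degree
    return propagated.mean(axis=0)


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style loss: row i of `anchors` should match row i of `positives`."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))


def contrast_two_views(videos, drop_prob, rng):
    """Embed each video's original and corrupted graph, then contrast them."""
    originals, augmented = [], []
    for feats in videos:
        adj = build_temporal_graph(feats)
        originals.append(graph_embedding(feats, adj))
        augmented.append(graph_embedding(feats, corrupt_graph(adj, drop_prob, rng)))
    return info_nce(np.stack(originals), np.stack(augmented))
```

In this sketch, `videos` would be a list of arrays of shape `(num_clips, feature_dim)`, for example features produced by a video backbone; the original and corrupted graphs of the same video form the positive pair, and the other videos in the batch serve as negatives.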

Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN: 9781450386517
DOI: 10.1145/3474085

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. contrastive learning
  2. graph augmentation
  3. self-supervised learning
  4. video representation learning

Conference

MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
