research-article

Video Representation Learning with Graph Contrastive Augmentation

Authors:

Xiaofeng Zhu

MM '21: Proceedings of the 29th ACM International Conference on Multimedia

Pages 3043 - 3051

https://doi.org/10.1145/3474085.3475510

Published: 17 October 2021

Abstract

Contrastive self-supervised learning for image representations has significantly narrowed the gap with supervised learning. A natural way to extend image-based contrastive methods to the video domain is to fully exploit the temporal structure present in videos. We propose a novel contrastive self-supervised video representation learning framework, termed Graph Contrastive Augmentation (GCA). GCA constructs a video temporal graph and devises a graph augmentation designed to enhance the correlation across video frames, providing a new view for exploring temporal structure in videos. Specifically, we construct the temporal graph of a video by leveraging the relational knowledge behind the correlated sequence of video features. We then apply the proposed graph augmentation, which incorporates random corruption of the original graph, to generate another graph view and thereby increase the diversity of the temporal graph's intrinsic structure. We provide two different contrastive learning methods to train our framework, using the temporal relationships concealed in videos as self-supervised signals. We evaluate the learned video representation on two downstream tasks, action recognition and video retrieval. The results demonstrate that, with the graph view of temporal structure, GCA outperforms or performs on par with recent methods.
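
The abstract names three ingredients: a temporal graph built from correlated video features, a second graph view obtained by random corruption, and a contrastive objective across the two views. The sketch below shows one plausible reading of that pipeline in plain Python; it is not the authors' implementation. The k-nearest-neighbour cosine graph, edge dropping as the form of corruption, the mean-aggregation step standing in for a graph encoder, and the InfoNCE-style loss are all assumptions made for illustration, as are every function name and parameter (`temporal_graph`, `corrupt_graph`, `propagate`, `info_nce`, `k`, `drop_prob`, `temperature`).

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def temporal_graph(features, k=2):
    """Assumed construction: link each clip to its k most similar clips.

    Returns a set of undirected edges (i, j) with i < j.
    """
    n = len(features)
    edges = set()
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: cosine(features[i], features[j]),
            reverse=True,
        )
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges


def corrupt_graph(edges, drop_prob, rng):
    """Assumed corruption: drop each edge independently with probability drop_prob."""
    return {edge for edge in sorted(edges) if rng.random() >= drop_prob}


def propagate(edges, features):
    """Average each node with its neighbours (a stand-in for a graph encoder)."""
    n, dim = len(features), len(features[0])
    neighbours = [[i] for i in range(n)]  # each node keeps a self-loop
    for i, j in edges:
        neighbours[i].append(j)
        neighbours[j].append(i)
    return [
        [sum(features[m][c] for m in neighbours[i]) / len(neighbours[i])
         for c in range(dim)]
        for i in range(n)
    ]


def info_nce(z1, z2, temperature=0.1):
    """InfoNCE-style loss: node i in view 1 is positive with node i in view 2."""
    total = 0.0
    for i, anchor in enumerate(z1):
        logits = [cosine(anchor, other) / temperature for other in z2]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(z1)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical input: 8 clips, each with a 16-dimensional feature.
    clip_features = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(8)]
    graph = temporal_graph(clip_features, k=2)
    view = corrupt_graph(graph, drop_prob=0.3, rng=rng)
    loss = info_nce(propagate(graph, clip_features), propagate(view, clip_features))
    print(f"{len(graph)} edges, {len(view)} after corruption, loss={loss:.4f}")
```

In this sketch the two views share node features and differ only in structure, so the loss rewards representations that stay stable when edges are removed. A learned encoder and the paper's two specific contrastive objectives would replace `propagate` and `info_nce` in any faithful reproduction.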

Index Terms

Video Representation Learning with Graph Contrastive Augmentation
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        Video segmentation
      2. Computer vision tasks
        Activity recognition and understanding

Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia

October 2021

5796 pages

ISBN:9781450386517

DOI:10.1145/3474085

General Chairs:
Heng Tao Shen, University of Electronic Science&Technology of China, China
Yueting Zhuang, Zhejiang University, China
John R. Smith, IBM, USA

Program Chairs:
Yang Yang, University of Electronic Science and Technology of China, China
Pablo Cesar, CWI&TU Delft, The Netherlands
Florian Metze, FACEBOOK, Inc., USA
Balakrishnan Prabhakaran, University of Texas at Dallas, USA

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

National Natural Science Foundation of China
Sichuan Science and Technology Program
Fundamental Research Funds for the Central Universities

Conference

MM '21

Sponsor:

SIGMM

MM '21: ACM Multimedia Conference

October 20 - 24, 2021

Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Citations

Cited By

Lin J, Zheng Y, Chen X, Ren Y, Pu X, He J, Cai J, Kankanhalli M, Prabhakaran B, Boll S, Subramanian R, Zheng L, Singh V, Cesar P, Xie L, Xu D. (2024). Cross-view Contrastive Unification Guides Generative Pretraining for Molecular Property Prediction. Proceedings of the 32nd ACM International Conference on Multimedia, 2108-2116. Online publication date: 28-Oct-2024.
https://dl.acm.org/doi/10.1145/3664647.3681193
Gui J, Chen T, Zhang J, Cao Q, Sun Z, Luo H, Tao D. (2024). A Survey on Self-Supervised Learning: Algorithms, Applications, and Future Trends. IEEE Transactions on Pattern Analysis and Machine Intelligence 46:12, 9052-9071. Online publication date: Dec-2024.
https://doi.org/10.1109/TPAMI.2024.3415112
Qiu S, Cheng X, Lu H, Zhang H, Wan R, Xue X, Pu J. (2024). Subclassified Loss: Rethinking Data Imbalance From Subclass Perspective for Semantic Segmentation. IEEE Transactions on Intelligent Vehicles 9:1, 1547-1558. Online publication date: Jan-2024.
https://doi.org/10.1109/TIV.2023.3325343
Qiu S, Chen J, Li X, Wan R, Xue X, Pu J. (2024). Make a Strong Teacher with Label Assistance: A Novel Knowledge Distillation Approach for Semantic Segmentation. Computer Vision – ECCV 2024, 371-388. Online publication date: 31-Oct-2024.
https://doi.org/10.1007/978-3-031-72907-2_22
Shen X, Zhou Y, Yuan Y, Yang X, Lan L, Zheng Y. (2023). Contrastive Transformer Hashing for Compact Video Representation. IEEE Transactions on Image Processing 32, 5992-6003. Online publication date: 1-Jan-2023.
https://dl.acm.org/doi/10.1109/TIP.2023.3326994
Mo Y, Chen Y, Peng L, Shi X, Zhu X, Magalhães J, del Bimbo A, Satoh S, Sebe N, Alameda-Pineda X, Jin Q, Oria V, Toni L. (2022). Simple Self-supervised Multiplex Graph Representation Learning. Proceedings of the 30th ACM International Conference on Multimedia, 3301-3309. Online publication date: 10-Oct-2022.
https://dl.acm.org/doi/10.1145/3503161.3547949
