Abstract:
Self-supervised space-time correspondence learning from unlabeled videos holds great potential in computer vision. Most existing methods rely on contrastive learning with negative-sample mining or on reconstruction adapted from the image domain, which requires dense affinity across multiple frames or optical flow constraints. Moreover, video correspondence prediction models need to uncover more inherent properties of the video, such as structural information. In this work, we propose HiGraph+, a sophisticated space-time correspondence framework based on learnable graph kernels. Treating a video as a spatio-temporal graph, HiGraph+ formulates its learning objective in a self-supervised manner: predicting the unobserved hidden graph via graph kernel methods. First, we learn the structural consistency of sub-graphs in graph-level correspondence learning. Furthermore, we introduce a spatio-temporal hidden graph loss through contrastive learning that facilitates learning the temporal coherence of sub-graphs across frames and spatial diversity within the same frame. As a result, we can predict long-term correspondences and drive the hidden graph to acquire distinct local structural representations. Then, we learn a refined representation across frames at the node level via a dense graph kernel. The structural and temporal consistency of the graph provides the self-supervision for model training. HiGraph+ achieves excellent performance and demonstrates robustness in benchmark tests involving object, semantic part, keypoint, and instance label propagation tasks. Our implementation is publicly available at https://github.com/zyqin19/HiGraph.
Published in: IEEE Transactions on Image Processing (Volume: 32)