Abstract
Few-shot video object segmentation (FSVOS) aims to segment a specific object throughout a video sequence when only the first-frame annotation is given. In this study, we develop a fast target-aware learning approach for FSVOS that adapts to each new video sequence from its first-frame annotation through a lightweight procedure. The proposed network comprises two models. First, the meta knowledge model learns general semantic features of the input video frames and up-samples the coarse predicted mask to the original image size. Second, the target model adapts quickly from the limited support set. Concretely, during online inference on a test video, we first employ fast optimization techniques to train a powerful target model by minimizing the segmentation error on the first frame and then use it to predict the subsequent frames. During offline training, we use a bilevel-optimization strategy that mimics the full testing procedure to train the meta knowledge model across multiple video sequences. The proposed method is trained only on an individual public video object segmentation (VOS) benchmark without additional training sets and compares favorably with state-of-the-art methods on DAVIS-2017, with a \({\cal J} \& {\cal F}\) overall score of 71.6%, and on YouTubeVOS-2018, with a \({\cal J} \& {\cal F}\) overall score of 75.4%. Meanwhile, a high inference speed of approximately 0.13 s per frame is maintained.
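The online stage described in the abstract (fit a target model on the first frame, then apply it to later frames) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the target model or the optimizer, so the sketch assumes a per-pixel linear classifier trained by plain gradient descent on a cross-entropy loss. The function names and the synthetic two-dimensional "features" produced by `make_frame` are hypothetical stand-ins for the output of the meta knowledge model. The offline bilevel stage, which would additionally differentiate through this inner loop, is omitted.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features):
    """Per-pixel foreground probability from a linear target model."""
    return [sigmoid(sum(w * x for w, x in zip(weights, f)) + bias)
            for f in features]


def adapt_target_model(features, mask, steps=50, lr=0.5):
    """Fit the target model on the first frame only.

    Assumption: plain gradient descent on the mean cross-entropy loss;
    the abstract only says "fast optimization techniques".
    """
    dim = len(features[0])
    n = len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(steps):
        probs = predict(weights, bias, features)
        errors = [p - y for p, y in zip(probs, mask)]
        grad_w = [sum(e * f[d] for e, f in zip(errors, features)) / n
                  for d in range(dim)]
        grad_b = sum(errors) / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


def make_frame(rng, n_pixels=200):
    """Hypothetical frame: foreground features near (+1, +1), background near (-1, -1)."""
    features, mask = [], []
    for _ in range(n_pixels):
        is_foreground = rng.random() < 0.5
        centre = 1.0 if is_foreground else -1.0
        features.append([centre + rng.gauss(0, 0.3), centre + rng.gauss(0, 0.3)])
        mask.append(1.0 if is_foreground else 0.0)
    return features, mask


def jaccard(pred, target):
    """Region similarity J: intersection over union of two binary masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return intersection / union


rng = random.Random(0)

# Online step 1: adapt on the annotated first frame.
first_features, first_mask = make_frame(rng)
weights, bias = adapt_target_model(first_features, first_mask)

# Online step 2: reuse the adapted model on a subsequent frame.
next_features, next_mask = make_frame(rng)
next_pred = [1.0 if p > 0.5 else 0.0
             for p in predict(weights, bias, next_features)]
j_score = jaccard(next_pred, next_mask)
```

In this toy setting the two classes are well separated, so a handful of gradient steps is enough. With real backbone features the choice of optimizer and the number of steps would trade accuracy against the per-frame inference time.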
Acknowledgements
This work was partially supported by the National Natural Science Foundation of China (Grant Nos. 62072449, 61802197), the Science and Technology Development Fund, Macao SAR (Grant Nos. 0018/2019/AKP, SKL-IOTSC(UM)-2021-2023), the Guangdong Science and Technology Department (Grant No. 2018B030324002), and the Zhuhai Science and Technology Innovation Bureau Zhuhai-Hong Kong-Macau Special Cooperation Project (Grant No. ZH22017002200001PWC).
Supporting information
Appendixes A–C. The supporting information is available online at www.info.scichina.com and www.link.springer.com. The supporting materials are published as submitted, without typesetting or editing. The responsibility for scientific accuracy and content remains entirely with the authors.
Cite this article
Chen, Y., Hao, C., Yang, ZX. et al. Fast target-aware learning for few-shot video object segmentation. Sci. China Inf. Sci. 65, 182104 (2022). https://doi.org/10.1007/s11432-021-3396-7
DOI: https://doi.org/10.1007/s11432-021-3396-7