Fast target-aware learning for few-shot video object segmentation

  • Research Paper
Science China Information Sciences

Abstract

Few-shot video object segmentation (FSVOS) aims to segment a specific object throughout a video sequence when only the first-frame annotation is given. In this study, we develop a fast target-aware learning approach for FSVOS that adapts to each new video sequence from its first-frame annotation through a lightweight procedure. The proposed network comprises two models. First, the meta knowledge model learns general semantic features of the input video frames and up-samples the coarse predicted mask to the original image size. Second, the target model adapts quickly from the limited support set. Concretely, during online inference on a test video, we first employ fast optimization techniques to train a powerful target model by minimizing the segmentation error on the first frame and then use it to predict the masks of the subsequent frames. During offline training, we use a bilevel-optimization strategy that mimics the full testing procedure to train the meta knowledge model across multiple video sequences. The proposed method is trained only on an individual public video object segmentation (VOS) benchmark, without additional training sets, and compares favorably with state-of-the-art methods on DAVIS-2017, with an overall \(\mathcal{J}\&\mathcal{F}\) score of 71.6%, and on YouTubeVOS-2018, with an overall \(\mathcal{J}\&\mathcal{F}\) score of 75.4%. Meanwhile, a high inference speed of approximately 0.13 s per frame is maintained.
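
The online inference step described in the abstract (fit a small target model on the annotated first frame, then reuse it on later frames) can be illustrated with a toy sketch. This is not the authors' implementation: the per-pixel logistic classifier, the plain gradient-descent loop (a stand-in for the paper's unspecified fast optimization techniques), the two-dimensional features, and the names `fit_target_model` and `predict_mask` are all assumptions made for illustration. The feature vectors stand in for the output of a frozen meta knowledge model.

```python
import math


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_target_model(features, mask, steps=200, lr=0.5):
    """Fit a per-pixel logistic classifier on the annotated first frame.

    features: list of per-pixel feature vectors (hypothetical output of a
              frozen feature extractor); mask: list of 0/1 labels.
    Plain gradient descent on the mean cross-entropy is used here only as
    a simple stand-in for a faster optimizer.
    """
    dim = len(features[0])
    n = len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, mask):
            # Derivative of the cross-entropy w.r.t. the logit.
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i] / n
            grad_b += err / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict_mask(model, features, threshold=0.5):
    # Apply the fitted target model to the pixels of another frame.
    w, b = model
    return [
        int(sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) > threshold)
        for x in features
    ]


# Toy data: four "pixels" in the first frame, two in a later frame.
first_frame_feats = [[1.0, 0.1], [0.9, 0.2], [0.1, 0.9], [0.2, 1.0]]
first_frame_mask = [1, 1, 0, 0]
next_frame_feats = [[0.8, 0.3], [0.15, 0.85]]

target_model = fit_target_model(first_frame_feats, first_frame_mask)
predicted = predict_mask(target_model, next_frame_feats)
print(predicted)
```

In this sketch only the small target model is optimized at test time while the features stay fixed. The offline bilevel training of the feature extractor, which the abstract says mimics this procedure across many videos, is not shown.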

Acknowledgements

This work was partially supported by the National Natural Science Foundation of China (Grant Nos. 62072449, 61802197), the Science and Technology Development Fund, Macao SAR (Grant Nos. 0018/2019/AKP, SKL-IOTSC(UM)-2021-2023), the Guangdong Science and Technology Department (Grant No. 2018B030324002), and the Zhuhai Science and Technology Innovation Bureau Zhuhai-Hong Kong-Macau Special Cooperation Project (Grant No. ZH22017002200001PWC).

Author information

Corresponding author

Correspondence to Zhi-Xin Yang.

Additional information

Supporting information

Appendixes A–C. The supporting information is available online at www.info.scichina.com and www.link.springer.com. The supporting materials are published as submitted, without typesetting or editing. The responsibility for scientific accuracy and content remains entirely with the authors.

About this article

Cite this article

Chen, Y., Hao, C., Yang, ZX. et al. Fast target-aware learning for few-shot video object segmentation. Sci. China Inf. Sci. 65, 182104 (2022). https://doi.org/10.1007/s11432-021-3396-7

  • DOI: https://doi.org/10.1007/s11432-021-3396-7
