Abstract
Few-shot video object segmentation is a challenging task that aims to segment objects of novel categories in a video given only a few annotated images. Current methods for this task explore only the relationship between the support images and the target query video, ignoring the rich temporal information in the query video itself. To address this problem, we propose a simple yet effective framework named prototype evolution network (PENet) for few-shot video object segmentation. PENet first adopts a prototype-based structure that efficiently constructs and exploits the correlation between the support images and the target query video. A prototype evolution module is then designed to summarize and propagate temporal information through the evolution of the video prototype. The feature representation maintained by this module has a fixed size, so its memory cost does not grow as the video progresses. Together with the category prototype extracted from the support set, the global video prototype provides guidance for segmenting the current frame. Additionally, we introduce the use of high-level features as an optional component that trades a small amount of speed for higher accuracy. Experimental results on the YouTube-VIS 2019 and 2021 datasets demonstrate that our PENet outperforms previous methods by a sizable margin, validating the superiority of the proposed model.
Data availability
The YouTube-VIS 2019 and YouTube-VIS 2021 datasets analyzed during the current study are available at https://youtube-vos.org/dataset/vis/.
Acknowledgements
This research was supported by the National Key Research and Development Program of China under Grant No. 2018AAA0100400, and the National Natural Science Foundation of China under Grants 62071466, 62076242, and 61976208.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Mao, B., Liu, X., Shi, L. et al. Few-shot video object segmentation with prototype evolution. Neural Comput & Applic 36, 5367–5382 (2024). https://doi.org/10.1007/s00521-023-09325-y