Self-supervised Video Object Segmentation Using Motion Feature Compensation

Zhang, Tianqi; Li, Bo

doi:10.1007/978-3-031-44195-0_41

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14260)

Included in the following conference series:

International Conference on Artificial Neural Networks

Abstract

Video object segmentation is a popular research area in computer vision. Traditional models are trained on annotated data, which is time-consuming and expensive to produce. Training models in an unsupervised manner has been proposed as a solution to this issue. However, previous works have focused only on spatial features extracted by self-supervised learning methods, without considering the temporal information between frames. In this paper, we propose a new video object segmentation model, the motion feature compensation (MFC) model, which uses self-supervised learning to extract spatial features and incorporates a motion feature, extracted from optical flow, to compensate for the missing temporal information. Additionally, we introduce an attention-based fusion method to merge the features from both modalities. Notably, for each training video, we randomly select only two consecutive frames to train our model. The Youtube-VOS and DAVIS-2017 datasets are adopted as the training dataset and the validation dataset, respectively. The experimental results demonstrate that our approach outperforms previous methods, validating our proposed design. The source code is available at: https://github.com/CVisionProcessing/MFC.
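
The abstract does not specify how the attention-based fusion is built, so the following is only a minimal illustrative sketch of the general idea, not the MFC implementation. It assumes that the appearance and motion features at one spatial location are vectors of equal length, and that each modality is scored by a hypothetical scoring vector (`w_app`, `w_mot`) that a real model would learn. A softmax over the two scores gives the attention weights used to blend the modalities.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(appearance, motion, w_app, w_mot):
    """Blend two feature vectors taken from the same spatial location.

    w_app and w_mot are hypothetical scoring vectors (assumed here for
    illustration; in a trained model they would be learned parameters).
    Returns the fused vector and the pair of attention weights.
    """
    # One scalar relevance score per modality (dot product with its scorer).
    score_app = sum(w * x for w, x in zip(w_app, appearance))
    score_mot = sum(w * x for w, x in zip(w_mot, motion))
    # Attention weights over the two modalities; they sum to 1.
    alpha_app, alpha_mot = softmax([score_app, score_mot])
    # Weighted sum of the two modalities, element by element.
    fused = [alpha_app * a + alpha_mot * m for a, m in zip(appearance, motion)]
    return fused, (alpha_app, alpha_mot)
```

With zero scoring vectors both modalities receive equal weight, so `fuse([1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0])` averages the two inputs. A scoring vector that strongly favors the motion feature pushes the fused vector toward the motion input.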

Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 11627802, 51678249, 61871188).

Author information

Authors and Affiliations

School of Electronic and Information Engineering, South China University of Technology, Guangzhou, 510640, China
Tianqi Zhang & Bo Li

Corresponding author

Correspondence to Bo Li.

Editor information

Editors and Affiliations

Democritus University of Thrace, Xanthi, Greece
Lazaros Iliadis
Democritus University of Thrace, Xanthi, Greece
Antonios Papaleonidas
Lancaster University, Lancaster, UK
Plamen Angelov
Teesside University, Middlesbrough, UK
Chrisina Jayne

About this paper

Cite this paper

Zhang, T., Li, B. (2023). Self-supervised Video Object Segmentation Using Motion Feature Compensation. In: Iliadis, L., Papaleonidas, A., Angelov, P., Jayne, C. (eds) Artificial Neural Networks and Machine Learning – ICANN 2023. ICANN 2023. Lecture Notes in Computer Science, vol 14260. Springer, Cham. https://doi.org/10.1007/978-3-031-44195-0_41

DOI: https://doi.org/10.1007/978-3-031-44195-0_41
Published: 22 September 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44194-3
Online ISBN: 978-3-031-44195-0
eBook Packages: Computer Science, Computer Science (R0)
