Spatial and Temporal Guidance for Semi-supervised Video Object Segmentation

  • Conference paper
Neural Information Processing (ICONIP 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13625)

Abstract

Semi-supervised video object segmentation aims to segment an object throughout a video when only the annotated mask of the first frame is given. Recently, memory-based methods have attracted increasing attention because of their significant performance improvements. However, these methods perform pixel-level matching based on similarity alone, without considering the trajectory or the features of the object, which may cause mismatches between object and non-object regions in complex scenarios. To alleviate this problem, we propose spatial and temporal guidance for semi-supervised video object segmentation. The proposed method takes into account the consistency of the object in the spatiotemporal domain and employs global matching to conduct pixel-level matching. Moreover, we design a spatial guidance module (SGM) to track the trajectory of the object and a temporal guidance module (TGM) to focus on long-term object-level features from the first frame. The proposed spatial and temporal guidance effectively alleviates mismatching and makes the model more robust and efficient. Experiments on the YouTube-VOS and DAVIS benchmarks show that our method outperforms previous state-of-the-art methods with a fast inference speed.
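To make the pixel-level matching mentioned in the abstract concrete, the following is a minimal NumPy sketch of a generic similarity-based memory readout of the kind used by memory-based methods. It is an illustration under stated assumptions, not the implementation of this paper: the function name `global_matching_readout`, the dot-product similarity, and the tensor layouts are all hypothetical choices, and the SGM and TGM modules are not modeled here.

```python
import numpy as np


def global_matching_readout(query_key, memory_key, memory_value):
    """Similarity-based memory readout (illustrative sketch only).

    query_key:    (C, Nq) key features of the Nq pixels in the current frame
    memory_key:   (C, Nm) key features of the Nm pixels stored in memory
    memory_value: (D, Nm) value features (e.g. mask evidence) per memory pixel
    Returns a (D, Nq) array: one aggregated value vector per query pixel.
    """
    # Dot-product similarity between every memory pixel and every query pixel.
    affinity = memory_key.T @ query_key  # (Nm, Nq)

    # Softmax over the memory axis, shifted by the column maximum for stability.
    affinity = affinity - affinity.max(axis=0, keepdims=True)
    weights = np.exp(affinity)
    weights /= weights.sum(axis=0, keepdims=True)

    # Each query pixel reads a similarity-weighted mix of memory values.
    return memory_value @ weights  # (D, Nq)


if __name__ == "__main__":
    # Three memory pixels with well-separated keys and scalar values 1, 2, 3.
    memory_key = 10.0 * np.eye(3)
    memory_value = np.array([[1.0, 2.0, 3.0]])
    # Two query pixels whose keys match memory pixels 2 and 0, respectively.
    query_key = memory_key[:, [2, 0]]
    print(global_matching_readout(query_key, memory_key, memory_value))
```

Because the readout depends only on feature similarity, a background pixel whose key resembles an object pixel's key would receive object evidence regardless of where it lies in the frame. This is the mismatching behavior that the abstract says the spatial and temporal guidance is intended to alleviate.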

Acknowledgements

This work was supported by the National Natural Science Foundation of China (61972059, 61773272, 62102347), the China Postdoctoral Science Foundation (2021M69236), the Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, Jilin University (93K172017K18), the Natural Science Foundation of Jiangsu Province (BK20191474, BK20191475, BK20161268), and the Qinglan Project of Jiangsu Province (No. 2020).

Corresponding author

Correspondence to Shengrong Gong.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Li, G., Gong, S., Zhong, S., Zhou, L. (2023). Spatial and Temporal Guidance for Semi-supervised Video Object Segmentation. In: Tanveer, M., Agarwal, S., Ozawa, S., Ekbal, A., Jatowt, A. (eds) Neural Information Processing. ICONIP 2022. Lecture Notes in Computer Science, vol 13625. Springer, Cham. https://doi.org/10.1007/978-3-031-30111-7_9

  • DOI: https://doi.org/10.1007/978-3-031-30111-7_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-30110-0

  • Online ISBN: 978-3-031-30111-7

  • eBook Packages: Computer Science, Computer Science (R0)
