Spatial and Temporal Guidance for Semi-supervised Video Object Segmentation

Li, Guoqiang; Gong, Shengrong; Zhong, Shan; Zhou, Lifan

doi:10.1007/978-3-031-30111-7_9

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13625)

Included in the following conference series:

International Conference on Neural Information Processing

Abstract

Semi-supervised video object segmentation aims to segment an object throughout a video when only the annotated mask of the first frame is given. Memory-based methods have recently attracted increasing attention because of their significant performance improvements. However, these methods match pixels by similarity alone, without considering the trajectory or the features of the object, which can cause mismatches between object and non-object regions in complex scenes. To alleviate this problem, we propose spatial and temporal guidance for semi-supervised video object segmentation. The proposed method takes into account the consistency of the object in the spatiotemporal domain and employs global matching to conduct pixel-level matching. We design a spatial guidance module (SGM) to track the trajectory of the object and a temporal guidance module (TGM) to focus on long-term object-level features from the first frame. Together, the spatial and temporal guidance effectively alleviates mismatching and makes the model more robust and efficient. Experiments on the YouTube-VOS and DAVIS benchmarks show that our method outperforms previous state-of-the-art methods while maintaining a fast inference speed.
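
The pixel-level matching by similarity that the abstract refers to can be illustrated with a minimal sketch. It assumes scaled dot-product similarity followed by a softmax over all memory pixels, which is a generic memory read in the style of space-time memory networks and not the authors' SGM or TGM. The function name global_matching_read, its arguments, and the toy keys and values are hypothetical and chosen only for illustration.

```python
import math


def global_matching_read(mem_keys, mem_vals, qry_keys):
    """Soft-read memory values for each query pixel (illustrative only).

    mem_keys: list of N key vectors (length C), one per memory pixel
    mem_vals: list of N value vectors (length D), one per memory pixel
    qry_keys: list of M key vectors (length C), one per query pixel
    Returns (readout, affinity): M value vectors and an M x N weight matrix.
    """
    scale = 1.0 / math.sqrt(len(mem_keys[0]))  # assumed scaling, not from the paper
    dim = len(mem_vals[0])
    readout, affinity = [], []
    for q in qry_keys:
        # Similarity between this query pixel and every memory pixel.
        sims = [scale * sum(a * b for a, b in zip(q, k)) for k in mem_keys]
        # Numerically stable softmax over all memory pixels (global matching).
        top = max(sims)
        exps = [math.exp(s - top) for s in sims]
        total = sum(exps)
        weights = [e / total for e in exps]
        affinity.append(weights)
        # Weighted sum of memory values gives the readout for this query pixel.
        readout.append(
            [sum(w * v[d] for w, v in zip(weights, mem_vals)) for d in range(dim)]
        )
    return readout, affinity


# Toy usage: two memory pixels, value 1.0 marks "object", 0.0 marks "background".
mem_keys = [[10.0, 0.0], [0.0, 10.0]]
mem_vals = [[1.0], [0.0]]
readout, affinity = global_matching_read(mem_keys, mem_vals, [[0.0, 10.0]])
```

Because every query pixel is compared against every memory pixel purely by key similarity, a background pixel whose key resembles the object's key receives a high weight. That is the kind of mismatch the abstract says the spatial and temporal guidance is meant to reduce.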

Acknowledgements

This work was supported by the National Natural Science Foundation of China (61972059, 61773272, 62102347), the China Postdoctoral Science Foundation (2021M69236), the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172017K18), the Natural Science Foundation of Jiangsu Province (BK20191474, BK20191475, BK20161268), and the Qinglan Project of Jiangsu Province (No. 2020).

Author information

Authors and Affiliations

Soochow University, Soochow, China
Guoqiang Li & Shengrong Gong
Changshu Institute of Technology, Soochow, China
Shengrong Gong, Shan Zhong & Lifan Zhou

Corresponding author

Correspondence to Shengrong Gong.

Editor information

Editors and Affiliations

Indian Institute of Technology Indore, Indore, India
Mohammad Tanveer
Indian Institute of Information Technology - Allahabad, Prayagraj, India
Sonali Agarwal
Kobe University, Kobe, Japan
Seiichi Ozawa
Indian Institute of Technology Patna, Patna, India
Asif Ekbal
University of Innsbruck, Innsbruck, Austria
Adam Jatowt

About this paper

Cite this paper

Li, G., Gong, S., Zhong, S., Zhou, L. (2023). Spatial and Temporal Guidance for Semi-supervised Video Object Segmentation. In: Tanveer, M., Agarwal, S., Ozawa, S., Ekbal, A., Jatowt, A. (eds) Neural Information Processing. ICONIP 2022. Lecture Notes in Computer Science, vol 13625. Springer, Cham. https://doi.org/10.1007/978-3-031-30111-7_9

DOI: https://doi.org/10.1007/978-3-031-30111-7_9
Published: 13 April 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-30110-0
Online ISBN: 978-3-031-30111-7
eBook Packages: Computer Science, Computer Science (R0)
