
Spatio-temporal compression for semi-supervised video object segmentation


Abstract

In this paper, we explore spatial–temporal redundancy in video object segmentation (VOS) under the semi-supervised setting, with the aim of improving computational efficiency. Recently, memory-based methods have attracted great attention for their excellent performance. These methods first construct an external memory to store target object information from historical frames and then select the information useful for modeling the target object through memory reading. However, because of the large amount of redundant information in memory, such methods are inefficient and unable to achieve both high accuracy and high efficiency. Moreover, they sample historical frames periodically and add them to memory; this operation may lose important information from dynamic frames with incremental object changes or aggravate temporal redundancy from static frames with no object changes. To address these problems, we propose an efficient semi-supervised VOS approach via spatio-temporal compression (termed STCVOS). Specifically, we first adopt a temporally varying sensor to adaptively filter out static frames in which the target objects do not evolve and to trigger memory updates with frames containing noticeable variations. Furthermore, we propose a spatially compressed memory that absorbs features of changed pixels and removes outdated features, which considerably reduces information redundancy. More importantly, we introduce an efficient memory reader that performs memory reading with a smaller footprint and lower computational overhead. Experimental results indicate that STCVOS performs well against state-of-the-art methods on the DAVIS 2017 and YouTube-VOS datasets, with \( {\mathcal {J}} \& {\mathcal {F}}\) overall scores of 82.0% and 79.7%, respectively. Meanwhile, STCVOS achieves a high inference speed of approximately 30 FPS.
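To make the temporal-filtering idea above concrete, the sketch below triggers a memory update only when the change between consecutive frame features exceeds a threshold. It is an illustrative approximation of the behavior described in the abstract, not the authors' actual sensor; the feature shapes, threshold, and function names are assumptions.

```python
import numpy as np

def should_update_memory(prev_feat, curr_feat, threshold=0.05):
    """Flag the current frame as 'dynamic' when its features have changed
    noticeably since the previous frame; static frames are skipped so that
    memory is not filled with temporally redundant entries.
    prev_feat, curr_feat: (H, W, C) feature maps; threshold is illustrative."""
    change = np.abs(curr_feat - prev_feat).mean()
    return change > threshold

# Toy usage: only frames with noticeable variation trigger a memory update.
memory = []
prev = np.zeros((24, 24, 256))
for t in range(5):
    step = 0.5 if t == 3 else 0.001          # frame 3 simulates a large object change
    curr = prev + step * np.random.randn(24, 24, 256)
    if should_update_memory(prev, curr):
        memory.append(curr)                  # candidate for the compressed memory
    prev = curr
```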


References

  1. Huang, Z., Zhao, H., Zhan, J., Li, H.: A multivariate intersection over union of siamrpn network for visual tracking. Vis. Comput. (2021). https://doi.org/10.1007/s00371-021-02150-1


  2. Gökstorp, S.G.E., Breckon, T.P.: Temporal and non-temporal contextual saliency analysis for generalized wide-area search within unmanned aerial vehicle (UAV) video. Vis. Comput. (2021). https://doi.org/10.1007/s00371-021-02264-6

  3. Tschiedel, M., Russold, M.F., Kaniusas, E., et al.: Real-time limb tracking in single depth images based on circle matching and line fitting. Vis. Comput. (2021). https://doi.org/10.1007/s00371-021-02138-x

  4. Li, Y., Wang, Z., Yang, X., et al.: Efficient convolutional hierarchical autoencoder for human motion prediction. Vis. Comput. (2019). https://doi.org/10.1007/s00371-019-01692-9

  5. Xu, D., Li, Z., Cao, Q.: Object-based illumination transferring and rendering for applications of mixed reality. Vis. Comput. (2021). https://doi.org/10.1007/s00371-021-02292-2

  6. Caelles, S., Maninis, K.-K., Pont-Tuset, J., Leal-Taixé, L., Cremers, D., Van Gool, L.: One-shot video object segmentation. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5320–5329 (2017). https://doi.org/10.1109/CVPR.2017.565

  7. Voigtlaender, P., Leibe, B.: Online adaptation of convolutional neural networks for video object segmentation. CoRR abs/1706.09364 (2017) arXiv:1706.09364

  8. Maninis, K., Caelles, S., Chen, Y., Pont-Tuset, J., Leal-Taixé, L., Cremers, D., Gool, L.V.: Video object segmentation without temporal information. IEEE Trans. Pattern Anal. Mach. Intell. 41, 1515–1530 (2019)


  9. Khoreva, A., Benenson, R., Ilg, E., Brox, T., Schiele, B.: Lucid data dreaming for video object segmentation. Int. J. Comput. Vision 2, 1–23 (2019)


  10. Li, X., Loy, C.C.: Video object segmentation with joint re-identification and attention-aware mask propagation. CoRR abs/1803.04242 (2018) arXiv:1803.04242

  11. Luiten, J., Voigtlaender, P., Leibe, B.: Premvos: Proposal-generation, refinement and merging for video object segmentation. CoRR abs/1807.09190 (2018) arXiv:1807.09190

  12. Perazzi, F., Khoreva, A., Benenson, R., Schiele, B., Sorkine-Hornung, A.: Learning video object segmentation from static images. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3491–3500 (2017)

  13. Yang, L., Wang, Y., Xiong, X., Yang, J., Katsaggelos, A.: Efficient video object segmentation via network modulation. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6499–6507 (2018)

  14. Oh, S., Lee, J.-Y., Sunkavalli, K., Kim, S.: Fast video object segmentation by reference-guided mask propagation. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7376–7385 (2018)

  15. Tsai, Y.-H., Yang, M.-H., Black, M.J.: Video segmentation via object flow. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3899–3908 (2016)

  16. Hu, Y., Huang, J., Schwing, A.G.: Maskrnn: Instance level video object segmentation. CoRR abs/1803.11187 (2018) arXiv:1803.11187

  17. Xiao, H., Feng, J., Lin, G., Liu, Y., Zhang, M.: Monet: Deep motion exploitation for video object segmentation. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1140–1148 (2018)

  18. Liu, W., Lin, G., Zhang, T., Liu, Z.: Guided co-segmentation network for fast video object segmentation. IEEE Trans. Circuits Syst. Video Technol. 31(4), 1607–1617 (2021). https://doi.org/10.1109/TCSVT.2020.3010293


  19. Chen, Y., Pont-Tuset, J., Montes, A., Gool, L.V.: Blazingly fast video object segmentation with pixel-wise metric learning. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1189–1198 (2018). https://doi.org/10.1109/CVPR.2018.00130

  20. Hu, Y., Huang, J., Schwing, A.G.: Videomatch: matching based video object segmentation. CoRR abs/1809.01123 (2018) arXiv:1809.01123

  21. Voigtlaender, P., Chai, Y., Schroff, F., Adam, H., Leibe, B., Chen, L.-C.: Feelvos: Fast end-to-end embedding learning for video object segmentation. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9473–9482 (2019)

  22. Yang, Z., Wei, Y., Yang, Y.: Collaborative video object segmentation by foreground-background integration. CoRR abs/2003.08333 (2020) arXiv:2003.08333

  23. Oh, S., Lee, J.-Y., Xu, N., Kim, S.: Video object segmentation using space-time memory networks. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9225–9234 (2019)

  24. Li, Y., Shen, Z., Shan, Y.: Fast video object segmentation using the global context module. CoRR abs/2001.11243 (2020) arXiv:2001.11243

  25. Seong, H., Hyun, J., Kim, E.: Kernelized memory network for video object segmentation. CoRR abs/2007.08270 (2020) arXiv:2007.08270

  26. Lu, X., Wang, W., Danelljan, M., Zhou, T., Shen, J., Gool, L.V.: Video object segmentation with episodic graph memory networks. CoRR abs/2007.07020 (2020) arXiv:2007.07020

  27. Bao, L., Wu, B., Liu, W.: Cnn in mrf: Video object segmentation via inference in a cnn-based higher-order spatio-temporal mrf. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5977–5986 (2018)

  28. Cunningham, P., Delany, S.J.: K-nearest neighbour classifiers. ACM Comput. Surv. 54(6), 7789 (2021)


  29. Zhang, Y., Wu, Z., Peng, H., Lin, S.: A transductive approach for video object segmentation. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6947–6956 (2020). https://doi.org/10.1109/CVPR42600.2020.00698

  30. Park, H., Yoo, J., Jeong, S., Venkatesh, G., Kwak, N.: Learning dynamic network using a reuse gate function in semi-supervised video object segmentation. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8401–8410 (2021)

  31. Xie, H., Yao, H., Zhou, S., Zhang, S., Sun, W.: Efficient regional memory network for video object segmentation. In: CVPR (2021)

  32. Hu, L., Zhang, P., Zhang, B., Pan, P., Xu, Y., Jin, R.: Learning position and target consistency for memory-based video object segmentation. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4142–4152 (2021). https://doi.org/10.1109/CVPR46437.2021.00413

  33. Liang, Y., Li, X., Jafari, N., Chen, J.: Video object segmentation with adaptive feature bank and uncertain-region refinement. Adv. Neural. Inf. Process. Syst. 33, 3430–3441 (2020)


  34. Wang, H., Jiang, X., Ren, H., Hu, Y., Bai, S.: Swiftnet: Real-time video object segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1296–1305 (2021)

  35. Child, R., Gray, S., Radford, A., Sutskever, I.: Generating long sequences with sparse transformers. CoRR abs/1904.10509 (2019) arXiv:1904.10509


  36. Kitaev, N., Kaiser, Ł., Levskaya, A.: Reformer: The Efficient Transformer (2020)

  37. Katharopoulos, A., Vyas, A., Pappas, N., Fleuret, F.: Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention (2020)

  38. Shen, Z., Zhang, M., Zhao, H., Yi, S., Li, H.: Efficient Attention: Attention with Linear Complexities (2020)

  39. Li, R., Su, J., Duan, C., Zheng, S.: Linear Attention Mechanism: An Efficient Attention for Semantic Segmentation (2020)

  40. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016). https://doi.org/10.1109/CVPR.2016.90

  41. Perazzi, F., Pont-Tuset, J., McWilliams, B., Gool, L., Gross, M., Sorkine-Hornung, A.: A benchmark dataset and evaluation methodology for video object segmentation. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 724–732 (2016)

  42. Pont-Tuset, J., Perazzi, F., Caelles, S., Arbelaez, P., Sorkine-Hornung, A., Gool, L.V.: The 2017 DAVIS challenge on video object segmentation. CoRR abs/1704.00675 (2017) arXiv:1704.00675

  43. Xu, N., Yang, L., Fan, Y., Yue, D., Liang, Y., Yang, J., Huang, T.S.: Youtube-vos: A large-scale video object segmentation benchmark. CoRR abs/1809.03327 (2018) arXiv:1809.03327

  44. Johnander, J., Danelljan, M., Brissman, E., Khan, F.S., Felsberg, M.: A generative appearance model for end-to-end video object segmentation. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8945–8954 (2019). https://doi.org/10.1109/CVPR.2019.00916

  45. Wang, Z., Xu, J., Liu, L., Zhu, F., Shao, L.: Ranet: Ranking attention network for fast video object segmentation. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3977–3986 (2019). https://doi.org/10.1109/ICCV.2019.00408

  46. Seong, H., Oh, S.W., Lee, J.-Y., Lee, S., Lee, S., Kim, E.: Hierarchical memory matching network for video object segmentation. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 12869–12878 (2021)

  47. Cho, S., Lee, H., Kim, M., Jang, S., Lee, S.: Pixel-level bijective matching for video object segmentation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 129–138 (2022)


Acknowledgements

This work was partially supported by the National Natural Science Foundation of China (Grant Nos. 62072449, 61802197, 61632003); in part by the Science and Technology Development Fund, Macau SAR (Grant No. 0018/2019/AKP and SKL-IOTSC(UM)-2021-2023); in part by the Guangdong Science and Technology Department (Grant No. 2018B030324002); and in part by the Zhuhai Science and Technology Innovation Bureau Zhuhai-Hong Kong-Macau Special Cooperation Project (Grant No. ZH22017002200001PWC).

Author information


Corresponding author

Correspondence to Yadang Chen.

Ethics declarations

Conflict of interest

We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A Equivalent derivation of the formula

In this appendix, we provide a mathematical derivation of the efficient memory reader (EMR) discussed in Sect. 3.3. EMR is an attention mechanism that closely approximates dot-product attention while being substantially faster in terms of memory reading.

Given the memory \(M_{t}\) with embedded keys \(K_{M}=\left\{ K_{M}\left( j\right) \right\} \in R^{k\times C/8}\) and query keys \(K_{Q}=\left\{ K_{Q}\left( i\right) \right\} \in R^{H\times W\times C/8}\), and using the softmax normalization adopted in STM [23], the read-out at query position i is the weighted sum:

$$\begin{aligned} \hat{V_{Q}} \left( i\right) =\sum _{j} \frac{e^{K_{M}\left( j\right) K_{Q}\left( i\right) ^{T} }}{\sum \nolimits _{j} e^{K_{M}\left( j\right) K_{Q}\left( i\right) ^{T} }} V_{M}\left( j\right) , \end{aligned}$$
(A1)

where H, W, and C denote the height, width, and channel size of the ResNet-50 [40] feature map, respectively, k is the number of pixel features in \(M_{t}\), and \(V_{M}=\left\{ V_{M}\left( j\right) \right\} \in R^{k\times C/2}\) denotes the embedded values of the memory. Eq. (A1) can then be generalized to any non-negative normalization function sim:

$$\begin{aligned} \begin{gathered} \hat{V_{Q}} \left( i\right) =\sum _{j} \frac{sim\left( K_{M}\left( j\right) ,K_{Q}\left( i\right) \right) }{\sum \nolimits _{j} sim\left( K_{M}\left( j\right) ,K_{Q}\left( i\right) \right) } V_{M}\left( j\right) ,\\ sim\left( K_{M}\left( j\right) ,K_{Q}\left( i\right) \right) \ge 0. \end{gathered} \end{aligned}$$
(A2)
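For concreteness, a minimal NumPy sketch of the softmax memory read in Eq. (A1) is given below; the array names and feature sizes are illustrative assumptions rather than details from the paper. Its cost and memory footprint grow with k x HW because the full affinity matrix must be formed.

```python
import numpy as np

def softmax_memory_read(K_M, V_M, K_Q):
    """Dot-product memory read with softmax normalization (Eq. A1).
    K_M: (k, C/8) memory keys, V_M: (k, C/2) memory values,
    K_Q: (HW, C/8) query keys flattened over spatial positions.
    Returns read-out values of shape (HW, C/2)."""
    logits = K_Q @ K_M.T                             # (HW, k) affinity matrix
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)    # softmax over memory features
    return weights @ V_M                             # (HW, C/2)

# Illustrative shapes: k memory features, HW query positions, C backbone channels.
k, HW, C = 2048, 24 * 24, 256
out = softmax_memory_read(np.random.randn(k, C // 8),
                          np.random.randn(k, C // 2),
                          np.random.randn(HW, C // 8))
```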

We then obtain a linear attention mechanism by applying a first-order Taylor expansion to the exponential in Eq. (A1):

$$\begin{aligned} e^{K_{M}\left( j\right) K_{Q}\left( i\right) ^{T} }\approx 1+K_{M}\left( j\right) K_{Q}\left( i\right) ^{T}. \end{aligned}$$
(A3)

However, this approximation alone cannot guarantee non-negativity. We therefore normalize \(K_{M}\left( j\right) \) and \(K_{Q}\left( i\right) \) by their \(l_{2}\) norms, which ensures that the inner product of the normalized keys is at least \(-1\) (Cauchy-Schwarz), and rewrite Eq. (A3) as:

$$\begin{aligned} sim\left( K_{M}\left( j\right) ,K_{Q}\left( i\right) \right) =1+l_{M}\left( j\right) l_{Q}\left( i\right) ^{T}, \end{aligned}$$
(A4)

where \(l_{M}\left( j\right) =\frac{K_{M}\left( j\right) }{\parallel K_{M}\left( j\right) \parallel _{2} } \) and \(l_{Q}\left( i\right) =\frac{K_{Q}\left( i\right) }{\parallel K_{Q}\left( i\right) \parallel _{2} } \). Then, Eq. (A2) can be rewritten as:

$$\begin{aligned} \hat{V_{Q}} \left( i\right) =\frac{\sum _{j} \left( 1+l_{M}\left( j\right) l_{Q}\left( i\right) ^{T} \right) V_{M}\left( j\right) }{\sum \nolimits _{j} \left( 1+l_{M}\left( j\right) l_{Q}\left( i\right) ^{T} \right) }, \end{aligned}$$
(A5)

and simplified as:

$$\begin{aligned} \hat{V_{Q}} \left( i\right) =\frac{\sum _{j} V_{M}\left( j\right) +l_{Q}\left( i\right) ^{T} \left( \sum _{j} l_{M}\left( j\right) \left( V_{M}\left( j\right) \right) ^{T} \right) }{N+l_{Q}\left( i\right) ^{T} \sum \nolimits _{j} l_{M}\left( j\right) }. \end{aligned}$$
(A6)

Since the sums over j in Eq. (A6) can be computed once and shared by all query positions, the computational and memory complexity of memory reading drops to \(O\left( N\right) \), i.e., linear in the number of features, which greatly improves efficiency while serving the same role as the memory reader of STM.
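As an illustration, Eq. (A6) can be evaluated without ever materializing the k x HW affinity matrix: the sums over j are computed once and reused for every query position. The sketch below follows the linear-attention formulation of [37-39] and reuses the illustrative shapes from the previous snippet; it is not the authors' implementation.

```python
import numpy as np

def linear_memory_read(K_M, V_M, K_Q, eps=1e-6):
    """Linear-attention memory read corresponding to Eq. (A6).
    K_M: (k, C/8) memory keys, V_M: (k, C/2) memory values,
    K_Q: (HW, C/8) query keys. Cost is linear in k and HW."""
    l_M = K_M / (np.linalg.norm(K_M, axis=1, keepdims=True) + eps)  # l2-normalized memory keys
    l_Q = K_Q / (np.linalg.norm(K_Q, axis=1, keepdims=True) + eps)  # l2-normalized query keys

    sum_V = V_M.sum(axis=0)     # (C/2,)      sum_j V_M(j), shared by all queries
    S = l_M.T @ V_M             # (C/8, C/2)  sum_j l_M(j)^T V_M(j), shared by all queries
    z = l_M.sum(axis=0)         # (C/8,)      sum_j l_M(j), shared by all queries

    numer = sum_V[None, :] + l_Q @ S        # (HW, C/2) numerator of Eq. (A6)
    denom = K_M.shape[0] + l_Q @ z          # (HW,)     denominator: N + l_Q . sum_j l_M(j)
    return numer / denom[:, None]

k, HW, C = 2048, 24 * 24, 256
out = linear_memory_read(np.random.randn(k, C // 8),
                         np.random.randn(k, C // 2),
                         np.random.randn(HW, C // 8))   # (HW, C/2), no (HW, k) matrix formed
```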

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Ji, C., Chen, Y., Yang, ZX. et al. Spatio-temporal compression for semi-supervised video object segmentation. Vis Comput 39, 4929–4942 (2023). https://doi.org/10.1007/s00371-022-02638-4

