Abstract
In this paper, we explore spatio-temporal redundancy in semi-supervised video object segmentation (VOS) with the aim of improving computational efficiency. Recently, memory-based methods have attracted great attention for their excellent performance. These methods first construct an external memory that stores target object information from historical frames and then select, via memory reading, the information beneficial for modeling the target object. However, because of the large amount of redundant information held in memory, such methods cannot achieve both high accuracy and high efficiency. Moreover, they sample historical frames at a fixed interval and add them to memory; this strategy may discard important information from dynamic frames in which the object changes incrementally, or aggravate temporal redundancy by storing static frames in which the object does not change. To address these problems, we propose an efficient semi-supervised VOS approach via spatio-temporal compression (termed STCVOS). Specifically, we first adopt a temporally varying sensor that adaptively filters out static frames with no target object evolution and triggers a memory update for frames containing noticeable variations. Furthermore, we propose a spatially compressed memory that absorbs the features of changed pixels and removes outdated ones, considerably reducing information redundancy. More importantly, we introduce an efficient memory reader that performs memory reading with a smaller memory footprint and lower computational overhead. Experimental results indicate that STCVOS performs well against state-of-the-art methods on the DAVIS 2017 and YouTube-VOS datasets, with overall \( {\mathcal {J}} \& {\mathcal {F}}\) scores of 82.0% and 79.7%, respectively. Meanwhile, STCVOS achieves a high inference speed of approximately 30 FPS.
References
Huang, Z., Zhao, H., Zhan, J., Li, H.: A multivariate intersection over union of SiamRPN network for visual tracking. Vis. Comput. (2021). https://doi.org/10.1007/s00371-021-02150-1
Gökstorp, S.G.E., Breckon, T.P.: Temporal and non-temporal contextual saliency analysis for generalized wide-area search within unmanned aerial vehicle (UAV) video. Vis. Comput. (2021). https://doi.org/10.1007/s00371-021-02264-6
Tschiedel, M., Russold, M.F., Kaniusas, E., et al.: Real-time limb tracking in single depth images based on circle matching and line fitting. Vis. Comput. (2021). https://doi.org/10.1007/s00371-021-02138-x
Li, Y., Wang, Z., Yang, X., et al.: Efficient convolutional hierarchical autoencoder for human motion prediction. Vis. Comput. (2019). https://doi.org/10.1007/s00371-019-01692-9
Xu, D., et al.: Object-based illumination transferring and rendering for applications of mixed reality. Vis. Comput. (2021). https://doi.org/10.1007/s00371-021-02292-2
Caelles, S., Maninis, K.-K., Pont-Tuset, J., Leal-Taixé, L., Cremers, D., Van Gool, L.: One-shot video object segmentation. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5320–5329 (2017). https://doi.org/10.1109/CVPR.2017.565
Voigtlaender, P., Leibe, B.: Online adaptation of convolutional neural networks for video object segmentation. CoRR abs/1706.09364 (2017) arXiv:1706.09364
Maninis, K., Caelles, S., Chen, Y., Pont-Tuset, J., Leal-Taixé, L., Cremers, D., Gool, L.V.: Video object segmentation without temporal information. IEEE Trans. Pattern Anal. Mach. Intell. 41, 1515–1530 (2019)
Khoreva, A., Benenson, R., Ilg, E., Brox, T., Schiele, B.: Lucid data dreaming for video object segmentation. Int. J. Comput. Vis. 127(9), 1175–1197 (2019)
Li, X., Loy, C.C.: Video object segmentation with joint re-identification and attention-aware mask propagation. CoRR abs/1803.04242 (2018) arXiv:1803.04242
Luiten, J., Voigtlaender, P., Leibe, B.: Premvos: Proposal-generation, refinement and merging for video object segmentation. CoRR abs/1807.09190 (2018) arXiv:1807.09190
Perazzi, F., Khoreva, A., Benenson, R., Schiele, B., Sorkine-Hornung, A.: Learning video object segmentation from static images. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3491–3500 (2017)
Yang, L., Wang, Y., Xiong, X., Yang, J., Katsaggelos, A.: Efficient video object segmentation via network modulation. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6499–6507 (2018)
Oh, S., Lee, J.-Y., Sunkavalli, K., Kim, S.: Fast video object segmentation by reference-guided mask propagation. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7376–7385 (2018)
Tsai, Y.-H., Yang, M.-H., Black, M.J.: Video segmentation via object flow. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3899–3908 (2016)
Hu, Y., Huang, J., Schwing, A.G.: Maskrnn: Instance level video object segmentation. CoRR abs/1803.11187 (2018) arXiv:1803.11187
Xiao, H., Feng, J., Lin, G., Liu, Y., Zhang, M.: Monet: Deep motion exploitation for video object segmentation. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1140–1148 (2018)
Liu, W., Lin, G., Zhang, T., Liu, Z.: Guided co-segmentation network for fast video object segmentation. IEEE Trans. Circuits Syst. Video Technol. 31(4), 1607–1617 (2021). https://doi.org/10.1109/TCSVT.2020.3010293
Chen, Y., Pont-Tuset, J., Montes, A., Gool, L.V.: Blazingly fast video object segmentation with pixel-wise metric learning. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1189–1198 (2018). https://doi.org/10.1109/CVPR.2018.00130
Hu, Y., Huang, J., Schwing, A.G.: Videomatch: matching based video object segmentation. CoRR abs/1809.01123 (2018) arXiv:1809.01123
Voigtlaender, P., Chai, Y., Schroff, F., Adam, H., Leibe, B., Chen, L.-C.: Feelvos: Fast end-to-end embedding learning for video object segmentation. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9473–9482 (2019)
Yang, Z., Wei, Y., Yang, Y.: Collaborative video object segmentation by foreground-background integration. CoRR abs/2003.08333 (2020) arXiv:2003.08333
Oh, S., Lee, J.-Y., Xu, N., Kim, S.: Video object segmentation using space-time memory networks. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9225–9234 (2019)
Li, Y., Shen, Z., Shan, Y.: Fast video object segmentation using the global context module. CoRR abs/2001.11243 (2020) arXiv:2001.11243
Seong, H., Hyun, J., Kim, E.: Kernelized memory network for video object segmentation. CoRR abs/2007.08270 (2020) arXiv:2007.08270
Lu, X., Wang, W., Danelljan, M., Zhou, T., Shen, J., Gool, L.V.: Video object segmentation with episodic graph memory networks. CoRR abs/2007.07020 (2020) arXiv:2007.07020
Bao, L., Wu, B., Liu, W.: Cnn in mrf: Video object segmentation via inference in a cnn-based higher-order spatio-temporal mrf. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5977–5986 (2018)
Cunningham, P., Delany, S.J.: K-nearest neighbour classifiers. ACM Comput. Surv. 54(6) (2021)
Zhang, Y., Wu, Z., Peng, H., Lin, S.: A transductive approach for video object segmentation. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6947–6956 (2020). https://doi.org/10.1109/CVPR42600.2020.00698
Park, H., Yoo, J., Jeong, S., Venkatesh, G., Kwak, N.: Learning dynamic network using a reuse gate function in semi-supervised video object segmentation. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8401–8410 (2021)
Xie, H., Yao, H., Zhou, S., Zhang, S., Sun, W.: Efficient regional memory network for video object segmentation. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)
Hu, L., Zhang, P., Zhang, B., Pan, P., Xu, Y., Jin, R.: Learning position and target consistency for memory-based video object segmentation. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4142–4152 (2021). https://doi.org/10.1109/CVPR46437.2021.00413
Liang, Y., Li, X., Jafari, N., Chen, J.: Video object segmentation with adaptive feature bank and uncertain-region refinement. Adv. Neural. Inf. Process. Syst. 33, 3430–3441 (2020)
Wang, H., Jiang, X., Ren, H., Hu, Y., Bai, S.: Swiftnet: Real-time video object segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1296–1305 (2021)
Child, R., Gray, S., Radford, A., Sutskever, I.: Generating long sequences with sparse transformers. CoRR abs/1904.10509 (2019) arXiv:1904.10509
Kitaev, N., Kaiser, Ł., Levskaya, A.: Reformer: the efficient transformer. In: International Conference on Learning Representations (ICLR) (2020)
Katharopoulos, A., Vyas, A., Pappas, N., Fleuret, F.: Transformers are RNNs: fast autoregressive transformers with linear attention. In: International Conference on Machine Learning (ICML) (2020)
Shen, Z., Zhang, M., Zhao, H., Yi, S., Li, H.: Efficient attention: attention with linear complexities. CoRR abs/1812.01243 (2020) arXiv:1812.01243
Li, R., Su, J., Duan, C., Zheng, S.: Linear attention mechanism: an efficient attention for semantic segmentation. CoRR abs/2007.14902 (2020) arXiv:2007.14902
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016). https://doi.org/10.1109/CVPR.2016.90
Perazzi, F., Pont-Tuset, J., McWilliams, B., Gool, L., Gross, M., Sorkine-Hornung, A.: A benchmark dataset and evaluation methodology for video object segmentation. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 724–732 (2016)
Pont-Tuset, J., Perazzi, F., Caelles, S., Arbelaez, P., Sorkine-Hornung, A., Gool, L.V.: The 2017 DAVIS challenge on video object segmentation. CoRR abs/1704.00675 (2017) arXiv:1704.00675
Xu, N., Yang, L., Fan, Y., Yue, D., Liang, Y., Yang, J., Huang, T.S.: Youtube-vos: A large-scale video object segmentation benchmark. CoRR abs/1809.03327 (2018) arXiv:1809.03327
Johnander, J., Danelljan, M., Brissman, E., Khan, F.S., Felsberg, M.: A generative appearance model for end-to-end video object segmentation. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8945–8954 (2019). https://doi.org/10.1109/CVPR.2019.00916
Wang, Z., Xu, J., Liu, L., Zhu, F., Shao, L.: Ranet: Ranking attention network for fast video object segmentation. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3977–3986 (2019). https://doi.org/10.1109/ICCV.2019.00408
Seong, H., Oh, S.W., Lee, J.-Y., Lee, S., Lee, S., Kim, E.: Hierarchical memory matching network for video object segmentation. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 12869–12878 (2021)
Cho, S., Lee, H., Kim, M., Jang, S., Lee, S.: Pixel-level bijective matching for video object segmentation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 129–138 (2022)
Acknowledgements
This work was partially supported by the National Natural Science Foundation of China (Grant Nos. 62072449, 61802197, 61632003), in part by the Science and Technology Development Fund, Macau SAR (Grant Nos. 0018/2019/AKP and SKL-IOTSC(UM)-2021-2023), in part by the Guangdong Science and Technology Department (Grant No. 2018B030324002), and in part by the Zhuhai Science and Technology Innovation Bureau Zhuhai-Hong Kong-Macau Special Cooperation Project (Grant No. ZH22017002200001PWC).
Ethics declarations
Conflict of interest
We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix A Equivalent derivation of the formula
In this appendix, we provide a mathematical derivation of the EMR discussed in Sect. 3.3. EMR is an efficient attention mechanism that reproduces the effect of dot-product attention while being substantially faster in terms of memory reading.
Given the memory \(M_{t}\) with the embedded key \(K_{M}=\left\{ K_{M}\left( j\right) \right\} \in R^{k\times C/8}\) and query key \(K_{Q}=\left\{ K_{Q}\left( i\right) \right\} \in R^{H\times W\times C/8}\), and using the softmax adopted in STM [23] as the normalization function, the memory read-out at query position i can be written as:
\[ y\left( i\right) =\sum _{j=1}^{k}\frac{\exp \left( K_{M}\left( j\right) K_{Q}\left( i\right) ^{T}\right) }{\sum _{n=1}^{k}\exp \left( K_{M}\left( n\right) K_{Q}\left( i\right) ^{T}\right) }\, V_{M}\left( j\right) \tag{A1} \]
where H, W, and C denote the height, width, and number of channels of the ResNet-50 [40] feature maps, respectively; k represents the number of pixel features in \(M_{t}\); and \(V_{M}=\left\{ V_{M}\left( j\right) \right\} \in R^{k\times C/2}\) denotes the embedded value of the memory. Eq. A1 can be generalized to an arbitrary normalization function, formulated as sim:
\[ y\left( i\right) =\frac{\sum _{j=1}^{k}{\text {sim}}\left( K_{M}\left( j\right) ,K_{Q}\left( i\right) \right) V_{M}\left( j\right) }{\sum _{j=1}^{k}{\text {sim}}\left( K_{M}\left( j\right) ,K_{Q}\left( i\right) \right) } \tag{A2} \]
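To make the quadratic cost of this read concrete, the following toy NumPy sketch evaluates Eqs. A1–A2 with sim set to the exponential; all array names and sizes here are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
N, k, d, c = 6, 5, 8, 4            # toy sizes: N = H*W query pixels, k memory pixels, key/value dims
K_Q = rng.standard_normal((N, d))  # query key, flattened from (H, W, C/8)
K_M = rng.standard_normal((k, d))  # memory key
V_M = rng.standard_normal((k, c))  # memory value

# Eq. A1: every query pixel attends to every memory pixel -> an (N, k) matrix
logits = K_Q @ K_M.T
weights = np.exp(logits - logits.max(axis=1, keepdims=True))  # numerically stable exponential
weights /= weights.sum(axis=1, keepdims=True)                 # softmax over the k memory locations
y = weights @ V_M                                             # (N, c) read-out; O(N*k) time and memory
```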
We then propose a linear attention mechanism using a first-order Taylor expansion of the exponential in Eq. A1:
\[ {\text {sim}}\left( K_{M}\left( j\right) ,K_{Q}\left( i\right) \right) \approx 1+K_{M}\left( j\right) K_{Q}\left( i\right) ^{T} \tag{A3} \]
However, the above approximation cannot guarantee non-negativity. Therefore, we normalize \(K_{M}\left( j\right) \) and \(K_{Q}\left( i\right) \) by the \(l_{2}\) norm to ensure \(l_{M}\left( j\right) l_{Q}\left( i\right) ^{T} \ge -1\), and write Eq. A3 as:
\[ {\text {sim}}\left( K_{M}\left( j\right) ,K_{Q}\left( i\right) \right) =1+l_{M}\left( j\right) l_{Q}\left( i\right) ^{T} \tag{A4} \]
where \(l_{M}\left( j\right) =\frac{K_{M}\left( j\right) }{\parallel K_{M}\left( j\right) \parallel _{2} } \) and \(l_{Q}\left( i\right) =\frac{K_{Q}\left( i\right) }{\parallel K_{Q}\left( i\right) \parallel _{2} } \). Then, Eq. A2 can be rewritten as:
\[ y\left( i\right) =\frac{\sum _{j=1}^{k}\left( 1+l_{M}\left( j\right) l_{Q}\left( i\right) ^{T}\right) V_{M}\left( j\right) }{\sum _{j=1}^{k}\left( 1+l_{M}\left( j\right) l_{Q}\left( i\right) ^{T}\right) } \tag{A5} \]
and simplified, by aggregating the memory-side sums once and reusing them for every query, as:
\[ y\left( i\right) =\frac{\sum _{j=1}^{k}V_{M}\left( j\right) +l_{Q}\left( i\right) \sum _{j=1}^{k}l_{M}\left( j\right) ^{T}V_{M}\left( j\right) }{k+l_{Q}\left( i\right) \sum _{j=1}^{k}l_{M}\left( j\right) ^{T}} \tag{A6} \]
The computational and memory complexity of Eq. A6 is \(O\left( N\right) \), which greatly improves efficiency while achieving the same effect as the memory reader of STM.
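As a sanity check on the derivation, the following NumPy sketch (again illustrative rather than the authors' code) evaluates the read-out directly via Eq. A5 and in the factored form of Eq. A6; the two results agree exactly, while the factored form never materializes the \(N\times k\) similarity matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
N, k, d, c = 6, 5, 8, 4
K_Q = rng.standard_normal((N, d))
K_M = rng.standard_normal((k, d))
V_M = rng.standard_normal((k, c))

# l2-normalize the keys so that 1 + l_M(j) l_Q(i)^T stays non-negative (Eq. A4)
l_Q = K_Q / np.linalg.norm(K_Q, axis=1, keepdims=True)
l_M = K_M / np.linalg.norm(K_M, axis=1, keepdims=True)

# Eq. A5: direct evaluation, O(N*k) -- builds the full similarity matrix
sim = 1.0 + l_Q @ l_M.T                                 # (N, k), entries in [0, 2]
y_direct = (sim @ V_M) / sim.sum(axis=1, keepdims=True)

# Eq. A6: factored evaluation, O(N + k) -- memory-side sums computed once, shared by all queries
sum_V = V_M.sum(axis=0)                                 # (c,)   = sum_j V_M(j)
kv = l_M.T @ V_M                                        # (d, c) = sum_j l_M(j)^T V_M(j)
denom = k + l_Q @ l_M.sum(axis=0)                       # (N,)   = sum_j (1 + l_M(j) l_Q(i)^T)
y_factored = (sum_V + l_Q @ kv) / denom[:, None]

assert np.allclose(y_direct, y_factored)                # algebraically identical read-outs
```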
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ji, C., Chen, Y., Yang, ZX. et al. Spatio-temporal compression for semi-supervised video object segmentation. Vis Comput 39, 4929–4942 (2023). https://doi.org/10.1007/s00371-022-02638-4
DOI: https://doi.org/10.1007/s00371-022-02638-4