Abstract
In this paper, we explore spatio-temporal redundancy in semi-supervised video object segmentation (VOS) with the aim of improving computational efficiency. Recently, memory-based methods have attracted great attention for their excellent performance. These methods first construct an external memory that stores target object information from historical frames and then select, via memory reading, the information beneficial for modeling the target object. However, because of the large amount of redundant information held in memory, such methods cannot achieve both high accuracy and high efficiency. Moreover, they sample historical frames at a fixed interval and add them to memory; this strategy may discard important information from dynamic frames in which the object changes incrementally, or aggravate temporal redundancy by storing static frames in which the object does not change. To address these problems, we propose an efficient semi-supervised VOS approach via spatio-temporal compression (termed STCVOS). Specifically, we first adopt a temporally varying sensor that adaptively filters out static frames with no target object evolution and triggers a memory update for frames containing noticeable variations. Furthermore, we propose a spatially compressed memory that absorbs the features of changed pixels and removes outdated ones, considerably reducing information redundancy. More importantly, we introduce an efficient memory reader that performs memory reading with a smaller memory footprint and lower computational overhead. Experimental results indicate that STCVOS performs well against state-of-the-art methods on the DAVIS 2017 and YouTube-VOS datasets, with overall \( {\mathcal {J}} \& {\mathcal {F}}\) scores of 82.0% and 79.7%, respectively. Meanwhile, STCVOS achieves a high inference speed of approximately 30 FPS.
References
Huang, Z., Zhao, H., Zhan, J., Li, H.: A multivariate intersection over union of SiamRPN network for visual tracking. Vis. Comput. (2021). https://doi.org/10.1007/s00371-021-02150-1
Gökstorp, S.G.E., Breckon, T.P.: Temporal and non-temporal contextual saliency analysis for generalized wide-area search within unmanned aerial vehicle (UAV) video. Vis. Comput. (2021). https://doi.org/10.1007/s00371-021-02264-6
Tschiedel, M., Russold, M.F., Kaniusas, E., et al.: Real-time limb tracking in single depth images based on circle matching and line fitting. Vis. Comput. (2021). https://doi.org/10.1007/s00371-021-02138-x
Li, Y., Wang, Z., Yang, X., et al.: Efficient convolutional hierarchical autoencoder for human motion prediction. Vis. Comput. (2019). https://doi.org/10.1007/s00371-019-01692-9
Xu, D., et al.: Object-based illumination transferring and rendering for applications of mixed reality. Vis. Comput. (2021). https://doi.org/10.1007/s00371-021-02292-2
Caelles, S., Maninis, K.-K., Pont-Tuset, J., Leal-Taixé, L., Cremers, D., Van Gool, L.: One-shot video object segmentation. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5320–5329 (2017). https://doi.org/10.1109/CVPR.2017.565
Voigtlaender, P., Leibe, B.: Online adaptation of convolutional neural networks for video object segmentation. CoRR abs/1706.09364 (2017) arXiv:1706.09364
Maninis, K., Caelles, S., Chen, Y., Pont-Tuset, J., Leal-Taixé, L., Cremers, D., Gool, L.V.: Video object segmentation without temporal information. IEEE Trans. Pattern Anal. Mach. Intell. 41, 1515–1530 (2019)
Khoreva, A., Benenson, R., Ilg, E., Brox, T., Schiele, B.: Lucid data dreaming for video object segmentation. Int. J. Comput. Vis. 127(9), 1175–1197 (2019)
Li, X., Loy, C.C.: Video object segmentation with joint re-identification and attention-aware mask propagation. CoRR abs/1803.04242 (2018) arXiv:1803.04242
Luiten, J., Voigtlaender, P., Leibe, B.: Premvos: Proposal-generation, refinement and merging for video object segmentation. CoRR abs/1807.09190 (2018) arXiv:1807.09190
Perazzi, F., Khoreva, A., Benenson, R., Schiele, B., Sorkine-Hornung, A.: Learning video object segmentation from static images. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3491–3500 (2017)
Yang, L., Wang, Y., Xiong, X., Yang, J., Katsaggelos, A.: Efficient video object segmentation via network modulation. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6499–6507 (2018)
Oh, S., Lee, J.-Y., Sunkavalli, K., Kim, S.: Fast video object segmentation by reference-guided mask propagation. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7376–7385 (2018)
Tsai, Y.-H., Yang, M.-H., Black, M.J.: Video segmentation via object flow. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3899–3908 (2016)
Hu, Y., Huang, J., Schwing, A.G.: Maskrnn: Instance level video object segmentation. CoRR abs/1803.11187 (2018) arXiv:1803.11187
Xiao, H., Feng, J., Lin, G., Liu, Y., Zhang, M.: Monet: Deep motion exploitation for video object segmentation. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1140–1148 (2018)
Liu, W., Lin, G., Zhang, T., Liu, Z.: Guided co-segmentation network for fast video object segmentation. IEEE Trans. Circuits Syst. Video Technol. 31(4), 1607–1617 (2021). https://doi.org/10.1109/TCSVT.2020.3010293
Chen, Y., Pont-Tuset, J., Montes, A., Gool, L.V.: Blazingly fast video object segmentation with pixel-wise metric learning. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1189–1198 (2018). https://doi.org/10.1109/CVPR.2018.00130
Hu, Y., Huang, J., Schwing, A.G.: Videomatch: matching based video object segmentation. CoRR abs/1809.01123 (2018) arXiv:1809.01123
Voigtlaender, P., Chai, Y., Schroff, F., Adam, H., Leibe, B., Chen, L.-C.: Feelvos: Fast end-to-end embedding learning for video object segmentation. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9473–9482 (2019)
Yang, Z., Wei, Y., Yang, Y.: Collaborative video object segmentation by foreground-background integration. CoRR abs/2003.08333 (2020) arXiv:2003.08333
Oh, S., Lee, J.-Y., Xu, N., Kim, S.: Video object segmentation using space-time memory networks. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9225–9234 (2019)
Li, Y., Shen, Z., Shan, Y.: Fast video object segmentation using the global context module. CoRR abs/2001.11243 (2020) arXiv:2001.11243
Seong, H., Hyun, J., Kim, E.: Kernelized memory network for video object segmentation. CoRR abs/2007.08270 (2020) arXiv:2007.08270
Lu, X., Wang, W., Danelljan, M., Zhou, T., Shen, J., Gool, L.V.: Video object segmentation with episodic graph memory networks. CoRR abs/2007.07020 (2020) arXiv:2007.07020
Bao, L., Wu, B., Liu, W.: Cnn in mrf: Video object segmentation via inference in a cnn-based higher-order spatio-temporal mrf. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5977–5986 (2018)
Cunningham, P., Delany, S.J.: K-nearest neighbour classifiers. ACM Comput. Surv. 54(6) (2021)
Zhang, Y., Wu, Z., Peng, H., Lin, S.: A transductive approach for video object segmentation. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6947–6956 (2020). https://doi.org/10.1109/CVPR42600.2020.00698
Park, H., Yoo, J., Jeong, S., Venkatesh, G., Kwak, N.: Learning dynamic network using a reuse gate function in semi-supervised video object segmentation. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8401–8410 (2021)
Xie, H., Yao, H., Zhou, S., Zhang, S., Sun, W.: Efficient regional memory network for video object segmentation. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)
Hu, L., Zhang, P., Zhang, B., Pan, P., Xu, Y., Jin, R.: Learning position and target consistency for memory-based video object segmentation. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4142–4152 (2021). https://doi.org/10.1109/CVPR46437.2021.00413
Liang, Y., Li, X., Jafari, N., Chen, J.: Video object segmentation with adaptive feature bank and uncertain-region refinement. Adv. Neural. Inf. Process. Syst. 33, 3430–3441 (2020)
Wang, H., Jiang, X., Ren, H., Hu, Y., Bai, S.: Swiftnet: Real-time video object segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1296–1305 (2021)
Child, R., Gray, S., Radford, A., Sutskever, I.: Generating long sequences with sparse transformers. CoRR abs/1904.10509 (2019) arXiv:1904.10509
Kitaev, N., Kaiser, Ł., Levskaya, A.: Reformer: the efficient transformer. In: International Conference on Learning Representations (ICLR) (2020)
Katharopoulos, A., Vyas, A., Pappas, N., Fleuret, F.: Transformers are RNNs: fast autoregressive transformers with linear attention. In: International Conference on Machine Learning (ICML) (2020)
Shen, Z., Zhang, M., Zhao, H., Yi, S., Li, H.: Efficient attention: attention with linear complexities. CoRR abs/1812.01243 (2020) arXiv:1812.01243
Li, R., Su, J., Duan, C., Zheng, S.: Linear attention mechanism: an efficient attention for semantic segmentation. CoRR abs/2007.14902 (2020) arXiv:2007.14902
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016). https://doi.org/10.1109/CVPR.2016.90
Perazzi, F., Pont-Tuset, J., McWilliams, B., Gool, L., Gross, M., Sorkine-Hornung, A.: A benchmark dataset and evaluation methodology for video object segmentation. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 724–732 (2016)
Pont-Tuset, J., Perazzi, F., Caelles, S., Arbelaez, P., Sorkine-Hornung, A., Gool, L.V.: The 2017 DAVIS challenge on video object segmentation. CoRR abs/1704.00675 (2017) arXiv:1704.00675
Xu, N., Yang, L., Fan, Y., Yue, D., Liang, Y., Yang, J., Huang, T.S.: Youtube-vos: A large-scale video object segmentation benchmark. CoRR abs/1809.03327 (2018) arXiv:1809.03327
Johnander, J., Danelljan, M., Brissman, E., Khan, F.S., Felsberg, M.: A generative appearance model for end-to-end video object segmentation. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8945–8954 (2019). https://doi.org/10.1109/CVPR.2019.00916
Wang, Z., Xu, J., Liu, L., Zhu, F., Shao, L.: Ranet: Ranking attention network for fast video object segmentation. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3977–3986 (2019). https://doi.org/10.1109/ICCV.2019.00408
Seong, H., Oh, S.W., Lee, J.-Y., Lee, S., Lee, S., Kim, E.: Hierarchical memory matching network for video object segmentation. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 12869–12878 (2021)
Cho, S., Lee, H., Kim, M., Jang, S., Lee, S.: Pixel-level bijective matching for video object segmentation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 129–138 (2022)
Acknowledgements
This work was partially supported by the National Natural Science Foundation of China (Grant Nos. 62072449, 61802197, 61632003), in part by the Science and Technology Development Fund, Macau SAR (Grant Nos. 0018/2019/AKP and SKL-IOTSC(UM)-2021-2023), in part by the Guangdong Science and Technology Department (Grant No. 2018B030324002), and in part by the Zhuhai Science and Technology Innovation Bureau Zhuhai-Hong Kong-Macau Special Cooperation Project (Grant No. ZH22017002200001PWC).
Ethics declarations
Conflict of interest
We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix A Equivalent derivation of the formula
In this appendix, we provide a mathematical derivation of the EMR discussed in Sect. 3.3. EMR is an efficient attention mechanism that reproduces the effect of dot-product attention while being substantially faster in terms of memory reading.
Given the memory \(M_{t}\) with the embedded key \(K_{M}=\left\{ K_{M}\left( j\right) \right\} \in R^{k\times C/8}\) and query key \(K_{Q}=\left\{ K_{Q}\left( i\right) \right\} \in R^{H\times W\times C/8}\), and using the softmax adopted in STM [23] as the normalization function, the memory read-out at query position i can be written as:
\[ y\left( i\right) =\sum _{j=1}^{k}\frac{\exp \left( K_{M}\left( j\right) K_{Q}\left( i\right) ^{T}\right) }{\sum _{n=1}^{k}\exp \left( K_{M}\left( n\right) K_{Q}\left( i\right) ^{T}\right) }\, V_{M}\left( j\right) \tag{A1} \]
where H, W, and C denote the height, width, and number of channels of the ResNet-50 [40] feature maps, respectively; k represents the number of pixel features in \(M_{t}\); and \(V_{M}=\left\{ V_{M}\left( j\right) \right\} \in R^{k\times C/2}\) denotes the embedded value of the memory. Eq. A1 can be generalized to an arbitrary normalization function, formulated as sim:
\[ y\left( i\right) =\frac{\sum _{j=1}^{k}{\text {sim}}\left( K_{M}\left( j\right) ,K_{Q}\left( i\right) \right) V_{M}\left( j\right) }{\sum _{j=1}^{k}{\text {sim}}\left( K_{M}\left( j\right) ,K_{Q}\left( i\right) \right) } \tag{A2} \]
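To make the quadratic cost of this read concrete, the following toy NumPy sketch evaluates Eqs. A1–A2 with sim set to the exponential; all array names and sizes here are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
N, k, d, c = 6, 5, 8, 4            # toy sizes: N = H*W query pixels, k memory pixels, key/value dims
K_Q = rng.standard_normal((N, d))  # query key, flattened from (H, W, C/8)
K_M = rng.standard_normal((k, d))  # memory key
V_M = rng.standard_normal((k, c))  # memory value

# Eq. A1: every query pixel attends to every memory pixel -> an (N, k) matrix
logits = K_Q @ K_M.T
weights = np.exp(logits - logits.max(axis=1, keepdims=True))  # numerically stable exponential
weights /= weights.sum(axis=1, keepdims=True)                 # softmax over the k memory locations
y = weights @ V_M                                             # (N, c) read-out; O(N*k) time and memory
```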
We then propose a linear attention mechanism using a first-order Taylor expansion of the exponential in Eq. A1:
\[ {\text {sim}}\left( K_{M}\left( j\right) ,K_{Q}\left( i\right) \right) \approx 1+K_{M}\left( j\right) K_{Q}\left( i\right) ^{T} \tag{A3} \]
However, the above approximation cannot guarantee non-negativity. Therefore, we normalize \(K_{M}\left( j\right) \) and \(K_{Q}\left( i\right) \) by the \(l_{2}\) norm to ensure \(l_{M}\left( j\right) l_{Q}\left( i\right) ^{T} \ge -1\), and write Eq. A3 as:
\[ {\text {sim}}\left( K_{M}\left( j\right) ,K_{Q}\left( i\right) \right) =1+l_{M}\left( j\right) l_{Q}\left( i\right) ^{T} \tag{A4} \]
where \(l_{M}\left( j\right) =\frac{K_{M}\left( j\right) }{\parallel K_{M}\left( j\right) \parallel _{2} } \) and \(l_{Q}\left( i\right) =\frac{K_{Q}\left( i\right) }{\parallel K_{Q}\left( i\right) \parallel _{2} } \). Then, Eq. A2 can be rewritten as:
\[ y\left( i\right) =\frac{\sum _{j=1}^{k}\left( 1+l_{M}\left( j\right) l_{Q}\left( i\right) ^{T}\right) V_{M}\left( j\right) }{\sum _{j=1}^{k}\left( 1+l_{M}\left( j\right) l_{Q}\left( i\right) ^{T}\right) } \tag{A5} \]
and simplified, by aggregating the memory-side sums once and reusing them for every query, as:
\[ y\left( i\right) =\frac{\sum _{j=1}^{k}V_{M}\left( j\right) +l_{Q}\left( i\right) \sum _{j=1}^{k}l_{M}\left( j\right) ^{T}V_{M}\left( j\right) }{k+l_{Q}\left( i\right) \sum _{j=1}^{k}l_{M}\left( j\right) ^{T}} \tag{A6} \]
The computational and memory complexity of Eq. A6 is \(O\left( N\right) \), which greatly improves efficiency while achieving the same effect as the memory reader of STM.
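As a sanity check on the derivation, the following NumPy sketch (again illustrative rather than the authors' code) evaluates the read-out directly via Eq. A5 and in the factored form of Eq. A6; the two results agree exactly, while the factored form never materializes the \(N\times k\) similarity matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
N, k, d, c = 6, 5, 8, 4
K_Q = rng.standard_normal((N, d))
K_M = rng.standard_normal((k, d))
V_M = rng.standard_normal((k, c))

# l2-normalize the keys so that 1 + l_M(j) l_Q(i)^T stays non-negative (Eq. A4)
l_Q = K_Q / np.linalg.norm(K_Q, axis=1, keepdims=True)
l_M = K_M / np.linalg.norm(K_M, axis=1, keepdims=True)

# Eq. A5: direct evaluation, O(N*k) -- builds the full similarity matrix
sim = 1.0 + l_Q @ l_M.T                                 # (N, k), entries in [0, 2]
y_direct = (sim @ V_M) / sim.sum(axis=1, keepdims=True)

# Eq. A6: factored evaluation, O(N + k) -- memory-side sums computed once, shared by all queries
sum_V = V_M.sum(axis=0)                                 # (c,)   = sum_j V_M(j)
kv = l_M.T @ V_M                                        # (d, c) = sum_j l_M(j)^T V_M(j)
denom = k + l_Q @ l_M.sum(axis=0)                       # (N,)   = sum_j (1 + l_M(j) l_Q(i)^T)
y_factored = (sum_V + l_Q @ kv) / denom[:, None]

assert np.allclose(y_direct, y_factored)                # algebraically identical read-outs
```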
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ji, C., Chen, Y., Yang, ZX. et al. Spatio-temporal compression for semi-supervised video object segmentation. Vis Comput 39, 4929–4942 (2023). https://doi.org/10.1007/s00371-022-02638-4
DOI: https://doi.org/10.1007/s00371-022-02638-4