Abstract
Automatic video object segmentation aims to identify the primary object in a video without human intervention. The task is challenging because it depends on effective feature fusion, that is, on integrating motion and appearance cues so that they reinforce each other. Previous approaches sample, propagate, and fuse these cues directly, but they often suffer from misalignment, mainly because motion features focus on moving objects while appearance features tend to focus on more salient objects. In this paper, we design a Multi-scale Deep Feature Transfer Model (MFTM) that raises the upper limit of feature synergy through mutual mapping transformations between features. We treat the fused features as participants in feature interaction: by integrating them, we encourage and constrain the appearance and motion features to become more compatible. We further adopt pairwise combinations to propagate interactions among motion cues, appearance cues, and fused features, which helps eliminate the noise interference caused by the different features and improves the feature representations. In addition, we design a Multi-layer Feature Fusion Module (MFM) that fuses features of different scales and levels, improving the robustness and accuracy of the model’s predictions. We evaluate our model on two popular benchmark datasets, DAVIS2016 and FBMS, where it reaches J-scores of 83.1 and 77.3, respectively. It also achieves strong scores on the \(E_{MAX}\), \(F_{MAX}\), and M metrics on FBMS. These results provide evidence for the effectiveness of our model.
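The J-scores quoted in the abstract are assumed here to be the standard DAVIS region-similarity measure, that is, the Jaccard index (intersection over union) between a predicted mask and the ground-truth mask, averaged over frames. The following is a minimal sketch of that computation on binary masks, not the authors' evaluation code; the function names `jaccard` and `mean_jaccard` and the toy masks are illustrative assumptions.

```python
import numpy as np


def jaccard(pred, gt):
    """Region similarity J for one frame: |pred AND gt| / |pred OR gt|."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both masks are empty: count as perfect agreement (a common convention).
        return 1.0
    inter = np.logical_and(pred, gt).sum()
    return float(inter) / float(union)


def mean_jaccard(preds, gts):
    """Average the per-frame J over a sequence of (prediction, ground truth) pairs."""
    return float(np.mean([jaccard(p, g) for p, g in zip(preds, gts)]))


# Toy example (hypothetical masks): a 2x2 object and a prediction one column too wide.
gt = np.zeros((4, 4), dtype=bool)
gt[1:3, 1:3] = True      # 4 foreground pixels
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:4] = True    # 6 foreground pixels, 4 of them overlapping gt

print(jaccard(pred, gt))                   # intersection 4, union 6
print(mean_jaccard([pred, gt], [gt, gt]))  # mean over two frames
```

Benchmark results are usually reported as this mean multiplied by 100, so under this assumption a J-score of 83.1 would correspond to an average overlap of about 0.831.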
Notes
To facilitate understanding, we follow Wang’s suggestion [8] and refer to Unsupervised Video Object Segmentation (UVOS) as Automatic Video Object Segmentation (AVOS), and to Semi-supervised Video Object Segmentation (SVOS) as Semi-automatic Video Object Segmentation (SVOS).
References
Chen X, Li Z, Yuan Y, Yu G, Shen J, Qi D (2020) State-aware tracker for real-time video object segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9384–9393
Huang X, Xu J, Tai Y.-W, Tang C.-K (2020) Fast video object segmentation with temporal aggregation network and dynamic template matching. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 8879–8889
Wang Y, Xu Z, Wang X, Shen C, Cheng B, Shen H, Xia H (2021) End-to-end video instance segmentation with transformers. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 8741–8750
Liu J, Dai H.-N, Zhao G, Li B, Zhang T (2022) TMVOS: triplet matching for efficient video object segmentation. Signal Process Image Commun 107
Maddern W, Pascoe G, Linegar C, Newman P (2017) 1 year, 1000 km: the Oxford RobotCar dataset. Int J Robot Res 36(1):3–15
Hadizadeh H, Bajić IV (2013) Saliency-aware video compression. IEEE Trans Image Process 23(1):19–33
Chen Y, Pont-Tuset J, Montes A, Van Gool L (2018) Blazingly fast video object segmentation with pixel-wise metric learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1189–1198
Zhou T, Porikli F, Crandall D.J, Van Gool L, Wang W (2022) A survey on deep learning technique for video segmentation. In: IEEE Transactions on pattern analysis and machine intelligence. IEEE, pp 1–20
Wang W, Shen J, Porikli F, Yang R (2018) Semi-supervised video object segmentation with super-trajectories. IEEE Trans Pattern Anal Mach Intell 41(4):985–998
Bhat G, Lawin F.J, Danelljan M, Robinson A, Felsberg M, Van Gool L, Timofte R (2020) Learning what to learn for video object segmentation. In: Computer vision–ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16. Springer, pp 777–794
Caelles S, Pont-Tuset J, Perazzi F, Montes A, Maninis K.-K, Van Gool L (2019) The 2019 DAVIS challenge on VOS: unsupervised multi-object segmentation. CoRR abs/1905.00737
Lan M, Zhang Y, Xu Q, Zhang L (2020) E3sn: efficient end-to-end siamese network for video object segmentation. In: IJCAI, pp 701–707
Li Y, Shen Z, Shan Y (2020) Fast video object segmentation using the global context module. In: Computer vision–ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, proceedings, Part X 16. Springer, pp 735–750
Robinson A, Lawin F.J, Danelljan M, Khan F.S, Felsberg M (2020) Learning fast and robust target models for video object segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 7406–7415
Seong H, Hyun J, Kim E (2020) Kernelized memory network for video object segmentation. In: Computer vision–ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, proceedings, Part XXII 16. Springer, pp 629–645
Xu N, Yang L, Fan Y, Yang J, Yue D, Liang Y, Price B, Cohen S, Huang T (2018) YouTube-VOS: sequence-to-sequence video object segmentation. In: Proceedings of the European conference on computer vision (ECCV), pp 585–601
Yang L, Fan Y, Xu N (2019) Video instance segmentation. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 5188–5197
Zhang K, Wang L, Liu D, Liu B, Liu Q, Li Z (2020) Dual temporal memory network for efficient video object segmentation. In: Proceedings of the 28th ACM international conference on multimedia, pp 1515–1523
Zhang Y, Wu Z, Peng H, Lin S (2020) A transductive approach for video object segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 6949–6958
Mahadevan S, Athar A, Ošep A, Hennen S, Leal-Taixé L, Leibe B (2020) Making a case for 3d convolutions for object segmentation in videos. CoRR abs/2008.11516
Lu X, Wang W, Ma C, Shen J, Shao L, Porikli F (2019) See more, know more: unsupervised video object segmentation with co-attention siamese networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3623–3632
Tokmakov P, Alahari K, Schmid C (2017) Learning video object segmentation with visual memory. In: Proceedings of the IEEE international conference on computer vision, pp 4481–4490
Tokmakov P, Schmid C, Alahari K (2019) Learning to segment moving objects. In: International journal of computer vision. Springer, pp 282–301
Yang Z, Wang Q, Bertinetto L, Hu W, Bai S, Torr P.H (2019) Anchor diffusion for unsupervised video object segmentation. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 931–940
Li G, Xie Y, Lin L, Yu Y (2017) Instance-level salient object segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2386–2395
Hou Q, Cheng M.-M, Hu X, Borji A, Tu Z, Torr P.H (2017) Deeply supervised salient object detection with short connections. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3203–3212
Li G, Yu Y (2016) Deep contrast learning for salient object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 478–487
Wang W, Shen J (2017) Deep visual attention prediction. In: IEEE Transactions on image processing. IEEE, pp 2368–2378
Li H, Chen G, Li G, Yu Y (2019) Motion guided attention for video salient object detection. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 7274–7283
Tokmakov P, Alahari K, Schmid C (2017) Learning motion patterns in videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3386–3394
Perazzi F, Khoreva A, Benenson R, Schiele B, Sorkine-Hornung A (2017) Learning video object segmentation from static images. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2663–2672
Dutt Jain S, Xiong B, Grauman K (2017) Fusionseg: learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3664–3673
Cheng J, Tsai Y.-H, Wang S, Yang M.-H (2017) Segflow: joint learning for video object segmentation and optical flow. In: Proceedings of the IEEE international conference on computer vision, pp 686–695
Li S, Seybold B, Vorobyov A, Lei X, Kuo C.-C.J (2018) Unsupervised video object segmentation with motion-based bilateral networks. In: Proceedings of the European conference on computer vision (ECCV), pp 207–223
Zhou T, Li J, Wang S, Tao R, Shen J (2020) Matnet: motion-attentive transition network for zero-shot video object segmentation. In: IEEE Transactions on image processing, pp 8326–8338
Ji G.-P, Fu K, Wu Z, Fan D.-P, Shen J, Shao L (2021) Full-duplex strategy for video object segmentation. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 4922–4933
Perazzi F, Pont-Tuset J, McWilliams B, Van Gool L, Gross M, Sorkine-Hornung A (2016) A benchmark dataset and evaluation methodology for video object segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 724–732
Ochs P, Malik J, Brox T (2013) Segmentation of moving objects by long term video analysis. In: IEEE Transactions on pattern analysis and machine intelligence, pp 1187–1200
Tsai Y.-H, Yang M.-H, Black M.J (2016) Video segmentation via object flow. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3899–3908
Xu Y.-S, Fu T.-J, Yang H.-K, Lee C.-Y (2018) Dynamic video segmentation network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6556–6565
Wang J, Chen D, Wu Z, Luo C, Tang C, Dai X, Zhao Y, Xie Y, Yuan L, Jiang Y.-G (2023) Look before you match: instance understanding matters in video object segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 2268–2278
Cheng H.K, Schwing A.G (2022) Xmem: long-term video object segmentation with an Atkinson–Shiffrin memory model. In: European conference on computer vision. Springer, pp 640–658
Hu Y.-T, Huang J.-B, Schwing A.G (2018) Unsupervised video object segmentation using motion saliency-guided spatio-temporal propagation. In: Proceedings of the European conference on computer vision (ECCV), pp 786–802
Wang W, Shen J, Li X, Porikli F (2015) Robust video object cosegmentation. IEEE Trans Image Process 24:3137–3148
Wang W, Shen J, Porikli F (2015) Saliency-aware geodesic video object segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3395–3402
Faktor A, Irani M (2014) Video segmentation by non-local consensus voting. In: BMVC, p 8
Lee Y.J, Kim J, Grauman K (2011) Key-segments for video object segmentation. In: 2011 International conference on computer vision. IEEE, pp 1995–2002
Li F, Kim T, Humayun A, Tsai D, Rehg J.M (2013) Video segmentation by tracking many figure-ground segments. In: Proceedings of the IEEE international conference on computer vision, pp 2192–2199
Robinson A, Lawin F.J, Danelljan M, Khan F.S, Felsberg M (2020) Learning fast and robust target models for video object segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 7406–7415
Ballas N, Yao L, Pal C, Courville A (2016) Delving deeper into convolutional networks for learning video representations
Song H, Wang W, Zhao S, Shen J, Lam K.-M (2018) Pyramid dilated deeper ConvLSTM for video salient object detection. In: Proceedings of the European conference on computer vision (ECCV), pp 715–731
Wang W, Song H, Zhao S, Shen J, Zhao S, Hoi S.C, Ling H (2019) Learning unsupervised video object segmentation through visual attention. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3064–3074
Xu M, Liu B, Fu P, Li J, Hu YH, Feng S (2019) Video salient object detection via robust seeds extraction and multi-graphs manifold propagation. IEEE Trans Circuits Syst Video Technol 30:2191–2206
Zheng J, Luo W, Piao Z (2019) Cascaded ConvLSTMs using semantically-coherent data synthesis for video object segmentation. In: IEEE access, pp 132120–132129
Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Adv Neural Inf Process Syst 27
Wang W, Lu X, Shen J, Crandall D.J, Shao L (2019) Zero-shot video object segmentation via attentive graph neural networks. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 9236–9245
Galasso F, Cipolla R, Schiele B (2013) Video segmentation with superpixels. In: Computer vision–ACCV 2012: 11th Asian conference on computer vision, Daejeon, Korea, November 5–9, 2012, Revised Selected Papers, Part I 11. Springer, pp 760–774
Grundmann M, Kwatra V, Han M, Essa I (2010) Efficient hierarchical graph-based video segmentation. In: 2010 IEEE Computer society conference on computer vision and pattern recognition. IEEE, pp 2141–2148
Xu C, Xiong C, Corso J.J (2012) Streaming hierarchical video segmentation. In: Computer vision–ECCV 2012: 12th European conference on computer vision, Florence, Italy, October 7–13, 2012, Proceedings, Part VI 12. Springer, pp 626–639
Li X, Loy C.C (2018) Video object segmentation with joint re-identification and attention-aware mask propagation. In: Proceedings of the European conference on computer vision (ECCV), pp 90–105
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3431–3440
Zhao H, Shi J, Qi X, Wang X, Jia J (2017) Pyramid scene parsing network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2881–2890
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L et al (2019) Pytorch: An imperative style, high-performance deep learning library. Adv Neural Inf Process Syst 32
Wang L, Lu H, Wang Y, Feng M, Wang D, Yin B, Ruan X (2017) Learning to detect salient objects with image-level supervision. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 136–145
Krähenbühl P, Koltun V (2011) Efficient inference in fully connected CRFs with Gaussian edge potentials. Adv Neural Inf Process Syst 24
Papazoglou A, Ferrari V (2013) Fast object segmentation in unconstrained video. In: Proceedings of the IEEE international conference on computer vision, pp 1777–1784
Lao D, Sundaramoorthi G (2018) Extending layered models to 3d motion. In: Proceedings of the European conference on computer vision (ECCV), pp 435–451
Tokmakov P, Alahari K, Schmid C (2017) Learning video object segmentation with visual memory. In: Proceedings of the IEEE international conference on computer vision, pp 4481–4490
Koh Y.J, Kim C.-S (2017) Primary object segmentation in videos based on region augmentation and reduction. In: 2017 IEEE Conference on computer vision and pattern recognition (CVPR). IEEE, pp 7417–7425
Siam M, Jiang C, Lu S, Petrich L, Gamal M, Elhoseiny M, Jagersand M (2019) Video object segmentation using teacher-student adaptation in a human robot interaction (HRI) setting. In: 2019 International conference on robotics and automation (ICRA). IEEE, pp 50–56
Akhter I, Ali M, Faisal M, Hartley R (2020) Epo-net: exploiting geometric constraints on dense trajectories for motion saliency. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 1884–1893
Chen Y.-W, Jin X, Shen X, Yang M.-H (2022) Video salient object detection via contrastive features and attention modules. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 1320–1329
Lee M, Cho S, Lee S, Park C, Lee S (2023) Unsupervised video object segmentation via prototype memory network. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 5924–5934
Fan D.-P, Cheng M.-M, Liu Y, Li T, Borji A (2017) Structure-measure: a new way to evaluate foreground maps. In: Proceedings of the IEEE international conference on computer vision, pp 4548–4557
Fan D.-P, Ji G.-P, Qin X.-B, Cheng M.-M (2021) Cognitive vision inspired object segmentation metric and loss function. Sci Sin Inf 51(9):1475
Achanta R, Hemami S, Estrada F, Susstrunk S (2009) Frequency-tuned salient region detection. In: 2009 IEEE Conference on computer vision and pattern recognition. IEEE, pp 1597–1604
Perazzi F, Krähenbühl P, Pritch Y, Hornung A (2012) Saliency filters: contrast based filtering for salient region detection. In: 2012 IEEE Conference on computer vision and pattern recognition. IEEE, pp 733–740
Ding M, Wang Z, Zhou B, Shi J, Lu Z, Luo P (2020) Every frame counts: joint learning of video segmentation and optical flow. In: Proceedings of the AAAI conference on artificial intelligence, pp 10713–10720
Xu M, Liu B, Fu P, Li J, Hu YH (2019) Video saliency detection via graph clustering with motion energy and spatiotemporal objectness. IEEE Trans Multimed 21:2790–2805
Tang Y, Zou W, Jin Z, Chen Y, Hua Y, Li X (2018) Weakly supervised salient object detection with spatiotemporal cascade neural networks. IEEE Trans Circuits Syst Video Technol 29:1973–1984
Li Y, Li S, Chen C, Hao A, Qin H (2019) Accurate and robust video saliency detection via self-paced diffusion. In: IEEE Transactions on multimedia, pp 1153–1167
Wang W, Shen J, Shao L (2017) Video salient object detection via fully convolutional networks. In: IEEE Transactions on image processing, pp 38–49
Chen Y, Zou W, Tang Y, Li X, Xu C, Komodakis N (2018) SCOM: spatiotemporal constrained optimization for salient object detection. In: IEEE Transactions on image processing, pp 3345–3357
Li G, Xie Y, Wei T, Wang K, Lin L (2018) Flow guided recurrent neural encoder for video salient object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3243–3252
Chen C, Wang G, Peng C, Zhang X, Qin H (2019) Improved robust video saliency detection based on long-term spatial-temporal information. In: IEEE Transactions on image processing, pp 1090–1100
Yan P, Li G, Xie Y, Li Z, Wang C, Chen T, Lin L (2019) Semi-supervised video salient object detection using pseudo-labels. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 7284–7293
Fan D.-P, Wang W, Cheng M.-M, Shen J (2019) Shifting more attention to video salient object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 8554–8564
Gu Y, Wang L, Wang Z, Liu Y, Cheng M.-M, Lu S.-P (2020) Pyramid constrained self-attention network for fast video salient object detection. In: Proceedings of the AAAI conference on artificial intelligence, pp 10869–10876
Shi X, Chen Z, Wang H, Yeung D.-Y, Wong W.-K, Woo W.-C (2015) Convolutional LSTM network: a machine learning approach for precipitation nowcasting. Adv Neural Inf Process Syst 28
Funding
This work is supported by The Natural Science Foundation of Hebei Province (F2019201451).
Ethics declarations
Conflicts of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
About this article
Cite this article
Yang, Z., Shi, Q. & Fang, Y. Multi-scale Deep Feature Transfer for Automatic Video Object Segmentation. Neural Process Lett 55, 11701–11719 (2023). https://doi.org/10.1007/s11063-023-11395-x
DOI: https://doi.org/10.1007/s11063-023-11395-x