
Multi-scale Deep Feature Transfer for Automatic Video Object Segmentation

Neural Processing Letters

Abstract

Automatic video object segmentation aims to identify the main object in a video without human intervention. The task is challenging because it requires effective fusion of motion and appearance cues. Previous approaches sample, propagate, and fuse these cues directly, but they often suffer from misalignment: motion features focus on objects that are in motion, while appearance features tend to focus on more salient objects. In this paper, we design a Multi-scale Deep Feature Transfer Model (MFTM) that raises the upper limit of feature synergy through mutual mapping transformations between features. We treat the fused features as participants in feature interaction, using them to encourage and constrain the appearance and motion features so that the two become more compatible. We further adopt pairwise combinations to propagate interactions among motion cues, appearance cues, and fused features, which helps eliminate the noise interference caused by the different features and improves the feature representations. In addition, we design a Multi-layer Feature Fusion Module (MFM) that fuses features of different scales and levels, improving the robustness and accuracy of the model’s predictions. We evaluate our model on two popular benchmark datasets, DAVIS2016 and FBMS. It reaches a J score of 83.1 on DAVIS2016 and 77.3 on FBMS, and it also achieves strong scores on the \(E_{MAX}\), \(F_{MAX}\), and M metrics on FBMS. These results provide evidence for the effectiveness of our model.
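
The abstract reports J scores without defining them. On the DAVIS benchmark [37], J is the region similarity, that is, the intersection-over-union between a predicted mask and the ground-truth mask, typically averaged over frames and reported as a percentage. The sketch below illustrates that definition on a toy mask pair. It also assumes that M denotes the mean absolute error between a predicted map and the ground truth, which is common in salient object detection but is not stated in the abstract. The function names are hypothetical, and this is not the authors' evaluation code.

```python
import numpy as np


def jaccard(pred_mask, gt_mask):
    """Region similarity J: intersection-over-union of two binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)


def mean_absolute_error(pred_map, gt_mask):
    """Assumed meaning of M: mean per-pixel absolute difference in [0, 1]."""
    pred = np.asarray(pred_map, dtype=np.float64)
    gt = np.asarray(gt_mask, dtype=np.float64)
    return float(np.abs(pred - gt).mean())


# Toy 2x3 example: 2 pixels overlap, 4 pixels are in the union.
pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
gt = np.array([[1, 0, 0],
               [0, 1, 1]])

j = jaccard(pred, gt)              # 2 / 4
m = mean_absolute_error(pred, gt)  # 2 mismatched pixels out of 6
print(f"J = {100 * j:.1f}, M = {m:.3f}")
```

A benchmark score would average the per-frame J values over each sequence and then over the dataset; the exact averaging protocol is defined by the benchmark, not by this sketch.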


Notes

  1. To facilitate understanding, we refer to Unsupervised Video Object Segmentation (UVOS) as Automatic Video Object Segmentation (AVOS) and to Semi-supervised Video Object Segmentation (SVOS) as Semi-automatic Video Object Segmentation (SVOS), following Wang’s suggestion [8].

References

  1. Chen X, Li Z, Yuan Y, Yu G, Shen J, Qi D (2020) State-aware tracker for real-time video object segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9384–9393

  2. Huang X, Xu J, Tai Y.-W, Tang C.-K (2020) Fast video object segmentation with temporal aggregation network and dynamic template matching. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 8879–8889

  3. Wang Y, Xu Z, Wang X, Shen C, Cheng B, Shen H, Xia H (2021) End-to-end video instance segmentation with transformers. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 8741–8750

  4. Liu J, Dai H.-N, Zhao G, Li B, Zhang T (2022) TMVOS: triplet matching for efficient video object segmentation. Signal Process Image Commun 107

  5. Maddern W, Pascoe G, Linegar C, Newman P (2017) 1 year, 1000 km: The oxford robotcar dataset. Int J Robot Res 36(1):3–15

  6. Hadizadeh H, Bajić IV (2013) Saliency-aware video compression. IEEE Trans Image Process 23(1):19–33

  7. Chen Y, Pont-Tuset J, Montes A, Van Gool L (2018) Blazingly fast video object segmentation with pixel-wise metric learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1189–1198

  8. Zhou T, Porikli F, Crandall D.J, Van Gool L, Wang W (2022) A survey on deep learning technique for video segmentation. In: IEEE Transactions on pattern analysis and machine intelligence. IEEE, pp 1–20

  9. Wang W, Shen J, Porikli F, Yang R (2018) Semi-supervised video object segmentation with super-trajectories. IEEE Trans Pattern Anal Mach Intell 41(4):985–998

  10. Bhat G, Lawin F.J, Danelljan M, Robinson A, Felsberg M, Van Gool L, Timofte R (2020) Learning what to learn for video object segmentation. In: Computer vision–ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16. Springer, pp 777–794

  11. Caelles S, Pont-Tuset J, Perazzi F, Montes A, Maninis K.-K, Van Gool L (2019) The 2019 davis challenge on vos: Unsupervised multi-object segmentation. CoRR abs/1905.00737

  12. Lan M, Zhang Y, Xu Q, Zhang L (2020) E3sn: efficient end-to-end siamese network for video object segmentation. In: IJCAI, pp 701–707

  13. Li Y, Shen Z, Shan Y (2020) Fast video object segmentation using the global context module. In: Computer vision–ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, proceedings, Part X 16. Springer, pp 735–750

  14. Robinson A, Lawin F.J, Danelljan M, Khan F.S, Felsberg M (2020) Learning fast and robust target models for video object segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 7406–7415

  15. Seong H, Hyun J, Kim E (2020) Kernelized memory network for video object segmentation. In: Computer vision–ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, proceedings, Part XXII 16. Springer, pp 629–645

  16. Xu N, Yang L, Fan Y, Yang J, Yue D, Liang Y, Price B, Cohen S, Huang T (2018) Youtube-vos: sequence-to-sequence video object segmentation. In: Proceedings of the European conference on computer vision (ECCV), pp 585–601

  17. Yang L, Fan Y, Xu N (2019) Video instance segmentation. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 5188–5197

  18. Zhang K, Wang L, Liu D, Liu B, Liu Q, Li Z (2020) Dual temporal memory network for efficient video object segmentation. In: Proceedings of the 28th ACM international conference on multimedia, pp 1515–1523

  19. Zhang Y, Wu Z, Peng H, Lin S (2020) A transductive approach for video object segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 6949–6958

  20. Mahadevan S, Athar A, Ošep A, Hennen S, Leal-Taixé L, Leibe B (2020) Making a case for 3d convolutions for object segmentation in videos. CoRR abs/2008.11516

  21. Lu X, Wang W, Ma C, Shen J, Shao L, Porikli F (2019) See more, know more: unsupervised video object segmentation with co-attention siamese networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3623–3632

  22. Tokmakov P, Alahari K, Schmid C (2017) Learning video object segmentation with visual memory. In: Proceedings of the IEEE international conference on computer vision, pp 4481–4490

  23. Tokmakov P, Schmid C, Alahari K (2019) Learning to segment moving objects. In: International journal of computer vision. Springer, pp 282–301

  24. Yang Z, Wang Q, Bertinetto L, Hu W, Bai S, Torr P.H (2019) Anchor diffusion for unsupervised video object segmentation. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 931–940

  25. Li G, Xie Y, Lin L, Yu Y (2017) Instance-level salient object segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2386–2395

  26. Hou Q, Cheng M.-M, Hu X, Borji A, Tu Z, Torr P.H (2017) Deeply supervised salient object detection with short connections. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3203–3212

  27. Li G, Yu Y (2016) Deep contrast learning for salient object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 478–487

  28. Wang W, Shen J (2017) Deep visual attention prediction. In: IEEE Transactions on image processing. IEEE, pp 2368–2378

  29. Li H, Chen G, Li G, Yu Y (2019) Motion guided attention for video salient object detection. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 7274–7283

  30. Tokmakov P, Alahari K, Schmid C (2017) Learning motion patterns in videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3386–3394

  31. Perazzi F, Khoreva A, Benenson R, Schiele B, Sorkine-Hornung A (2017) Learning video object segmentation from static images. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2663–2672

  32. Dutt Jain S, Xiong B, Grauman K (2017) Fusionseg: learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3664–3673

  33. Cheng J, Tsai Y.-H, Wang S, Yang M.-H (2017) Segflow: joint learning for video object segmentation and optical flow. In: Proceedings of the IEEE international conference on computer vision, pp 686–695

  34. Li S, Seybold B, Vorobyov A, Lei X, Kuo C.-C.J (2018) Unsupervised video object segmentation with motion-based bilateral networks. In: Proceedings of the European conference on computer vision (ECCV), pp 207–223

  35. Zhou T, Li J, Wang S, Tao R, Shen J (2020) Matnet: motion-attentive transition network for zero-shot video object segmentation. In: IEEE Transactions on image processing, pp 8326–8338

  36. Ji G.-P, Fu K, Wu Z, Fan D.-P, Shen J, Shao L (2021) Full-duplex strategy for video object segmentation. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 4922–4933

  37. Perazzi F, Pont-Tuset J, McWilliams B, Van Gool L, Gross M, Sorkine-Hornung A (2016) A benchmark dataset and evaluation methodology for video object segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 724–732

  38. Ochs P, Malik J, Brox T (2013) Segmentation of moving objects by long term video analysis. In: IEEE Transaction on pattern analysis and machine intelligence, pp 1187–1200

  39. Tsai Y.-H, Yang M.-H, Black M.J (2016) Video segmentation via object flow. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3899–3908

  40. Xu Y.-S, Fu T.-J, Yang H.-K, Lee C.-Y (2018) Dynamic video segmentation network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6556–6565

  41. Wang J, Chen D, Wu Z, Luo C, Tang C, Dai X, Zhao Y, Xie Y, Yuan L, Jiang Y.-G (2023) Look before you match: instance understanding matters in video object segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 2268–2278

  42. Cheng H.K, Schwing A.G (2022) Xmem: long-term video object segmentation with an Atkinson–Shiffrin memory model. In: European conference on computer vision. Springer, pp 640–658

  43. Hu Y.-T, Huang J.-B, Schwing A.G (2018) Unsupervised video object segmentation using motion saliency-guided spatio-temporal propagation. In: Proceedings of the European conference on computer vision (ECCV), pp 786–802

  44. Wang W, Shen J, Li X, Porikli F (2015) Robust video object cosegmentation. IEEE Trans Image Process 24:3137–3148

  45. Wang W, Shen J, Porikli F (2015) Saliency-aware geodesic video object segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3395–3402

  46. Faktor A, Irani M (2014) Video segmentation by non-local consensus voting. In: BMVC, p 8

  47. Lee Y.J, Kim J, Grauman K (2011) Key-segments for video object segmentation. In: 2011 International conference on computer vision. IEEE, pp 1995–2002

  48. Li F, Kim T, Humayun A, Tsai D, Rehg J.M (2013) Video segmentation by tracking many figure-ground segments. In: Proceedings of the IEEE international conference on computer vision, pp 2192–2199

  49. Robinson A, Lawin F.J, Danelljan M, Khan F.S, Felsberg M (2020) Learning fast and robust target models for video object segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 7406–7415

  50. Ballas N, Yao L, Pal C, Courville A (2016) Delving deeper into convolutional networks for learning video representations

  51. Song H, Wang W, Zhao S, Shen J, Lam K.-M (2018) Pyramid dilated deeper ConvLSTM for video salient object detection. In: Proceedings of the European conference on computer vision (ECCV), pp 715–731

  52. Wang W, Song H, Zhao S, Shen J, Zhao S, Hoi S.C, Ling H (2019) Learning unsupervised video object segmentation through visual attention. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3064–3074

  53. Xu M, Liu B, Fu P, Li J, Hu YH, Feng S (2019) Video salient object detection via robust seeds extraction and multi-graphs manifold propagation. IEEE Trans Circuits Syst Video Technol 30:2191–2206

  54. Zheng J, Luo W, Piao Z (2019) Cascaded ConvLSTMs using semantically-coherent data synthesis for video object segmentation. In: IEEE access, pp 132120–132129

  55. Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Adv Neural Inf Process Syst 27

  56. Wang W, Lu X, Shen J, Crandall D.J, Shao L (2019) Zero-shot video object segmentation via attentive graph neural networks. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 9236–9245

  57. Galasso F, Cipolla R, Schiele B (2013) Video segmentation with superpixels. In: Computer vision–ACCV 2012: 11th Asian conference on computer vision, Daejeon, Korea, November 5–9, 2012, Revised Selected Papers, Part I 11. Springer, pp 760–774

  58. Grundmann M, Kwatra V, Han M, Essa I (2010) Efficient hierarchical graph-based video segmentation. In: 2010 IEEE Computer society conference on computer vision and pattern recognition. IEEE, pp 2141–2148

  59. Xu C, Xiong C, Corso J.J (2012) Streaming hierarchical video segmentation. In: Computer vision–ECCV 2012: 12th European conference on computer vision, Florence, Italy, October 7–13, 2012, Proceedings, Part VI 12. Springer, pp 626–639

  60. Li X, Loy C.C (2018) Video object segmentation with joint re-identification and attention-aware mask propagation. In: Proceedings of the European conference on computer vision (ECCV), pp 90–105

  61. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778

  62. Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3431–3440

  63. Zhao H, Shi J, Qi X, Wang X, Jia J (2017) Pyramid scene parsing network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2881–2890

  64. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L et al (2019) Pytorch: An imperative style, high-performance deep learning library. Adv Neural Inf Process Syst 32

  65. Wang L, Lu H, Wang Y, Feng M, Wang D, Yin B, Ruan X (2017) Learning to detect salient objects with image-level supervision. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 136–145

  66. Krähenbühl P, Koltun V (2011) Efficient inference in fully connected CRFS with gaussian edge potentials. Adv Neural Inf Process Syst 24

  67. Papazoglou A, Ferrari V (2013) Fast object segmentation in unconstrained video. In: Proceedings of the IEEE international conference on computer vision, pp 1777–1784

  68. Lao D, Sundaramoorthi G (2018) Extending layered models to 3d motion. In: Proceedings of the European conference on computer vision (ECCV), pp 435–451

  69. Tokmakov P, Alahari K, Schmid C (2017) Learning video object segmentation with visual memory. In: Proceedings of the IEEE international conference on computer vision, pp 4481–4490

  70. Koh Y.J, Kim C.-S (2017) Primary object segmentation in videos based on region augmentation and reduction. In: 2017 IEEE Conference on computer vision and pattern recognition (CVPR). IEEE, pp 7417–7425

  71. Siam M, Jiang C, Lu S, Petrich L, Gamal M, Elhoseiny M, Jagersand M (2019) Video object segmentation using teacher-student adaptation in a human robot interaction (HRI) setting. In: 2019 International conference on robotics and automation (ICRA). IEEE, pp 50–56

  72. Akhter I, Ali M, Faisal M, Hartley R (2020) Epo-net: exploiting geometric constraints on dense trajectories for motion saliency. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 1884–1893

  73. Chen Y.-W, Jin X, Shen X, Yang M.-H (2022) Video salient object detection via contrastive features and attention modules. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 1320–1329

  74. Lee M, Cho S, Lee S, Park C, Lee S (2023) Unsupervised video object segmentation via prototype memory network. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 5924–5934

  75. Fan D.-P, Cheng M.-M, Liu Y, Li T, Borji A (2017) Structure-measure: a new way to evaluate foreground maps. In: Proceedings of the IEEE international conference on computer vision, pp 4548–4557

  76. Fan D.-P, Ji G.-P, Qin X.-B, Cheng M.-M (2021) Cognitive vision inspired object segmentation metric and loss function. Sci Sin Inf 51(9):1475

  77. Achanta R, Hemami S, Estrada F, Susstrunk S (2009) Frequency-tuned salient region detection. In: 2009 IEEE Conference on computer vision and pattern recognition. IEEE, pp 1597–1604

  78. Perazzi F, Krähenbühl P, Pritch Y, Hornung A (2012) Saliency filters: contrast based filtering for salient region detection. In: 2012 IEEE Conference on computer vision and pattern recognition. IEEE, pp 733–740

  79. Ding M, Wang Z, Zhou B, Shi J, Lu Z, Luo P (2020) Every frame counts: joint learning of video segmentation and optical flow. In: Proceedings of the AAAI conference on artificial intelligence, pp 10713–10720

  80. Xu M, Liu B, Fu P, Li J, Hu YH (2019) Video saliency detection via graph clustering with motion energy and spatiotemporal objectness. IEEE Trans Multimed 21:2790–2805

  81. Tang Y, Zou W, Jin Z, Chen Y, Hua Y, Li X (2018) Weakly supervised salient object detection with spatiotemporal cascade neural networks. IEEE Trans Circuits Syst Video Technol 29:1973–1984

  82. Li Y, Li S, Chen C, Hao A, Qin H (2019) Accurate and robust video saliency detection via self-paced diffusion. In: IEEE Transactions on multimedia, pp 1153–1167

  83. Wang W, Shen J, Shao L (2017) Video salient object detection via fully convolutional networks. In: IEEE Transactions on image processing, pp 38–49

  84. Chen Y, Zou W, Tang Y, Li X, Xu C, Komodakis N (2018) SCOM: spatiotemporal constrained optimization for salient object detection. In: IEEE Transactions on image processing, pp 3345–3357

  85. Li G, Xie Y, Wei T, Wang K, Lin L (2018) Flow guided recurrent neural encoder for video salient object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3243–3252

  86. Chen C, Wang G, Peng C, Zhang X, Qin H (2019) Improved robust video saliency detection based on long-term spatial-temporal information. In: IEEE Transactions on image processing, pp 1090–1100

  87. Yan P, Li G, Xie Y, Li Z, Wang C, Chen T, Lin L (2019) Semi-supervised video salient object detection using pseudo-labels. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 7284–7293

  88. Fan D.-P, Wang W, Cheng M.-M, Shen J (2019) Shifting more attention to video salient object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 8554–8564

  89. Gu Y, Wang L, Wang Z, Liu Y, Cheng M.-M, Lu S.-P (2020) Pyramid constrained self-attention network for fast video salient object detection. In: Proceedings of the AAAI conference on artificial intelligence, pp 10869–10876

  90. Shi X, Chen Z, Wang H, Yeung D.-Y, Wong W.-K, Woo W.-C (2015) Convolutional LSTM network: a machine learning approach for precipitation nowcasting. Adv Neural Inf Process Syst 28

Funding

This work is supported by the Natural Science Foundation of Hebei Province (F2019201451).

Author information

Corresponding author

Correspondence to Qingxuan Shi.

Ethics declarations

Conflicts of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yang, Z., Shi, Q. & Fang, Y. Multi-scale Deep Feature Transfer for Automatic Video Object Segmentation. Neural Process Lett 55, 11701–11719 (2023). https://doi.org/10.1007/s11063-023-11395-x

  • DOI: https://doi.org/10.1007/s11063-023-11395-x
