DOI: 10.1145/3503161.3548039
research-article

Bidirectionally Learning Dense Spatio-temporal Feature Propagation Network for Unsupervised Video Object Segmentation

Published: 10 October 2022

ABSTRACT

Spatio-temporal feature representation is essential for accurate unsupervised video object segmentation, which requires an effective feature propagation paradigm for both appearance and motion features that can fully interchange information across frames. However, existing solutions mainly focus on forward feature propagation from the preceding frame to the current one, using either the former segmentation mask or motion propagation in a frame-by-frame manner. This ignores the bidirectional temporal feature interactions (including backward propagation from future frames to the current frame) across all frames, which can help to enhance the spatio-temporal feature representation for segmentation prediction. To this end, this paper presents a novel Dense Bidirectional Spatio-temporal feature propagation Network (DBSNet) to fully integrate the forward and backward propagations across all frames. Specifically, a dense bi-ConvLSTM module is first developed to propagate the features across all frames in both the forward and backward directions. This can fully capture the multi-level spatio-temporal contextual information across all frames, producing an effective feature representation with a strong discriminative capability to distinguish objects from noisy backgrounds. Next, a spatio-temporal Transformer refinement module is designed to further enhance the propagated features by effectively capturing the spatio-temporal long-range dependencies among all frames. Afterwards, a Co-operative Direction-aware Graph Attention (Co-DGA) module is designed to integrate the propagated appearance-motion cues, yielding a strong spatio-temporal feature representation for segmentation mask prediction. The Co-DGA assigns proper attentional weights to neighboring points along the coordinate axis, enabling the segmentation model to focus selectively on the most relevant neighbors.
Extensive evaluations on four mainstream challenging benchmarks including DAVIS16, FBMS, DAVSOD, and MCL demonstrate that the proposed DBSNet achieves favorable performance against state-of-the-art methods in terms of all evaluation metrics.
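
To make the forward-plus-backward idea in the abstract concrete, the following is a minimal sketch of bidirectional feature propagation over a frame sequence. It is not the DBSNet implementation: the dense bi-ConvLSTM is replaced here by a simple recurrent blend, and the names `propagate`, `bidirectional_propagate`, and the parameter `alpha` are hypothetical, introduced only for illustration. Each frame is represented by a toy feature vector rather than a convolutional feature map.

```python
from typing import List

Frame = List[float]  # toy per-frame feature vector (stand-in for a feature map)


def propagate(frames: List[Frame], alpha: float = 0.5) -> List[Frame]:
    """Run a simple recurrent blend over the frames in the order given.

    h_t = alpha * x_t + (1 - alpha) * h_{t-1}, starting from a zero state.
    Assumption: this blend stands in for a ConvLSTM cell; it is not the
    paper's module.
    """
    if not frames:
        return []
    hidden = [0.0] * len(frames[0])
    states = []
    for x in frames:
        hidden = [alpha * xi + (1.0 - alpha) * hi for xi, hi in zip(x, hidden)]
        states.append(hidden)
    return states


def bidirectional_propagate(frames: List[Frame], alpha: float = 0.5) -> List[Frame]:
    """Concatenate forward (past to future) and backward (future to past) states."""
    forward = propagate(frames, alpha)
    # Run the same recurrence on the reversed sequence, then restore frame order.
    backward = propagate(frames[::-1], alpha)[::-1]
    return [f + b for f, b in zip(forward, backward)]


if __name__ == "__main__":
    # Only the last frame carries signal, so the forward pass alone leaves
    # the earlier frames empty; the backward pass brings that signal back.
    video = [[0.0], [0.0], [1.0]]
    for t, feat in enumerate(bidirectional_propagate(video)):
        print(t, feat)
```

In this toy sequence the first frame receives information from the last frame only through the backward pass, which is the kind of future-to-current interaction that, according to the abstract, forward-only propagation misses.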

Published in

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022
7537 pages
ISBN: 9781450392037
DOI: 10.1145/3503161

Copyright © 2022 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 995 of 4,171 submissions, 24%
