research-article

Bidirectionally Learning Dense Spatio-temporal Feature Propagation Network for Unsupervised Video Object Segmentation

Authors:
Jiaqing Fan

Nanjing University of Aeronautics and Astronautics, Nanjing, China

Nanjing University of Aeronautics and Astronautics, Nanjing, China
View Profile

,
Tiankang Su

Nanjing University of Information Science and Technology, Nanjing, China

Nanjing University of Information Science and Technology, Nanjing, China
View Profile

,
Kaihua Zhang

Nanjing University of Information Science and Technology, Nanjing, China

Nanjing University of Information Science and Technology, Nanjing, China
View Profile

,
Qingshan Liu

Nanjing University of Information Science and Technology, Nanjing, China

Nanjing University of Information Science and Technology, Nanjing, China
View Profile

MM '22: Proceedings of the 30th ACM International Conference on MultimediaOctober 2022Pages 3646–3655https://doi.org/10.1145/3503161.3548039

Published:10 October 2022Publication History

MM '22: Proceedings of the 30th ACM International Conference on Multimedia

Pages 3646–3655

ABSTRACT

Spatio-temporal feature representation is essential for accurate unsupervised video object segmentation, which needs an effective feature propagation paradigm for both appearance and motion features that can fully interchange information across frames. However, existing solutions mainly focus on the forward feature propagation from the preceding frame to the current one, either using the former segmentation mask or motion propagation in a frame-by-frame manner. This ignores the bi-directional temporal feature interactions (including the backward propagation from the future to the current frame) across all frames that can help to enhance the spatiotemporal feature representation for segmentation prediction. To this end, this paper presents a novel Dense Bidirectional Spatio-temporal feature propagation Network (DBSNet) to fully integrate the forward and the backward propagations across all frames. Specifically, a dense bi-ConvLSTM module is first developed to propagate the features across all frames in a forward and backward manner. This can fully capture the multi-level spatio-temporal contextual information across all frames, producing an effective feature representation that has a strong discriminative capability to tell from noisy backgrounds. Following it, a spatio-temporal Transformer refinement module is designed to further enhance the propagated features, which can effectively capture the spatio-temporal long-range dependencies among all frames. Afterwards, a Co-operative Direction-aware Graph Attention (Co-DGA) module is designed to integrate the propagated appearancemotion cues, yielding a strong spatio-temporal feature representation for segmentation mask prediction. The Co-DGA assigns proper attentional weights to neighboring points along the coordinate axis, making the segmentation model to selectively focus on the most relevant neighbors. Extensive evaluations on four mainstream challenging benchmarks including DAVIS16, FBMS, DAVSOD, and MCL demonstrate that the proposed DBSNet achieves favorable performance against state-of-the-art methods in terms of all evaluation metrics.

Supplemental Material

Available for Download

mp4

MM22-fp1209.mp4 (41.2 MB)

References

Radhakrishna Achanta, Sheila Hemami, Francisco Estrada, and Sabine Susstrunk. 2009. Frequency-tuned salient region detection. In CVPR.Google Scholar
Reza Azad, Maryam Asadi-Aghbolaghi, Mahmood Fathy, and Sergio Escalera. 2019. Bi-Directional ConvLSTM U-Net with Densley Connected Convolutions. In ICCVW.Google Scholar
Goutam Bhat, Felix J¨aremo Lawin, Martin Danelljan, Andreas Robinson, Michael Felsberg, Luc Van Gool, and Radu Timofte. 2020. Learning what to learn for video object segmentation. In ECCV.Google Scholar
Chenglizhao Chen, Guotao Wang, Chong Peng, Xiaowei Zhang, and Hong Qin. 2019. Improved robust video saliency detection based on long-term spatial-temporal information. TIP (2019).Google Scholar
Yuhuan Chen, Wenbin Zou, Yi Tang, Xia Li, Chen Xu, and Nikos Komodakis. 2018. SCOM: Spatiotemporal constrained optimization for salient object detection. IEEE Transactions on Image Processing 27, 7 (2018), 3345--3357.Google ScholarCross Ref
Yi-Wen Chen, Xiaojie Jin, Xiaohui Shen, and Ming-Hsuan Yang. 2022. Video Salient Object Detection via Contrastive Features and Attention Modules. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 1320--1329.Google ScholarCross Ref
Runmin Cong, Jianjun Lei, Huazhu Fu, Fatih Porikli, Qingming Huang, and Chunping Hou. 2019. Video saliency detection via sparsity-based reconstruction and propagation. TIP (2019).Google Scholar
Muhammad Faisal, Ijaz Akhter, Mohsen Ali, and Richard Hartley. 2020. EpO-net: Exploiting geometric constraints on dense trajectories for motion saliency. In WACV.Google Scholar
Deng-Ping Fan, Ming-Ming Cheng, Yun Liu, Tao Li, and Ali Borji. 2017. Structure-measure: A new way to evaluate foreground maps. In ICCV.Google Scholar
Deng-Ping Fan, Wenguan Wang, Ming-Ming Cheng, and Jianbing Shen. 2019. Shifting more attention to video salient object detection. In CVPR.Google Scholar
Junyu Gao, Tianzhu Zhang, and Changsheng Xu. 2019. Graph convolutional tracking. In CVPR.Google Scholar
Yuchao Gu, Lijuan Wang, Ziqin Wang, Yun Liu, Ming-Ming Cheng, and Shao-Ping Lu. 2020. Pyramid constrained selfattention network for fast video salient object detection. In AAAI.Google Scholar
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Identity mappings in deep residual networks. In European conference on computer vision. Springer, 630--645.Google ScholarCross Ref
Qibin Hou, Daquan Zhou, and Jiashi Feng. 2021. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13713--13722.Google ScholarCross Ref
Ping Hu, Fabian Caba, Oliver Wang, Zhe Lin, Stan Sclaroff, and Federico Perazzi. 2020. Temporally distributed networks for fast video semantic segmentation. In CVPR.Google Scholar
Yuan-Ting Hu, Jia-Bin Huang, and Alexander G Schwing. 2018. Unsupervised video object segmentation using motion saliencyguided spatio-temporal propagation. In Proceedings of the European conference on computer vision (ECCV). 786--802.Google Scholar
Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4700--4708.Google ScholarCross Ref
Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. 2017. Flownet 2.0: Evolution of optical flow estimation with deep networks. In CVPR.Google Scholar
Suyog Dutt Jain, Bo Xiong, and Kristen Grauman. 2017. Fusionseg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. In 2017 IEEE conference on computer vision and pattern recognition (CVPR). IEEE, 2117--2126.Google ScholarCross Ref
Ge-Peng Ji, Keren Fu, Zhe Wu, Deng-Ping Fan, Jianbing Shen, and Ling Shao. 2021. Full-duplex strategy for video object segmentation. In ICCV.Google Scholar
Hansang Kim, Youngbae Kim, Jae-Young Sim, and Chang-Su Kim. 2015. Spatiotemporal saliency detection for video sequences based on random walk with restart. TIP (2015).Google Scholar
Haofeng Li, Guanqi Chen, Guanbin Li, and Yizhou Yu. 2019. Motion guided attention for video salient object detection. In Proceedings of the IEEE/CVF international conference on computer vision. 7274--7283.Google ScholarCross Ref
Jiangtong Li, Wentao Wang, Junjie Chen, Li Niu, Jianlou Si, Chen Qian, and Liqing Zhang. 2021. Video Semantic Segmentation via Sparse Temporal Transformer. In Proceedings of the 29th ACM International Conference on Multimedia. 59--68.Google ScholarDigital Library
Siyang Li, Bryan Seybold, Alexey Vorobyov, Alireza Fathi, Qin Huang, and C-C Jay Kuo. 2018. Instance embedding transfer to unsupervised video object segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 6526--6535.Google ScholarCross Ref
Siyang Li, Bryan Seybold, Alexey Vorobyov, Xuejing Lei, and C-C Jay Kuo. 2018. Unsupervised video object segmentation with motion-based bilateral networks. In Proceedings of the European conference on computer vision (ECCV). 207--223.Google ScholarDigital Library
Yunxiao Li, Shuai Li, Chenglizhao Chen, Aimin Hao, and Hong Qin. 2019. Accurate and robust video saliency detection via self-paced diffusion. TMM (2019).Google Scholar
Daizong Liu, Dongdong Yu, Changhu Wang, and Pan Zhou. 2021. F2Net: Learning to Focus on the Foreground for Unsupervised Video Object Segmentation. In AAAI.Google Scholar
Xiankai Lu, Wenguan Wang, Martin Danelljan, Tianfei Zhou, Jianbing Shen, and Luc Van Gool. 2020. Video object segmentation with episodic graph memory networks. In ECCV.Google Scholar
Xiankai Lu, Wenguan Wang, Chao Ma, Jianbing Shen, Ling Shao, and Fatih Porikli. 2019. See more, know more: Unsupervised video object segmentation with co-attention siamese networks. In CVPR.Google Scholar
Xiankai Lu, Wenguan Wang, Jianbing Shen, David Crandall, and Jiebo Luo. 2020. Zero-shot video object segmentation with coattention siamese networks. PAMI (2020).Google Scholar
Xiankai Lu, Wenguan Wang, Jianbing Shen, Yu-Wing Tai, David J Crandall, and Steven CH Hoi. 2020. Learning video object segmentation from unlabeled videos. In CVPR.Google Scholar
Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. 2018. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European conference on computer vision (ECCV). 116--131.Google ScholarDigital Library
Sachin Mehta and Mohammad Rastegari. 2021. Mobilevit: lightweight, general-purpose, and mobile-friendly vision transformer. arXiv preprint arXiv:2110.02178 (2021).Google Scholar
Peter Ochs, Jitendra Malik, and Thomas Brox. 2013. Segmentation of moving objects by long term video analysis. PAMI (2013).Google Scholar
Seoung Wug Oh, Joon-Young Lee, Kalyan Sunkavalli, and Seon Joo Kim. 2018. Fast video object segmentation by referenceguided mask propagation. In CVPR.Google Scholar
Youwei Pang, Xiaoqi Zhao, Lihe Zhang, and Huchuan Lu. 2020. Multi-scale interactive network for salient object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9413--9422.Google ScholarCross Ref
Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. 2016. A benchmark dataset and evaluation methodology for video object segmentation. In CVPR.Google Scholar
Yongming Rao, Wenliang Zhao, Zheng Zhu, Jiwen Lu, and Jie Zhou. 2021. Global filter networks for image classification. Advances in Neural Information Processing Systems 34 (2021).Google Scholar
Sucheng Ren, Wenxi Liu, Yongtuo Liu, Haoxin Chen, Guoqiang Han, and Shengfeng He. 2021. Reciprocal transformations for unsupervised video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 15455--15464.Google ScholarCross Ref
Hongje Seong, Junhyuk Hyun, and Euntai Kim. 2020. Kernelized memory network for video object segmentation. In ECCV.Google Scholar
Mennatullah Siam, Chen Jiang, Steven Lu, Laura Petrich, Mahmoud Gamal, Mohamed Elhoseiny, and Martin Jagersand. 2019. Video object segmentation using teacher-student adaptation in a human robot interaction (hri) setting. In ICRA.Google Scholar
Hongmei Song, Wenguan Wang, Sanyuan Zhao, Jianbing Shen, and Kin-Man Lam. 2018. Pyramid dilated deeper convlstm for video salient object detection. In ECCV.Google Scholar
Pavel Tokmakov, Karteek Alahari, and Cordelia Schmid. 2017. Learning video object segmentation with visual memory. In ICCV.Google Scholar
Haochen Wang, Xiaolong Jiang, Haibing Ren, Yao Hu, and Song Bai. 2021. SwiftNet: Real-time Video Object Segmentation. In CVPR.Google Scholar
Wenguan Wang, Qiuxia Lai, Huazhu Fu, Jianbing Shen, Haibin Ling, and Ruigang Yang. 2021. Salient object detection in the deep learning era: An in-depth survey. PAMI (2021).Google Scholar
Wenguan Wang, Xiankai Lu, Jianbing Shen, David J Crandall, and Ling Shao. 2019. Zero-shot video object segmentation via attentive graph neural networks. In ICCV.Google Scholar
Wenguan Wang, Jianbing Shen, Xiankai Lu, Steven CH Hoi, and Haibin Ling. 2020. Paying attention to video object pattern understanding. PAMI (2020).Google Scholar
Wenguan Wang, Jianbing Shen, Jianwen Xie, and Fatih Porikli. 2017. Super-trajectory for video segmentation. In ICCV.Google Scholar
Wenguan Wang, Hongmei Song, Shuyang Zhao, Jianbing Shen, Sanyuan Zhao, Steven CH Hoi, and Haibin Ling. 2019. Learning unsupervised video object segmentation through visual attention. In CVPR.Google Scholar
Jun Wei, Shuhui Wang, and Qingming Huang. 2020. F3Net: fusion, feedback and focus for salient object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 12321--12328.Google ScholarCross Ref
Enze Xie, Peize Sun, Xiaoge Song, Wenhai Wang, Xuebo Liu, Ding Liang, Chunhua Shen, and Ping Luo. 2020. Polarmask: Single shot instance segmentation with polar representation. In CVPR.Google Scholar
SHI Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai- Kin Wong, and Wang-chun Woo. 2015. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In NIPS.Google Scholar
Han Xu, Jiayi Ma, Zhuliang Le, Junjun Jiang, and Xiaojie Guo. 2020. Fusiondn: A unified densely connected network for image fusion. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 12484--12491.Google ScholarCross Ref
Mingzhu Xu, Bing Liu, Ping Fu, Junbao Li, and Yu Hen Hu. 2019. Video saliency detection via graph clustering with motion energy and spatiotemporal objectness. TMM (2019).Google Scholar
Mingzhu Xu, Bing Liu, Ping Fu, Junbao Li, Yu Hen Hu, and Shou Feng. 2019. Video salient object detection via robust seeds extraction and multi-graphs manifold propagation. TCSVT (2019).Google Scholar
Ning Xu, Linjie Yang, Yuchen Fan, Jianchao Yang, Dingcheng Yue, Yuchen Liang, Brian Price, Scott Cohen, and Thomas Huang. 2018. Youtube-vos: Sequence-to-sequence video object segmentation. In Proceedings of the European Conference on Computer Vision (ECCV). 585--601.Google ScholarDigital Library
Yi Xu, Longwen Gao, Kai Tian, Shuigeng Zhou, and Huyang Sun. 2019. Non-local convlstm for video compression artifact reduction. In ICCV.Google Scholar
Pengxiang Yan, Guanbin Li, Yuan Xie, Zhen Li, Chuan Wang, Tianshui Chen, and Liang Lin. 2019. Semi-supervised video salient object detection using pseudo-labels. In ICCV.Google Scholar
Sijie Yan, Yuanjun Xiong, and Dahua Lin. 2018. Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI.Google Scholar
Ren Yang. 2021. NTIRE 2021 challenge on quality enhancement of compressed video: Methods and results. In CVPR.Google Scholar
Shu Yang, Lu Zhang, Jinqing Qi, Huchuan Lu, Shuo Wang, and Xiaoxing Zhang. 2021. Learning Motion-Appearance Co-Attention for Zero-Shot Video Object Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1564--1573.Google ScholarCross Ref
Zhao Yang, Qiang Wang, Luca Bertinetto, Weiming Hu, Song Bai, and Philip HS Torr. 2019. Anchor diffusion for unsupervised video object segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 931--940.Google ScholarCross Ref
Kaihua Zhang, Long Wang, Dong Liu, Bo Liu, Qingshan Liu, and Zhu Li. 2020. Dual temporal memory network for efficient video object segmentation. In Proceedings of the 28th ACM International Conference on Multimedia. 1515--1523.Google ScholarDigital Library
Kaihua Zhang, Zicheng Zhao, Dong Liu, Qingshan Liu, and Bo Liu. 2021. Deep Transport Network for Unsupervised Video Object Segmentation. In ICCV.Google Scholar
Lu Zhang, Jianming Zhang, Zhe Lin, Radom´?r M?ech, Huchuan Lu, and You He. 2020. Unsupervised video object segmentation with joint hotspot tracking. In ECCV.Google Scholar
Miao Zhang, Jie Liu, Yifei Wang, Yongri Piao, Shunyu Yao, Wei Ji, Jingjing Li, Huchuan Lu, and Zhongxuan Luo. 2021. Dynamic context-sensitive filtering network for video salient object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1553--1563.Google ScholarCross Ref
He Zhao and Richard P Wildes. 2019. Spatiotemporal feature residual propagation for action prediction. In ICCV.Google Scholar
Xiaoqi Zhao, Youwei Pang, Jiaxing Yang, Lihe Zhang, and Huchuan Lu. 2021. Multi-source fusion and automatic predictor selection for zero-shot video object segmentation. In Proceedings of the 29th ACM International Conference on Multimedia. 2645--2653.Google ScholarDigital Library
Xiaoqi Zhao, Youwei Pang, Lihe Zhang, Huchuan Lu, and Lei Zhang. 2020. Suppress and balance: A simple gated network for salient object detection. In European conference on computer vision. Springer, 35--51.Google ScholarDigital Library
Mingmin Zhen, Shiwei Li, Lei Zhou, Jiaxiang Shang, Haoan Feng, Tian Fang, and Long Quan. 2020. Learning discriminative feature with crf for unsupervised video object segmentation. In European Conference on Computer Vision. Springer, 445--462.Google ScholarDigital Library
Tianfei Zhou, Shunzhou Wang, Yi Zhou, Yazhou Yao, Jianwu Li, and Ling Shao. 2020. Motion-attentive transition for zeroshot video object segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 13066--13073.Google Scholar

Index Terms

Bidirectionally Learning Dense Spatio-temporal Feature Propagation Network for Unsupervised Video Object Segmentation
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
    1. Redundancy
  2. Embedded and cyber-physical systems
    1. Embedded systems
    2. Robotics
2. Networks
  1. Network properties
    1. Network reliability

Recommendations

Efficient Spatio-temporal Segmentation for Extracting Moving Objects in Video Sequences

Extraction of moving objects is an important and fundamental research topic for many digital video applications. This paper addresses an efficient spatio-temporal segmentation scheme to extract moving objects from video sequences. The temporal ...
Read More
A Spatio-temporal Feature Based on Triangulation of Dense SURF
ICCVW '13: Proceedings of the 2013 IEEE International Conference on Computer Vision Workshops

In this paper, we propose a spatio-temporal feature which is based on the appearance and movement of interest SURF key points. Given a video, we extract its spatiotemporal features according to every small set of frames. For each frame set, we first ...
Read More
Unsupervised Video Object Segmentation Using Motion Saliency-Guided Spatio-Temporal Propagation
Computer Vision – ECCV 2018
Abstract
Unsupervised video segmentation plays an important role in a wide variety of applications from object identification to compression. However, to date, fast motion, motion blur and occlusions pose significant challenges. To address these challenges ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022
7537 pages
ISBN:9781450392037
DOI:10.1145/3503161
General Chairs:
João Magalhães
NOVA University of Lisbon, Portugal
,
Alberto del Bimbo
University of Florence, Italy
,
Shin'ichi Satoh
National Institute of Informatics, Japan
,
Nicu Sebe
University of Trento, Italy
,
Program Chairs:
Xavier Alameda-Pineda
Inria, Grenoble, France
,
Qin Jin
Renmin University of China, China
,
Vincent Oria
New Jersey Institute of Technology, USA
,
Laura Toni
University College London, UK
Copyright © 2022 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 10 October 2022
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
bidirectional feature propagation
deep learning
direction-aware graph attention
feature refinement
unsupervised video object segmentation
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate995of4,171submissions,24%
Upcoming Conference
MM '24

Sponsor:

sigmm

MM '24: The 32nd ACM International Conference on Multimedia

October 28 - November 1, 2024

Melbourne , VIC , Australia
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 0
  Total Citations
  View Citations
- 237
  Total Downloads
- Downloads (Last 12 months)99
- Downloads (Last 6 weeks)10
Other Metrics
View Author Metrics
Cited By
This publication has not been cited yet

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Bidirectionally Learning Dense Spatio-temporal Feature Propagation Network for Unsupervised Video Object Segmentation

MM '22: Proceedings of the 30th ACM International Conference on Multimedia

ABSTRACT

Supplemental Material

Available for Download

References

Cited By

Index Terms

Recommendations

Efficient Spatio-temporal Segmentation for Extracting Moving Objects in Video Sequences

A Spatio-temporal Feature Based on Triangulation of Dense SURF

Unsupervised Video Object Segmentation Using Motion Saliency-Guided Spatio-Temporal Propagation