ABSTRACT
Spatio-temporal feature representation is essential for accurate unsupervised video object segmentation, which requires an effective feature propagation paradigm that lets appearance and motion features fully exchange information across frames. However, existing solutions mainly focus on forward feature propagation from the preceding frame to the current one, using either the previous segmentation mask or frame-by-frame motion propagation. This ignores the bi-directional temporal feature interactions across all frames (including backward propagation from future frames to the current frame), which can enhance the spatio-temporal feature representation for segmentation prediction. To this end, this paper presents a novel Dense Bidirectional Spatio-temporal feature propagation Network (DBSNet) that fully integrates forward and backward propagation across all frames. Specifically, a dense bi-ConvLSTM module is first developed to propagate features across all frames in both the forward and backward directions. This captures multi-level spatio-temporal contextual information across all frames, producing a feature representation with strong discriminative capability for distinguishing objects from noisy backgrounds. Next, a spatio-temporal Transformer refinement module is designed to further enhance the propagated features by capturing spatio-temporal long-range dependencies among all frames. Afterwards, a Co-operative Direction-aware Graph Attention (Co-DGA) module is designed to integrate the propagated appearance-motion cues, yielding a strong spatio-temporal feature representation for segmentation mask prediction. The Co-DGA assigns proper attentional weights to neighboring points along the coordinate axis, enabling the segmentation model to selectively focus on the most relevant neighbors.
Extensive evaluations on four mainstream challenging benchmarks including DAVIS16, FBMS, DAVSOD, and MCL demonstrate that the proposed DBSNet achieves favorable performance against state-of-the-art methods in terms of all evaluation metrics.
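To make the idea of bidirectional propagation concrete, the following is a minimal toy sketch, not the DBSNet implementation. It assumes each frame is reduced to a flat feature vector and replaces the ConvLSTM cell with a hypothetical one-line recurrence, `h_t = tanh(x_t + decay * h_{t-1})`. The names `propagate`, `bidirectional_propagate`, and `decay` are illustrative assumptions and do not come from the paper.

```python
import math
from typing import List

Frame = List[float]  # toy stand-in: one flattened feature vector per frame


def propagate(frames: List[Frame], decay: float = 0.5) -> List[Frame]:
    """Run a simple recurrent pass over the frames in the order given.

    Hypothetical stand-in for a ConvLSTM cell:
    h_t = tanh(x_t + decay * h_{t-1}), applied element-wise.
    """
    if not frames:
        return []
    hidden = [0.0] * len(frames[0])
    outputs = []
    for x in frames:
        hidden = [math.tanh(xi + decay * hi) for xi, hi in zip(x, hidden)]
        outputs.append(hidden)
    return outputs


def bidirectional_propagate(frames: List[Frame], decay: float = 0.5) -> List[Frame]:
    """Concatenate forward-pass and backward-pass states for every frame."""
    forward = propagate(frames, decay)
    # Process the reversed sequence, then flip back so index t lines up again.
    backward = propagate(frames[::-1], decay)[::-1]
    return [f + b for f, b in zip(forward, backward)]


# Three frames with two feature channels each.
frames = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
features = bidirectional_propagate(frames)
```

In this toy input the middle frame carries no signal of its own, yet its combined feature picks up information from the first frame through the forward pass and from the last frame through the backward pass. A forward-only scheme would miss the latter.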
Index Terms
- Bidirectionally Learning Dense Spatio-temporal Feature Propagation Network for Unsupervised Video Object Segmentation