research-article

Weakly Supervised Video Object Segmentation via Dual-attention Cross-branch Fusion

Published: 03 March 2022

Abstract

Because collecting large-scale, explicitly annotated videos is costly, weakly supervised video object segmentation (WSVOS) using only video-level tags has recently attracted much attention. Existing WSVOS approaches follow a general two-phase pipeline: a pseudo-mask generation phase and a refinement phase. To exploit the intrinsic properties and correlations embedded in video frames, most of them focus on the latter phase and introduce optical flow as temporal information to provide additional supervision. However, these optical-flow-based methods are sensitive to illumination changes and distortion, and they overlook the discriminative capacity of multi-level deep features. In this article, with the goal of capturing more effective temporal information and designing a corresponding temporal-information fusion strategy, we propose a unified WSVOS model that adopts a two-branch architecture with a multi-level cross-branch fusion strategy, named the dual-attention cross-branch fusion network (DACF-Net). Concretely, the two branches of DACF-Net, i.e., a temporal prediction subnetwork (TPN) and a spatial segmentation subnetwork (SSN), extract temporal information and generate the predicted segmentation masks, respectively. To perform cross-branch fusion between the TPN and the SSN, we propose a dual-attention fusion module that can be flexibly plugged into the SSN. We also introduce a cross-frame coherence loss (CFCL) that exploits the coherence between the masks produced by the TPN and the SSN to obtain smooth segmentation results. Extensive experiments on two challenging datasets, DAVIS-2016 and YouTube-Objects, demonstrate the effectiveness of the proposed approach compared with state-of-the-art methods.
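The abstract names the two branches (TPN and SSN), the dual-attention fusion module, and the cross-frame coherence loss but gives no implementation detail, so the sketch below is only one plausible reading of those components, not the authors' code. Everything concrete here is an assumption made for illustration: the module name DualAttentionFusion, the squeeze-and-excitation-style channel gate, the single-channel spatial gate, the (B, T, 1, H, W) mask layout, and the L1-based coherence terms.

    # Hedged PyTorch sketch of the dual-attention cross-branch fusion idea
    # and a cross-frame coherence term; hypothetical, not the paper's code.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DualAttentionFusion(nn.Module):
        """Fuse temporal-branch (TPN) features into the spatial branch (SSN)
        with a channel gate and a position (spatial) gate. Layer choices are
        assumptions, not taken from the paper."""
        def __init__(self, channels):
            super().__init__()
            self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)
            self.channel_gate = nn.Sequential(          # channel attention
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(channels, channels // 4, 1), nn.ReLU(inplace=True),
                nn.Conv2d(channels // 4, channels, 1), nn.Sigmoid(),
            )
            self.spatial_gate = nn.Sequential(          # position attention
                nn.Conv2d(channels, 1, kernel_size=7, padding=3), nn.Sigmoid(),
            )

        def forward(self, ssn_feat, tpn_feat):
            # Match the temporal features to the spatial branch resolution.
            if tpn_feat.shape[-2:] != ssn_feat.shape[-2:]:
                tpn_feat = F.interpolate(tpn_feat, size=ssn_feat.shape[-2:],
                                         mode="bilinear", align_corners=False)
            fused = self.reduce(torch.cat([ssn_feat, tpn_feat], dim=1))
            fused = fused * self.channel_gate(fused)    # re-weight channels
            fused = fused * self.spatial_gate(fused)    # re-weight positions
            return fused + ssn_feat                     # residual connection

    def cross_frame_coherence_loss(ssn_masks, tpn_masks):
        """One possible CFCL: make the SSN and TPN mask probabilities agree
        and discourage abrupt changes between consecutive SSN frames.
        Both inputs are assumed to be (B, T, 1, H, W) probabilities."""
        agree = F.l1_loss(ssn_masks, tpn_masks)
        temporal = F.l1_loss(ssn_masks[:, 1:], ssn_masks[:, :-1])
        return agree + temporal

Under these assumptions, the fused features would feed the SSN decoder at each level and the coherence term would be added to the ordinary segmentation loss; the actual DACF-Net may differ in all of these choices.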




      Information

      Published In

      ACM Transactions on Intelligent Systems and Technology, Volume 13, Issue 3
      June 2022
      415 pages
      ISSN: 2157-6904
      EISSN: 2157-6912
      DOI: 10.1145/3508465
      • Editor: Huan Liu

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 03 March 2022
      Accepted: 01 December 2021
      Revised: 01 November 2021
      Received: 01 June 2021
      Published in TIST Volume 13, Issue 3


      Author Tags

      1. video object segmentation
      2. weakly supervised
      3. temporal information
      4. attention

      Qualifiers

      • Research-article
      • Refereed

      Funding Sources

      • Beijing Natural Science Foundation
      • National Natural Science Foundation of China


      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months): 66
      • Downloads (Last 6 weeks): 3
      Reflects downloads up to 20 Feb 2025


      Cited By

      • (2024) An Unbiased Risk Estimator for Partial Label Learning with Augmented Classes. ACM Transactions on Intelligent Systems and Technology 15(6), 1–22. DOI: 10.1145/3700137. Online publication date: 14-Oct-2024.
      • (2024) UVIS: Unsupervised Video Instance Segmentation. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2682–2692. DOI: 10.1109/CVPRW63382.2024.00274. Online publication date: 17-Jun-2024.
      • (2024) Soft Hybrid Knowledge Distillation against deep neural networks. Neurocomputing 570(C). DOI: 10.1016/j.neucom.2023.127142. Online publication date: 12-Apr-2024.
      • (2023) Fast Real-Time Video Object Segmentation with a Tangled Memory Network. ACM Transactions on Intelligent Systems and Technology 14(3), 1–21. DOI: 10.1145/3585076. Online publication date: 13-Apr-2023.
      • (2023) A Weakly Supervised Learning Framework for Salient Object Detection via Hybrid Labels. IEEE Transactions on Circuits and Systems for Video Technology 33(2), 534–548. DOI: 10.1109/TCSVT.2022.3205182. Online publication date: Feb-2023.
      • (2023) Weakly Annotated Residential Area Segmentation Based on Attention Redistribution and Co-Learning. IEEE Geoscience and Remote Sensing Letters 20, 1–5. DOI: 10.1109/LGRS.2023.3318351. Online publication date: 2023.
      • (2023) A systematic review of deep learning frameworks for moving object segmentation. Multimedia Tools and Applications 83(8), 24715–24748. DOI: 10.1007/s11042-023-16417-3. Online publication date: 9-Aug-2023.
      • (2022) Vision beyond the Field-of-View: A Collaborative Perception System to Improve Safety of Intelligent Cyber-Physical Systems. Sensors 22(17), 6610. DOI: 10.3390/s22176610. Online publication date: 1-Sep-2022.
      • (2022) Sequential Clique Optimization for Unsupervised and Weakly Supervised Video Object Segmentation. Electronics 11(18), 2899. DOI: 10.3390/electronics11182899. Online publication date: 13-Sep-2022.
