Abstract
We propose a novel spatiotemporal graphical model for unsupervised video object segmentation. The core of the model is a layered conditional random field (layered-CRF) with two layers: a pixel layer and a supervoxel layer. First, heat-diffusion-based segmentation and salient region detection are integrated to segment the first frame. The resulting segmentation provides the input seeds used to train dual probabilistic models for each object class. Within the spatiotemporal layered-CRF framework, we extend binary segmentation to multiple-object segmentation. We add an intra-frame spatial matching potential and an inter-frame temporal supervoxel consistency potential to link the pixel layer and the supervoxel layer, which improves spatiotemporal smoothness throughout the video sequence. Because the method is unsupervised, it reduces the burden of labeling training samples, and it yields smooth and accurate object boundaries. Experiments on two public datasets demonstrate that our method outperforms several state-of-the-art methods in both single- and multiple-foreground cases.
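To make the two-layer idea concrete, the following is a minimal sketch of how an energy over a pixel layer and a supervoxel layer might be scored for a candidate multi-label segmentation. It is not the formulation used in the paper: the Potts-style pairwise terms, the majority-vote supervoxel penalty, the function name `layered_energy`, and the weights `w_spatial`, `w_temporal`, and `w_sv` are all assumptions chosen for illustration.

```python
import numpy as np

def layered_energy(labels, unary, supervoxels,
                   w_spatial=1.0, w_temporal=1.0, w_sv=1.0):
    """Score a labeling with a toy two-layer energy (illustrative only).

    labels:      (T, H, W) integer array, one object label per pixel.
    unary:       (T, H, W, K) array, cost of assigning each of K labels.
    supervoxels: (T, H, W) integer array of supervoxel ids.
    """
    # Pixel layer, unary term: cost of the label chosen at each pixel.
    t, y, x = np.indices(labels.shape)
    e_unary = unary[t, y, x, labels].sum()

    # Pixel layer, intra-frame term: Potts penalty for each pair of
    # vertically or horizontally adjacent pixels with different labels.
    e_spatial = ((labels[:, 1:, :] != labels[:, :-1, :]).sum()
                 + (labels[:, :, 1:] != labels[:, :, :-1]).sum())

    # Pixel layer, inter-frame term: Potts penalty between the same
    # pixel location in consecutive frames (assumes no motion model).
    e_temporal = (labels[1:] != labels[:-1]).sum()

    # Supervoxel layer: count pixels that disagree with the majority
    # label of the supervoxel they belong to.
    e_sv = 0
    for sv in np.unique(supervoxels):
        member_labels = labels[supervoxels == sv]
        e_sv += member_labels.size - np.bincount(member_labels).max()

    return (e_unary + w_spatial * e_spatial
            + w_temporal * e_temporal + w_sv * e_sv)
```

In this sketch a lower energy means a labeling that is cheaper under the appearance costs, smoother within and across frames, and more consistent inside each supervoxel. A real system would minimize such an energy with an inference method such as graph cuts rather than only evaluating it.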








Acknowledgments
This work is supported by the National Natural Science Foundation of China (NSFC: 61175026), the International Science and Technology Cooperation Special Programme (No. 2013DFG12810), the Ningbo Municipal Natural Science Foundation of China (2014A610031, 2014A610032), the Open Research Fund of Zhejiang First-foremost Key Subject-Information and Communications Engineering of China (XKXL1316), the C. Wong Magna Fund in Ningbo University, and the Open Fund of Zhejiang Provincial Key Academic Project (first level).
Cite this article
Guo, L., Cheng, T., Huang, Y. et al. Unsupervised video object segmentation by spatiotemporal graphical model. Multimed Tools Appl 76, 1037–1053 (2017). https://doi.org/10.1007/s11042-015-3100-9
DOI: https://doi.org/10.1007/s11042-015-3100-9