Abstract
Effective parsing of video in both the spatial and temporal domains is vital to many computer vision problems, because labeling objects in video automatically is far less tedious than doing so manually. Some works in the literature parse semantic information on individual 2D images or individual video frames; however, these approaches use only spatial information, ignoring temporal continuity and failing to consider the relevance between frames. Other approaches, which likewise consider only spatial information, attempt to propagate labels in the temporal domain to parse the semantic information of the whole video, yet the non-injective and non-surjective nature of this propagation can cause the black hole effect. In this paper, inspired by annotated image datasets (e.g., Stanford Background Dataset, LabelMe, and SIFT-FLOW), we propose to transfer or propagate such labels from images to videos. The proposed approach consists of three main stages: I) the posterior category probability density function (PDF) is learned by an algorithm that combines frame relevance with label propagation from images; II) the prior contextual-constraint PDF on the map of pixel categories across the whole video is learned by a Markov Random Field (MRF); III) based on both learned PDFs, the final parsing results are obtained by maximum a posteriori (MAP) inference, computed via an efficient graph-cut based integer optimization algorithm. Experiments show that the proposed approach effectively handles the black hole effect.
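The three-stage pipeline above amounts to minimizing an MRF energy: a unary term from the learned posterior category PDF plus a pairwise contextual prior, solved at MAP time. The sketch below is illustrative only, not the paper's implementation: it uses iterated conditional modes (ICM) as a simple stand-in for the graph-cut optimizer, a Potts penalty as the pairwise prior, and all names (`map_labeling_icm`, `unary`, `edges`, `lam`) are hypothetical.

```python
import numpy as np

def map_labeling_icm(unary, edges, lam=1.0, iters=10):
    """Approximate MAP labeling for an MRF over superpixels.

    unary : (n, k) array of per-node costs, e.g. the negative log of the
            posterior category probabilities learned in stage I.
    edges : list of (i, j) neighbor pairs (the spatiotemporal cliques of
            stage II).
    lam   : smoothness weight for a Potts prior, lam * [x_i != x_j].
    """
    n, k = unary.shape
    labels = unary.argmin(axis=1)  # initialize with the independent MAP
    nbrs = [[] for _ in range(n)]
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    for _ in range(iters):
        changed = False
        for i in range(n):
            cost = unary[i].copy()
            for j in nbrs[i]:  # Potts pairwise term against each neighbor
                cost += lam * (np.arange(k) != labels[j])
            best = cost.argmin()
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:  # converged to a local minimum of the energy
            break
    return labels
```

With a strong smoothness weight, an outlier node whose unary term weakly prefers a different label is pulled to agree with its neighbors, which is the mechanism by which the contextual prior suppresses isolated mislabels; the paper's graph-cut solver optimizes the same kind of energy globally rather than coordinate-wise.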
Notes
1 A partition of the set \(\mathbf {r}^{t}=\{{r_{1}^{t}}, {r_{2}^{t}}, \cdots , r_{n_{t}}^{t}\}\) is a collection of nonempty sets \(R_{i}\subset \mathbf {r}^{t}\), i=1,2,⋯ ,k, such that \(R_{i}\cap R_{j}=\emptyset \) for i≠j and \(\cup _{i=1}^{k} R_{i}=\mathbf {r}^{t}\).
2 In this paper, a clique is either a single superpixel or a set of superpixels that are pairwise adjacent to one another.
3 The impulse function is δ(c)=1 when c=0, and δ(c)=0 otherwise.
References
Bai X, Sapiro G (2009) Geodesic matting: a framework for fast interactive image and video segmentation and matting. Int J Comput Vis 82:113–132
Baker S, Roth S, Scharstein D, Black M, Lewis J, Szeliski R (2007) A database and evaluation methodology for optical flow. In: Proceedings of international conference on computer vision
Besag J (1974) Spatial interaction and the statistical analysis of lattice systems. J R Stat Soc B 36:192–236
Boykov Y, Veksler O, Zabih R (2001) Efficient approximate energy minimization via graph cuts. IEEE Trans Pattern Anal Mach Intell 23:1222–1239
Boykov Y, Kolmogorov V (2004) An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans Pattern Anal Mach Intell 26:1124–1137
Chen X, Jin X, Wang K (2014) Lighting virtual objects in a single image via coarse scene understanding. Sci China Inf Sci 57(9):092105(14)
Chuang Y, Agarwala A, Curless B, Salesin D, Szeliski R (2002) Video matting of complex scenes. In: Proceedings of ACM SIGGRAPH
Criminisi A, Cross G, Blake A, Kolmogorov V (2006) Bilayer segmentation of live video. In: Proceedings of international conference on computer vision and pattern recognition
Ess A, Mueller T, Grabner H, van Gool L (2009) Segmentation-based urban traffic scene understanding. In: Proceedings of British machine vision conference
Fauqueur J, Brostow G, Cipolla R (2007) Assisted video object labeling by joint tracking of regions and keypoints. In: Proceedings of international conference on computer vision
Geman S, Geman D (1984) Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. IEEE Trans Pattern Anal Mach Intell 6:721–741
Gould S, Fulton R, Koller D (2009) Decomposing a scene into geometric and semantically consistent regions. In: Proceedings of international conference on computer vision
Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L (2014) Large-scale video classification with convolutional neural networks. In: Proceedings of international conference on computer vision and pattern recognition
Kolmogorov V, Zabih R (2004) What energy functions can be minimized via graph cuts. IEEE Trans Pattern Anal Mach Intell 26:147–159
Kolmogorov V (2006) Convergent tree-reweighted message passing for energy minimization. IEEE Trans Pattern Anal Mach Intell 28:1568–1583
Ladicky L, Sturgess P, Russell C, Sengupta S, Bastanlar Y, Clocksin W, Torr P (2010) Joint optimization for object class segmentation and dense stereo reconstruction. Int J Comput Vis:1–12
Lee H, Battle A, Raina R, Ng AY (2006) Efficient sparse coding algorithms. In: Proceedings of neural information processing systems
Li X, Mou L, Lu X (2014) Scene parsing from an MAP perspective. IEEE Trans Cybern. doi:10.1109/TCYB.2014.2361489
Liu Y, Liu Y, Chan K (2011) Tensor-based locally maximum margin classifier for image and video classification. Comput Vis Image Understand 115:1762–1771
Liu C, Yuen J, Torralba A (2011) Nonparametric scene parsing via label transfer. IEEE Trans Pattern Anal Mach Intell 33:2368–2382
Lu X, Li X, Mou L (2014) Semi-supervised multi-task learning for scene recognition. IEEE Trans Cybern. doi:10.1109/TCYB.2014.2362959
Malisiewicz T, Gupta A, Efros A A (2011) Ensemble of exemplar-SVMs for object detection and beyond. In: Proceedings of international conference on computer vision
Mou L, Lu X, Yuan Y (2013) Object or background: whose call is it in complicated scene classification? In: Proceedings of IEEE China summit and international conference on signal and information processing
Oliva A, Torralba A (2001) Modeling the shape of the scene: a holistic representation of the spatial envelope. Int J Comput Vis 42:145–175
Robertson N, Reid I (2006) A general method for human activity recognition in video. Comput Vis Image Understand 104:232–248
Russell B, Torralba A, Murphy K, Freeman W (2008) LabelMe: a database and web-based tool for image annotation. Int J Comput Vis 77:157–173
Shao L, Simon J, Li X (2014) Efficient search and localization of human actions in video databases. IEEE Trans Circuits Syst Video Techn 24:504–512
Theriault C, Thome N, Cord M (2013) Dynamic scene classification: learning motion descriptors with slow features analysis. In: Proceedings of international conference on computer vision and pattern recognition
Tighe J, Lazebnik S (2013) Finding things: image parsing with regions and per-exemplar detectors. In: Proceedings of international conference on computer vision and pattern recognition
Tighe J, Lazebnik S (2013) Superparsing: scalable nonparametric image parsing with superpixels. Int J Comput Vis 101:329–349
Wang J, Cohen M (2005) An iterative optimization approach for unified image segmentation and matting. In: Proceedings of international conference on computer vision
Xiao J, Hays J, Ehinger K, Oliva A, Torralba A (2010) SUN database: large-scale scene recognition from abbey to zoo. In: Proceedings of international conference on computer vision and pattern recognition
Yang X, Gao X, Tao D, Li X, Li J (2015) An efficient MRF embedded level set method for image segmentation. IEEE Trans Image Process 24:9–21
Yedidia J, Freeman W, Weiss Y (2000) Generalized belief propagation. In: Proceedings of neural information processing systems
Yedidia J, Freeman W, Weiss Y (2003) Understanding belief propagation and its generalizations. Explor Artif Intell New Millennium 8:236–239
Yedidia J S, Freeman W T, Weiss Y (2005) Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans Inf Theory 51:2282–2312
Yuan Y, Mou L, Lu X (2015) Scene recognition by manifold regularized deep learning architecture. IEEE Trans Neural Netw Learn Syst. doi:10.1109/TNNLS.2014.2359471
Zhang C, Wang L, Yang R (2010) Semantic segmentation of urban scenes using dense depth maps. In: Proceedings of European conference on computer vision
Acknowledgments
This work was supported in part by the National Basic Research Program of China (973 Program) under Grant 2012CB719905, in part by the National Natural Science Foundation of China under Grant 61472413, in part by Chinese Academy of Sciences under Grant LSIT201408 and in part by the Key Research Program of the Chinese Academy of Sciences under Grant KGZD-EW-T03.
Li, X., Mou, L. & Lu, X. Video parsing via spatiotemporally analysis with images. Multimed Tools Appl 75, 11961–11976 (2016). https://doi.org/10.1007/s11042-015-2735-x