
Video parsing via spatiotemporally analysis with images

Published in: Multimedia Tools and Applications

Abstract

Effective parsing of video through the spatial and temporal domains is vital to many computer vision problems, because it allows objects in video to be labeled automatically rather than by tedious manual annotation. Several works propose to parse semantic information on individual 2D images or individual video frames; however, these approaches use only the spatial information, ignore temporal continuity, and fail to consider the relevance between frames. On the other hand, approaches that rely only on spatial information and attempt to propagate labels through the temporal domain to parse the whole video suffer from the non-injective and non-surjective nature of the propagation, which can cause the black-hole effect. In this paper, inspired by annotated image datasets (e.g., the Stanford Background Dataset, LabelMe, and SIFT-FLOW), we propose to transfer or propagate such labels from images to videos. The proposed approach consists of three main stages: (I) the posterior category probability density function (PDF) is learned by an algorithm that combines frame relevance with label propagation from images; (II) the prior contextual-constraint PDF on the map of pixel categories over the whole video is learned by a Markov random field (MRF); (III) based on both learned PDFs, the final parsing results are obtained by maximum a posteriori (MAP) estimation, computed via an efficient graph-cut-based integer optimization algorithm. Experiments show that the black-hole effect is effectively handled by the proposed approach.
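The final stage described above amounts to MAP estimation over an MRF: each superpixel carries a unary cost from the learned posterior PDF, and adjacent superpixels are coupled by a smoothness prior. The sketch below is a minimal illustration of that energy-minimization step, assuming precomputed negative-log-posterior costs and an adjacency list. Note the assumptions: it uses iterated conditional modes (ICM) as a simple stand-in for the graph-cut optimizer the paper actually employs, and all names (`icm_map_labeling`, `unary`, `edges`, `lam`) are illustrative, not from the paper.

```python
import numpy as np

def icm_map_labeling(unary, edges, lam=1.0, n_iters=10):
    """Approximate MAP labeling of superpixels under a Potts MRF prior.

    unary: (n, k) array; unary[i, c] is the negative log posterior of
           label c at site i (stage I of the pipeline).
    edges: list of (i, j) adjacency pairs defining the MRF neighborhood
           (stage II of the pipeline).
    lam:   weight of the Potts smoothness penalty.

    ICM is used here as a simple stand-in for the graph-cut-based
    optimizer described in the paper.
    """
    n, k = unary.shape
    labels = unary.argmin(axis=1)  # initialize from the posterior alone
    nbrs = [[] for _ in range(n)]
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    for _ in range(n_iters):
        changed = False
        for i in range(n):
            # cost of each label = unary term + Potts penalty vs. neighbors
            cost = unary[i].copy()
            for j in nbrs[i]:
                cost += lam * (np.arange(k) != labels[j])
            best = cost.argmin()
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:  # converged to a local minimum of the energy
            break
    return labels
```

On a small chain of four sites where one site weakly prefers the "wrong" label, the smoothness prior overrides the noisy unary term and yields a spatially consistent labeling, which is the mechanism that suppresses isolated mislabelings (and, in the temporal direction, the black-hole effect).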


[Figures 1–6 omitted in this preview]


Notes

  1. A partition of the set \(\mathbf {r}^{t}=\{{r_{1}^{t}}, {r_{2}^{t}}, \cdots , r_{n_{t}}^{t}\}\) is a collection of sets \(R_{i}\subset \mathbf {r}^{t}\), \(i=1,2,\cdots ,k\), such that \(R_{i}\cap R_{j}=\emptyset\) for \(i\neq j\) and \(\cup _{i=1}^{k} R_{i}=\mathbf {r}^{t}\).

  2. In this paper, a clique is either a set of superpixels that are mutually adjacent neighbors of one another, or a single superpixel.

  3. The impulse function is defined as δ(c)=1 when c=0, and δ(c)=0 otherwise.


Acknowledgments

This work was supported in part by the National Basic Research Program of China (973 Program) under Grant 2012CB719905, in part by the National Natural Science Foundation of China under Grant 61472413, in part by the Chinese Academy of Sciences under Grant LSIT201408, and in part by the Key Research Program of the Chinese Academy of Sciences under Grant KGZD-EW-T03.

Author information

Corresponding author

Correspondence to Xiaoqiang Lu.

About this article


Cite this article

Li, X., Mou, L. & Lu, X. Video parsing via spatiotemporally analysis with images. Multimed Tools Appl 75, 11961–11976 (2016). https://doi.org/10.1007/s11042-015-2735-x

