Spatially Coherent Interpretations of Videos Using Pattern Theory

de Souza, Fillipe D. M.; Sarkar, Sudeep; Srivastava, Anuj; Su, Jingyong

doi:10.1007/s11263-016-0913-6

Published: 30 May 2016

Volume 121, pages 5–25, (2017)

International Journal of Computer Vision

Fillipe D. M. de Souza¹,
Sudeep Sarkar¹,
Anuj Srivastava² &
…
Jingyong Su³

1017 Accesses
5 Citations

Abstract

Activity interpretation in videos results not only in recognition or labeling of dominant activities, but also in semantic descriptions of scenes. Towards this broader goal, we present a combinatorial approach that assumes the availability of algorithms for detecting and labeling objects and basic actions in videos, albeit with some errors. Given these uncertain labels and detected objects, we link them into interpretable structures using domain knowledge, under the framework of Grenander’s general pattern theory. Here a semantic description is built using basic units, termed generators, that represent either objects or actions. These generators have multiple out-bonds, each associated with different types of domain semantics, spatial constraints, and image evidence. The generators combine, according to a set of pre-defined combination rules that capture domain semantics, to form larger configurations that represent video interpretations. This framework derives its representational power from flexibility in the size and structure of configurations. We impose a probability distribution on the configuration space, with inferences generated using a Markov chain Monte Carlo-based simulated annealing process. The primary advantage of the approach is that it handles known challenges (appearance variabilities, errors in object labels, object clutter, simultaneous events, etc.) without the need for exponentially large (labeled) training data. Experimental results demonstrate its ability to successfully provide interpretations under clutter and simultaneity of events. They show: (1) a performance increase of more than 30% over other state-of-the-art approaches using more than 5000 video units from the Breakfast Actions dataset, and (2) an overall recall and precision improvement of more than 50% and 100%, respectively, on the YouCook dataset.
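To make the ideas in the abstract concrete, here is a minimal toy sketch of generators, bonds, a configuration energy, and simulated annealing with Metropolis acceptance. It is not the authors' implementation. Everything specific in it is an assumption made for illustration: the `Generator` fields, the `COMPATIBLE` table standing in for domain knowledge, the bond energy (negative product of detector scores for allowed pairs, a fixed penalty otherwise), the Gibbs form P(c) ∝ exp(−E(c)/T), and the geometric cooling schedule. Spatial constraints and the paper's actual combination rules are omitted.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Generator:
    """Basic unit of an interpretation (hypothetical fields)."""
    label: str
    kind: str     # "action" or "object"
    score: float  # assumed detector confidence in (0, 1]


# Assumed domain knowledge: action/object label pairs that may bond.
COMPATIBLE = {("pour", "milk"), ("pour", "cup"), ("cut", "knife")}


def bond_energy(action, obj):
    """Toy bond energy: reward supported, compatible bonds; penalize others."""
    if (action.label, obj.label) in COMPATIBLE:
        return -action.score * obj.score
    return 1.0


def energy(config, gens):
    """Energy of a configuration, given as a set of (action_idx, object_idx) bonds."""
    return sum(bond_energy(gens[a], gens[o]) for a, o in config)


def anneal(gens, steps=2000, t0=1.0, alpha=0.995, seed=0):
    """Simulated annealing over bond sets with Metropolis acceptance."""
    rng = random.Random(seed)
    actions = [i for i, g in enumerate(gens) if g.kind == "action"]
    objects = [i for i, g in enumerate(gens) if g.kind == "object"]
    candidates = [(a, o) for a in actions for o in objects]

    state = frozenset()
    e = energy(state, gens)
    best, best_e = state, e
    for step in range(steps):
        temp = t0 * alpha ** step                  # geometric cooling
        proposal = state ^ {rng.choice(candidates)}  # toggle one bond
        new_e = energy(proposal, gens)
        delta = new_e - e
        # Always accept improvements; accept worse moves with prob exp(-delta/T).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            state, e = proposal, new_e
            if e < best_e:
                best, best_e = state, e
    return best, best_e


gens = [
    Generator("pour", "action", 0.9),
    Generator("cut", "action", 0.6),
    Generator("milk", "object", 0.8),
    Generator("cup", "object", 0.7),
    Generator("knife", "object", 0.5),
]
best, best_e = anneal(gens)
print(sorted((gens[a].label, gens[o].label) for a, o in best), best_e)
```

Because this toy energy is a sum of independent bond terms, the annealer only has to switch each bond on or off correctly. In the setting the abstract describes, configurations vary in size and structure and bonds interact, which is where stochastic search becomes useful.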



Acknowledgments

This research was supported in part by NSF Grants 1217515 and 1217676.

Author information

Authors and Affiliations

Department of Computer Science & Engineering, University of South Florida, Tampa, FL, USA
Fillipe D. M. de Souza & Sudeep Sarkar
Department of Statistics, Florida State University, Tallahassee, FL, USA
Anuj Srivastava
Department of Mathematics & Statistics, Texas Tech University, Lubbock, TX, USA
Jingyong Su

Corresponding author

Correspondence to Fillipe D. M. de Souza.

Additional information

Communicated by M. Hebert.

About this article

Cite this article

de Souza, F.D.M., Sarkar, S., Srivastava, A. et al. Spatially Coherent Interpretations of Videos Using Pattern Theory. Int J Comput Vis 121, 5–25 (2017). https://doi.org/10.1007/s11263-016-0913-6

Received: 24 September 2014
Accepted: 03 May 2016
Published: 30 May 2016
Issue Date: January 2017
DOI: https://doi.org/10.1007/s11263-016-0913-6
