Abstract
This paper describes a methodology for the automated recognition of complex human activities. The paper proposes a general framework that reliably recognizes high-level human actions and human-human interactions. Our approach is a description-based approach, which enables a user to encode the structure of a high-level human activity as a formal representation; recognition is then performed by semantically matching the constructed representations against actual observations. The methodology uses a context-free grammar (CFG) based representation scheme as a formal syntax for representing composite activities. Our CFG-based representation enables us to define complex human activities in terms of simpler activities or movements. Our system takes advantage of both statistical recognition techniques from computer vision and knowledge representation concepts from traditional artificial intelligence. At the low level of the system, image sequences are processed to extract poses and gestures. Building on the recognized gestures, the high level of the system hierarchically recognizes composite actions and interactions occurring in a sequence of image frames. The concept of hallucinations and a probabilistic semantic-level recognition algorithm are introduced to cope with imperfect lower layers. As a result, the system recognizes human activities including ‘fighting’ and ‘assault’, high-level activities that previous systems had difficulty recognizing. The experimental results show that our system reliably recognizes sequences of complex human activities with a high recognition rate.
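To make the CFG-based idea concrete, the following is a minimal illustrative sketch of how composite activities can be defined in terms of simpler gestures through grammar productions. The activity names, productions, and matching procedure here are hypothetical examples for exposition only, not the paper's actual representation or recognition algorithm (which is probabilistic and semantic-level).

```python
# Toy context-free grammar: each nonterminal (a composite activity) maps to
# alternative productions, and each production is a sequence of nonterminals
# and/or terminal gestures. These rules are invented for illustration.
GRAMMAR = {
    "fight":   [["punch", "punch"], ["punch", "fight"]],  # recursive: repeated punching
    "assault": [["approach", "fight"]],
}

def derives(symbol, gestures):
    """Return True if `symbol` can derive exactly the observed gesture sequence."""
    if symbol not in GRAMMAR:                 # terminal gesture: must match literally
        return gestures == [symbol]
    return any(matches(p, gestures) for p in GRAMMAR[symbol])

def matches(production, gestures):
    """Check whether a production body derives the gesture sequence."""
    if not production:
        return not gestures
    head, rest = production[0], production[1:]
    # Try every split point: the first symbol derives a prefix, the rest the suffix.
    return any(
        derives(head, gestures[:i]) and matches(rest, gestures[i:])
        for i in range(len(gestures) + 1)
    )
```

For example, `derives("assault", ["approach", "punch", "punch"])` returns `True`, since the recursive `fight` rule lets the grammar absorb an arbitrary number of `punch` gestures; a deterministic exact match like this would be replaced in practice by probabilistic matching against imperfect gesture detections.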
Cite this article
Ryoo, M.S., Aggarwal, J.K. Semantic Representation and Recognition of Continued and Recursive Human Activities. Int J Comput Vis 82, 1–24 (2009). https://doi.org/10.1007/s11263-008-0181-1