Abstract
We propose a layered-grammar model for representing actions, in which an action is represented by a set of grammar rules. The bottom layer of an action instance’s parse tree contains action primitives such as spatiotemporal (ST) interest points. At each layer above, we iteratively mine grammar rules and “super rules” that account for high-order compositional feature structures. The grammar rules fall into three classes according to the ST-relations among their action components: the strong relation, the weak relation, and the stochastic relation. These ST-relations characterize different action styles (degrees of stiffness), and they are pursued in the form of grammar rules for the purpose of action recognition. Because relation pursuit adopts the Emerging Pattern (EP) mining algorithm, the learned production rules are statistically significant and discriminative. Using the learned rules, the parse tree of an action video is constructed by combining a bottom-up rule detection step with a top-down step that prunes ambiguous rules. An action instance is recognized based on the discriminative configurations generated by the production rules of its parse tree. Experiments confirm that, by incorporating high-order feature statistics, the proposed method substantially improves recognition performance over bag-of-words models.
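To make the notion of an emerging pattern concrete, the following is a minimal sketch, not the authors' algorithm. It assumes each video is reduced to a set of quantized feature labels (the labels "a", "b", "c" are hypothetical), and it treats an itemset as emerging when its support in the target class is high and its growth rate (target support divided by background support) exceeds a threshold. The function names, thresholds, and brute-force enumeration are illustrative assumptions; practical EP miners avoid exhaustive enumeration.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def mine_emerging_patterns(target, background,
                           min_support=0.5, min_growth=2.0, max_size=2):
    """Brute-force search for itemsets that are frequent in `target`
    and whose support grows by at least `min_growth` relative to
    `background`. Illustrative only: real miners prune the search space.
    """
    items = sorted(set().union(*target))
    patterns = []
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            s_target = support(itemset, target)
            if s_target < min_support:
                continue  # not frequent enough in the target class
            s_background = support(itemset, background)
            # An itemset absent from the background has infinite growth.
            growth = float("inf") if s_background == 0 else s_target / s_background
            if growth >= min_growth:
                patterns.append((itemset, s_target, growth))
    return patterns


# Hypothetical toy data: each set is the feature labels seen in one video.
target_class = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}]
background_class = [{"a"}, {"c"}, {"b", "c"}]

for itemset, s, g in mine_emerging_patterns(target_class, background_class,
                                            min_growth=2.5):
    print(sorted(itemset), round(s, 2), g)
```

In this toy setting, label pairs that co-occur in the target class but never in the background receive an infinite growth rate, which is the kind of discriminative co-occurrence that the mined rules are meant to capture.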
Cite this article
Wang, L., Wang, Y. & Gao, W. Mining Layered Grammar Rules for Action Recognition. Int J Comput Vis 93, 162–182 (2011). https://doi.org/10.1007/s11263-010-0393-z
DOI: https://doi.org/10.1007/s11263-010-0393-z