
Mining Layered Grammar Rules for Action Recognition

International Journal of Computer Vision

Abstract

We propose a layered-grammar model for representing actions. Under this model, an action is represented by a set of grammar rules. The bottom layer of an action instance’s parse tree contains action primitives such as spatiotemporal (ST) interest points. At each layer above, we iteratively mine grammar rules and “super rules” that account for high-order compositional feature structures. The grammar rules fall into three classes according to the ST-relations among their action components: the strong, weak, and stochastic relations. These ST-relations characterize different action styles (degrees of stiffness), and they are pursued in the form of grammar rules for action recognition. Because relation pursuit is carried out with the Emerging Pattern (EP) mining algorithm, the learned production rules are statistically significant and discriminative. Using the learned rules, the parse tree of an action video is constructed by combining a bottom-up rule-detection step with a top-down ambiguous-rule-pruning step. An action instance is then recognized from the discriminative configurations generated by the production rules of its parse tree. Experiments confirm that, by incorporating these high-order feature statistics, the proposed method substantially improves recognition performance over bag-of-words models.
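To make the EP-mining step concrete, here is a minimal sketch, not the authors' implementation. The growth-rate criterion is the standard EP definition (a pattern is emerging if its support in the positive class exceeds its support in the negative class by a growth-rate threshold), while the brute-force enumeration, the toy codeword names (w1, w2, ...), and the thresholds min_support and min_growth are illustrative assumptions; practical EP miners use border-based search rather than exhaustive enumeration.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    if not transactions:
        return 0.0
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def mine_emerging_patterns(pos, neg, min_support=0.2, min_growth=5.0, max_size=3):
    """Brute-force EP mining: return itemsets that are frequent in `pos`
    and whose support grows by at least `min_growth` when moving from
    `neg` to `pos` (infinite growth corresponds to a "jumping" EP)."""
    items = sorted(set().union(*pos))
    patterns = []
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            s_pos = support(itemset, pos)
            if s_pos < min_support:          # prune infrequent candidates
                continue
            s_neg = support(itemset, neg)
            growth = s_pos / s_neg if s_neg > 0 else float("inf")
            if growth >= min_growth:
                patterns.append((itemset, s_pos, growth))
    return patterns

# Toy data: hypothetical codeword sets from clips of two action classes.
boxing  = [frozenset(s) for s in ({"w1", "w3"}, {"w1", "w3", "w7"},
                                  {"w1", "w3"}, {"w2"})]
walking = [frozenset(s) for s in ({"w2", "w5"}, {"w2"},
                                  {"w5", "w7"}, {"w2", "w5"})]

for itemset, s_pos, growth in mine_emerging_patterns(boxing, walking):
    print(sorted(itemset), f"support={s_pos:.2f}", f"growth={growth}")
```

In the paper's setting, a "transaction" would roughly correspond to the set of quantized ST features co-occurring in a candidate configuration, so that the mined patterns serve as statistically significant, discriminative candidates for production rules.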



Author information

Correspondence to Yizhou Wang.


About this article

Cite this article

Wang, L., Wang, Y. & Gao, W. Mining Layered Grammar Rules for Action Recognition. Int J Comput Vis 93, 162–182 (2011). https://doi.org/10.1007/s11263-010-0393-z

