Abstract
This paper proposes a multi-level video representation by stacking the activations of motion features, atoms, and phrases (MoFAP). Motion features refer to low-level local descriptors, while motion atoms and phrases can be viewed as mid-level “temporal parts”. A motion atom is defined as an atomic part of an action and captures the motion information of a video over a short temporal scale. A motion phrase is a temporal composite of multiple motion atoms defined with an AND/OR structure; it further enhances the discriminative capacity of motion atoms by incorporating temporal structure over a longer temporal scale. Specifically, we first design a discriminative clustering method to automatically discover a set of representative motion atoms. Then, we mine effective motion phrases with high discriminative and representative capacity in a bottom-up manner. Based on these basic units of motion features, atoms, and phrases, we construct a MoFAP network by stacking them layer by layer. This MoFAP network enables us to extract effective representations of video data at different levels and scales. The separate representations from motion features, motion atoms, and motion phrases are concatenated into a single one, called the Activation of MoFAP. The effectiveness of this representation is demonstrated on four challenging datasets: Olympic Sports, UCF50, HMDB51, and UCF101. Experimental results show that our representation achieves state-of-the-art performance on these datasets.
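For intuition only, the following is a minimal sketch (not the authors' code) of how such a multi-level activation could be assembled. It assumes each motion atom yields a response score per temporal segment of a video, and that a motion phrase is scored as an AND (min) over OR-nodes, each OR-node taking the max response of one atom within a temporal window; all names (`atom_responses`, `phrase_response`, `mofap_activation`, etc.) are hypothetical.

```python
import numpy as np

def or_node_response(atom_responses, atom_id, window):
    """OR-node: best response of a single atom within a temporal window."""
    start, end = window
    return np.max(atom_responses[atom_id, start:end])

def phrase_response(atom_responses, phrase):
    """AND/OR response of a motion phrase.

    `phrase` is a list of (atom_id, (start, end)) OR-nodes; the phrase
    fires only if all of its OR-nodes fire, hence the min (AND).
    """
    return min(or_node_response(atom_responses, a, w) for a, w in phrase)

def mofap_activation(feature_repr, atom_responses, phrases):
    """Concatenate per-level representations into one activation vector."""
    atom_level = np.max(atom_responses, axis=1)   # max-pool each atom over time
    phrase_level = np.array([phrase_response(atom_responses, p) for p in phrases])
    return np.concatenate([feature_repr, atom_level, phrase_level])

# Toy usage: 3 atoms scored over 10 temporal segments.
atom_responses = np.random.rand(3, 10)
feature_repr = np.random.rand(128)                # stand-in for a low-level feature encoding
phrases = [[(0, (0, 5)), (2, (5, 10))]]           # one 2-atom phrase
print(mofap_activation(feature_repr, atom_responses, phrases).shape)
```

The concatenation at the end mirrors the idea of stacking feature-, atom-, and phrase-level activations into one representation; the actual mining, clustering, and scoring procedures are described in the paper itself.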







Notes
Here we use the notation \(\#\)-motion phrase to denote a motion phrase of size \(\#\).
Fig. 5 Exploration of the effect of motion phrase size on the Olympic Sports dataset. We first conduct experiments using motion phrases from a single scale (first two figures). Then, we investigate motion phrases mined from multiple scales and verify the effectiveness of hierarchical motion phrases (last two figures)
Additional information
Communicated by Ivan Laptev, Josef Sivic, and Deva Ramanan.
Cite this article
Wang, L., Qiao, Y. & Tang, X. MoFAP: A Multi-level Representation for Action Recognition. Int J Comput Vis 119, 254–271 (2016). https://doi.org/10.1007/s11263-015-0859-0