
Neurocomputing

Volume 99, 1 January 2013, Pages 144-153

LF-EME: Local features with elastic manifold embedding for human action recognition

https://doi.org/10.1016/j.neucom.2012.06.011

Abstract

Human action recognition has been an active topic in computer vision. Most current approaches to this problem fall into two classes: one based on local features and the other based on global features. Meanwhile, manifold learning has proven successful in many computer vision problems, but the high variability of the human body has limited its application to human action recognition. We propose a framework that combines Elastic Manifold Embedding (EME), a new sparse manifold learning algorithm, with local interest point features for human action recognition. The results of the new framework are very promising in comparison with state-of-the-art methods.

Introduction

Over the last decades, computer vision has evolved rapidly from recognizing simple objects towards analyzing complex objects and motions. Low-level visual cues are extracted by various feature extraction algorithms and are subsequently processed at a higher level to form semantic concepts through mathematical models. Among these problems, action recognition is a rapidly growing topic, since information about what people are doing in a video is crucial for many applications, including navigation, surveillance and video indexing.

Traditional action recognition methods can be categorized into two classes according to their low-level features: one based on local features and the other based on global features. Local features are usually extracted at certain points called “interest points”, grouped using Bag of Features (BOF) or Spatial Pyramid Matching (SPM), and classified by classifiers such as SVM. Local feature based methods are robust when the subject cannot be tracked consistently or undergoes occlusion. Global features are commonly called “templates” or “poses”, concepts borrowed from the computer animation community [1], and human actions are represented as the temporal variation of different templates or poses. Global feature based methods are suitable for static or tracked people, so the experimental settings are highly restricted.
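
As a minimal, illustrative sketch of such a local-feature pipeline (not the specific method of this paper), local descriptors can be clustered into a visual vocabulary with K-means, each video encoded as a histogram of visual-word occurrences, and the histograms classified with a linear SVM. The sketch below assumes scikit-learn is available and that descriptor extraction has been done elsewhere.

    # Minimal bag-of-features pipeline (illustrative sketch, not this paper's exact method).
    # Each video is assumed to be represented by an array of local descriptors
    # of shape (n_points, descriptor_dim), e.g. HOG/HOF around interest points.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import LinearSVC

    def build_vocabulary(descriptor_sets, n_words=200, seed=0):
        # Stack descriptors from all training videos and cluster them into visual words.
        all_desc = np.vstack(descriptor_sets)
        return KMeans(n_clusters=n_words, random_state=seed, n_init=10).fit(all_desc)

    def bof_histogram(descriptors, vocabulary):
        # Quantize each descriptor to its nearest visual word and build a normalized histogram.
        words = vocabulary.predict(descriptors)
        hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
        return hist / (hist.sum() + 1e-12)

    # Usage (train_sets/test_sets: lists of descriptor arrays; y_train: action labels):
    # vocab = build_vocabulary(train_sets)
    # X_train = np.array([bof_histogram(d, vocab) for d in train_sets])
    # clf = LinearSVC(C=1.0).fit(X_train, y_train)
    # y_pred = clf.predict(np.array([bof_histogram(d, vocab) for d in test_sets]))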

However, the aforementioned image features usually live in a high-dimensional space, which leads to the well-known “curse of dimensionality”. Analysis of the data distributions in many applications shows that data samples usually lie in a small part of the feature space, on a subspace that is locally quasi-Euclidean everywhere. This observation led to the development of manifold learning methods, which approximate the real distribution of the data manifold in the feature space and project the data onto a lower-dimensional space while keeping key discriminative information. In recent years, manifold-based nonlinear learning methods have achieved great success in many problems of computer vision and pattern recognition.
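
For intuition about this projection step (using an off-the-shelf manifold learner rather than the EME algorithm introduced later), high-dimensional feature vectors can be embedded into a few dimensions while preserving local neighborhood structure. The sketch below assumes scikit-learn and uses random data as a stand-in for real descriptors.

    # Illustrative manifold dimension reduction with locally linear embedding (not the paper's EME).
    import numpy as np
    from sklearn.manifold import LocallyLinearEmbedding

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 162))   # stand-in for 500 local descriptors of dimension 162

    # Project onto a low-dimensional space while preserving local linear neighborhoods.
    lle = LocallyLinearEmbedding(n_neighbors=10, n_components=3)
    Y = lle.fit_transform(X)          # (500, 3) embedded coordinates
    print(Y.shape)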

In the literature, manifold-based methods have been used together with human silhouette masks to model the smooth transformation of human body shapes. The body shape features are projected onto a lower-dimensional space, e.g. onto a topological structure such as a torus, to describe the continuous variation and periodicity of the motion. The trajectories of the projected points are connected and can be modeled using statistical models such as HMMs.

In this paper, we propose a new scheme to apply manifold learning to action recognition, using local features called space–time interest points, which capture discriminative local motion information of the body. To the best of our knowledge, such local features have not previously been used in a manifold learning framework. We propose Elastic Manifold Embedding (EME), which combines the strengths of manifold learning and sparse coding, and which can be solved very efficiently.
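
The precise EME formulation is introduced in Section 3. Purely as an illustration of how sparsity and manifold structure can be coupled in a single objective (a generic graph-regularized sparse coding formulation, not necessarily the EME objective itself), one may minimize, over a dictionary D and codes A, given features X and a graph Laplacian L built from feature neighborhoods:

    \min_{D,A} \; \|X - DA\|_F^2 \;+\; \lambda \sum_i \|a_i\|_1 \;+\; \gamma \, \mathrm{tr}\!\left(A L A^{\top}\right)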

The major contributions of our work are as follows:

  • A novel framework to put local video features on a manifold to reduce dimensionality while keeping similarity relationships.

  • A novel sparse manifold learning algorithm, Elastic Manifold Embedding (EME), which can be solved very quickly.

  • A novel action embedding and classification method based on the dimension-reduced local features.

Our method can be used in many computer vision applications, typically video surveillance. We assume the backgrounds of the training and testing videos are quasi-static, with simple textures, and that only one person performs a certain kind of action in each video. If there are multiple people in a video, they can be detected and tracked to separate them into individuals before action recognition. The actors in the videos can wear different clothes, with varying body sizes and appearances. The specific characteristics of each dataset are introduced in Section 4.

The rest of this paper is organized as follows: in Section 2, we briefly review related research on human action recognition and manifold learning. In Section 3, we introduce our method. In Section 4, experimental details and results are given. Conclusions and future work are presented in Section 5.

Section snippets

Human action recognition

In recent years, human action recognition has attracted much attention from the computer vision community. Traditionally, action recognition methods can be categorized according to the visual features and classification methods they adopt. Template-based methods were introduced in the early years; they analyze the human body shape as a whole [2]. In [3], Bobick and Davis introduced a global template-based feature called the Motion History Image (MHI), which describes the variation of body shape through time.
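
To make the MHI idea concrete (a simplified sketch of the concept in [3], not Bobick and Davis's exact implementation), each pixel is set to a maximum value when motion is detected there and decays otherwise, so recent motion appears brighter than older motion:

    # Motion History Image sketch (illustrative approximation of the idea in [3]).
    import numpy as np

    def update_mhi(mhi, frame, prev_frame, tau=255, delta=15, decay=8):
        # Pixels whose frame difference exceeds `delta` are set to `tau`;
        # all other pixels decay toward zero, so older motion fades out.
        mhi = mhi.astype(int)
        motion = np.abs(frame.astype(int) - prev_frame.astype(int)) > delta
        mhi = np.where(motion, tau, np.maximum(mhi - decay, 0))
        return mhi.astype(np.uint8)

    # Usage on a grayscale video: start with mhi = np.zeros(frame_shape, np.uint8)
    # and call update_mhi(mhi, current_frame, previous_frame) for each consecutive frame pair.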

Overview

In this paper, we introduce a novel framework for recognizing human actions by putting local space–time interest point features on a manifold. Local features are discriminative for action recognition, e.g. features describing the local motions of legs or arms, but they have never been embedded into a manifold for dimension reduction before. We show that, compared with traditional clustering and coding techniques, e.g., K-means or vector quantization (VQ), a manifold is a more natural way to reduce

Feature extraction

Local feature extraction is crucial for the success of human action recognition. The feature must be discriminative, concise and robust. In the field of action recognition, the Space–Time Interest Point feature is very popular due to its high discriminability and computational efficiency. We adopt the Space–Time Interest Point detector proposed in [7]; a free implementation can be obtained from the author's website. The detector finds 3D Harris corners in the video, filtered by scale-invariant
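
For concreteness, a simplified spatio-temporal Harris response (an illustrative sketch, not the exact detector of [7]) can be computed from the 3x3 structure tensor of smoothed space-time gradients; interest points are then local maxima of this response. The sketch below assumes NumPy and SciPy.

    # Simplified 3D (space-time) Harris response; illustrative only, not the detector of [7].
    import numpy as np
    from scipy.ndimage import gaussian_filter

    def st_harris_response(video, sigma=1.5, kappa=0.005):
        # video: float array of shape (T, H, W), grayscale.
        v = gaussian_filter(video, sigma)            # spatio-temporal smoothing
        gt, gy, gx = np.gradient(v)                  # temporal and spatial gradients
        grads = [gx, gy, gt]
        # Second-moment (structure tensor) entries, integrated with a wider Gaussian window.
        M = [[gaussian_filter(a * b, 2 * sigma) for b in grads] for a in grads]
        det = (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
               - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
               + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))
        trace = M[0][0] + M[1][1] + M[2][2]
        return det - kappa * trace ** 3              # high response = corner-like space-time point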

Conclusion and future work

In this paper, we have proposed a novel method to classify human actions by putting local Space–Time Interest Point features on a manifold. We proposed Elastic Manifold Embedding as our dimension reduction model, due to its sparseness and manifold properties. The projected features are integrated by Action Embedding, and then each video is classified according to its distance relations to other videos in the dataset, using a higher level of manifold learning. Extensive experiments have been

Acknowledgments

This work is supported by the National Natural Science Foundation of China (61170142), the National Key Technology R&D Program (2011BAG05B04), the Zhejiang Province Key S&T Innovation Group Project (2009R50009) and the Fundamental Research Funds for the Central Universities (2012FZA5017).

References (60)

  • D. Weinland et al.

    Free viewpoint action recognition using motion history volumes

    Comput. Vis. Image Understand.

    (2006)
  • B.A. Olshausen et al.

    Sparse coding with an overcomplete basis set: a strategy employed by V1?

    Vis. Res.

    (1997)
  • J. Yu et al.

    Complex object correspondence construction in two-dimensional animation

    IEEE Trans. Image Process.

    (2011)
  • A.A. Efros, A.C. Berg, G. Mori, J. Malik, Recognizing action at a distance, in: IEEE International Conference on...
  • A.F. Bobick et al.

    The recognition of human movement using temporal templates

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2001)
  • F. Martinez-Contreras, C. Orrite-Urunuela, E. Herrero-Jaraba, H. Ragheb, S.A. Velastin, Recognizing human actions using...
  • M.D. Rodriguez, J. Ahmed, M. Shah, Action MACH: a spatio-temporal maximum average correlation height filter for action...
  • I. Laptev, T. Lindeberg, Space–time interest points, in: IEEE International Conference on Computer Vision,...
  • I. Laptev

    On space–time interest points

    Int. J. Comput. Vis.

    (2005)
  • I. Laptev, M. Marszalek, C. Schmid, B. Rozenfeld, Learning realistic human actions from movies, in: IEEE Conference on...
  • J.C. Niebles et al.

    Unsupervised learning of human action categories using spatial–temporal words

    Int. J. Comput. Vis.

    (2008)
  • M. Bregonzio, J. Li, S. Gong, T. Xiang, Discriminative topics modelling for action feature selection and recognition,...
  • Y. Wang et al.

    Human action recognition by semi-latent topic models

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2009)
  • H. Jhuang, T. Serre, L. Wolf, T. Poggio, A biologically inspired system for action recognition, in: IEEE International...
  • M.-J. Escobar et al.

    Action recognition using a bio-inspired feedforward spiking network

    Int. J. Comput. Vis.

    (2009)
  • S. Singh, S.A. Velastin, H. Ragheb, MuHAVi: A multicamera human action video dataset for the evaluation of action...
  • E. Dexter, P. Perez, I. Laptev, Multi-view synchronization of human actions and dynamic scenes, in: British Machine...
  • H. Hotelling

    Analysis of a complex of statistical variables into principal components

    J. Educ. Psychol.

    (1933)
  • R.A. Fisher

    The use of multiple measurements in taxonomic problems

    Ann. Human Genet.

    (1936)
  • D. Tao et al.

    Geometric mean for subspace selection

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2009)
  • J.B. Tenenbaum et al.

    A global geometric framework for nonlinear dimensionality reduction

    Science

    (2000)
  • S.T. Roweis et al.

    Nonlinear dimensionality reduction by locally linear embedding

    Science

    (2000)
  • M. Belkin, P. Niyogi, Laplacian eigenmaps and spectral techniques for embedding and clustering, in: Advances in Neural...
  • Z. Zhang et al.

    Principal manifolds and nonlinear dimensionality reduction via tangent space alignment

    SIAM J. Sci. Comput.

    (2005)
  • X. He, P. Niyogi, Locality preserving projections, in: Advances in Neural Information Processing Systems, vol. 16, MIT...
  • X. He, D. Cai, S. Yan, H.-J. Zhang, Neighborhood preserving embedding, in: IEEE International Conference on Computer...
  • W. Bian et al.

    Max–min distance analysis by using sequential SDP relaxation for dimension reduction

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2011)
  • X. Wang et al.

    Subspaces indexing model on Grassmann manifold for image search

    IEEE Trans. Image Process.

    (2011)
  • N. Guan et al.

    Manifold regularized discriminative nonnegative matrix factorization with fast gradient descent

    IEEE Trans. Image Process.

    (2011)
  • N. Guan et al.

    Non-negative patch alignment framework

    IEEE Trans. Neural Networks

    (2011)

Xiaoyu Deng received the BS degree in Computer Science from Zhejiang University, China, in 2005. He is currently a candidate for a PhD degree in Computer Science at Zhejiang University. His research interests include image processing, pattern recognition, and video surveillance.

Xiao Liu received the BS degree in Computer Science from Zhejiang University, China, in 2011. He is currently a candidate for a PhD degree in Computer Science at Zhejiang University. His research interests include image processing, pattern recognition, and video surveillance.

Mingli Song received the PhD degree in Computer Science from Zhejiang University, China, in 2006. He is currently an associate professor in the College of Computer Science and the Microsoft Visual Perception Laboratory, Zhejiang University. His research interests include visual perception analysis, image enhancement, and face modeling. He is a member of the IEEE and the IEEE Computer Society.

Jun Cheng is now with the Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, as a Research Associate Professor and Director of the Laboratory for Semiconductor Equipment. He received the Bachelor of Engineering, Bachelor of Finance and Master of Engineering degrees from the University of Science & Technology of China in 1999 and 2002, respectively. His PhD degree was awarded by the Chinese University of Hong Kong in 2006. His research interests include computer vision, robotics, machine intelligence, and control. He has published more than 30 papers and applied for nine patents. Ongoing projects include an automated optical inspection (AOI) system, virtual sports based on computer vision, visual servo control, etc.

Jiajun Bu received the BS and PhD degrees in Computer Science from Zhejiang University, China, in 1995 and 2000, respectively. He is a professor in the College of Computer Science and the Director of the Embedded System and Software Center at Zhejiang University. His research interests include video coding in embedded systems, data mining, and mobile databases.

Chun Chen is a professor in the College of Computer Science at Zhejiang University, China. His research interests include computer vision, computer graphics, and embedded technology.
