Elsevier

Pattern Recognition

Volume 45, Issue 11, November 2012, Pages 3902-3911
Supervised class-specific dictionary learning for sparse modeling in action recognition

https://doi.org/10.1016/j.patcog.2012.04.024

Abstract

In this paper, we propose a new supervised classification method based on a modified sparse model for action recognition. The main contributions are three-fold. First, a novel hierarchical descriptor is presented for action representation. To capture spatial information about neighboring interest points, a compound motion and appearance feature is proposed for each interest point at the low level. Furthermore, at the high level, a continuous motion segment descriptor is presented to incorporate the temporal ordering information of motion. Second, we propose a modified sparse model which incorporates a similarity-constrained term and a dictionary incoherence term for classification. Our sparse model not only captures the correlations between similar samples by sharing dictionary bases, but also encourages dictionaries associated with different classes to be independent through the dictionary incoherence term. The proposed sparse model targets classification rather than pure reconstruction. Third, in the sparse model, we adopt a specific dictionary for each action class. Moreover, a classification loss function is proposed to optimize the class-specific dictionaries. Experiments validate that the proposed framework achieves performance comparable to the state of the art.

Highlights

► We propose a novel hierarchical descriptor for action representation.
► Our descriptor captures the spatial and temporal information of actions.
► We present a modified sparse model (SDSM) for classification.
► In SDSM, we adopt class-specific dictionaries to best reconstruct samples.
► A classification loss function is proposed to optimize class-specific dictionaries.

Introduction

In recent years, a large number of approaches have been proposed for action recognition. Among them, bag of visual words approaches are especially popular, owing to their simple implementation and good reliability. A video clip is summarized by the histogram of its local features. By fully exploiting local space–time features, bag of visual words approaches are robust to noise, occlusion and geometric variation, without requiring reliable tracks of a particular subject. Recent work has shown promising results using local space–time features together with bag of visual words models. The methods in Refs. [1], [2], [3] are classical interest-point-based methods for action recognition. These approaches extract a local feature from each single interest point, and achieve good results. However, because conventional interest-point-based methods describe each interest point in isolation, they rely on the individual discriminative power of the interest points and do not consider the spatio-temporal relationships between them. As an improvement, Gilbert et al. [4] perform dense interest point detection, and compute the distribution of interest points in a small area. Although this approach utilizes some spatial information, it does not capture the temporal ordering information in actions. A key limitation of interest-point-based representation is thus its failure to capture adequate spatial or temporal information.
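As a minimal illustration of the histogram step in a bag of visual words pipeline, the sketch below assigns each local descriptor to its nearest visual word and counts the assignments. The codebook, the toy two-dimensional descriptors and the function name `bow_histogram` are assumptions made for this example; they are not taken from the methods cited above, which use much higher-dimensional space–time descriptors and a learned codebook.

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Quantise each descriptor to its nearest visual word and return
    the normalised word-count histogram for the clip."""
    # Squared Euclidean distances, shape (n_descriptors, n_words).
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)  # index of the nearest word per descriptor
    hist = np.bincount(words, minlength=codebook.shape[0]).astype(float)
    return hist / max(hist.sum(), 1.0)  # guard against an empty clip

# Hypothetical toy data: 3 visual words, 4 local descriptors.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
descriptors = np.array([[0.1, -0.1], [0.9, 1.2], [1.1, 0.8], [4.8, 5.1]])
hist = bow_histogram(descriptors, codebook)
```

The histogram discards where and when each descriptor occurred, which is exactly the loss of spatial and temporal information discussed in this paragraph.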

Sparse representation has received a lot of attention from the signal processing community, due in part to the fact that various signals such as audio and natural images can be well approximated by a linear combination of a few elements of some redundant basis, usually called a dictionary. Recent publications on sparse representation have shown that this approach is very effective, leading to state-of-the-art results, e.g., in image restoration, image denoising, texture classification and texture synthesis. In supervised or weakly supervised methods, algorithms use the sparse codes of signals as features for classification [5], [6], [7], [8], [9]. However, these sparse models mainly minimize the reconstruction error; little attention is paid to improving classification.
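To make the idea of approximating a signal with a few dictionary elements concrete, the following sketch uses orthogonal matching pursuit, a standard greedy sparse coding algorithm. It is shown only as an illustration and is not claimed to be the solver used in this paper; the function name `omp` and the small hand-built dictionary are assumptions for the example.

```python
import numpy as np

def omp(D, x, n_nonzero):
    """Greedy orthogonal matching pursuit: approximate x using at most
    n_nonzero columns (atoms) of the dictionary D."""
    x = np.asarray(x, dtype=float)
    residual = x.copy()
    support = []                      # indices of the selected atoms
    a = np.zeros(D.shape[1])
    coef = np.zeros(0)
    for _ in range(n_nonzero):
        # Select the atom most correlated with the current residual.
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j in support:
            break
        support.append(j)
        # Re-fit all selected coefficients by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
        if np.linalg.norm(residual) < 1e-10:
            break
    if support:
        a[support] = coef
    return a

# Hypothetical redundant dictionary: 4 unit vectors plus one extra atom.
D = np.column_stack([np.eye(4), np.full(4, 0.5)])
x = np.array([3.0, 0.0, -2.0, 0.0])   # built from atoms 0 and 2 only
a_hat = omp(D, x, n_nonzero=2)
```

Here the signal is reconstructed exactly from two of the five atoms, so the code vector has only two nonzero entries.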

Recent research on dictionary learning for sparse coding has focused on learning discriminative sparse models instead of purely reconstructive ones. Mairal et al. [10] generalize the reconstructive sparse dictionary learning process by optimizing the sparse reconstruction jointly with a linear prediction model. Bradley and Bagnell [11] propose a novel differentiable sparse prior in place of the conventional L1 norm, and employ a back-propagation procedure to train the dictionary for sparse coding so as to minimize the training error. These approaches need to explicitly associate each sample with a label in order to perform supervised training. How to learn a discriminative dictionary for both sparse data representation and classification is still an open problem.

In this paper, we address the three problems mentioned above and present a novel framework for action recognition. Fig. 1 shows the flowchart of our framework. We make the following three contributions.

First, traditional interest-point-based representation only utilizes the features of a single interest point, and fails to capture adequate spatial or temporal information. It is therefore sensitive to noise. We consider spatial and temporal information to be very important for action representation, so a novel hierarchical descriptor is presented. The proposed compound appearance and motion feature captures spatial information about neighboring interest points. Furthermore, we propose a continuous motion segment descriptor to represent human action by capturing the temporal ordering information in actions. In this way, both spatial and temporal context information is utilized for action representation.

Second, we propose a modified sparse model for classification. Unlike traditional sparse representation, whose only task is to minimize the reconstruction error, the proposed sparse model targets classification. Given K action classes, we learn K class-specific dictionaries for representing the data, and then classify the test sample into the class whose dictionary generates the minimum reconstruction error. The similarity-constrained term is utilized to project each descriptor into its local coordinate system, which captures the correlations between similar descriptors by sharing bases. The dictionary incoherence term ensures that samples from different classes are reconstructed by independent dictionaries. Our proposed sparse model ensures that samples are best reconstructed by their own class-specific dictionary.
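The exact form of the dictionary incoherence term is not given in this excerpt. As a hedged illustration only, one common way to measure how far class dictionaries are from being independent is the sum of squared Frobenius norms of the cross-products $D_i^T D_j$ over all pairs of classes, the structured-incoherence penalty associated with Ramirez et al. (listed in the references). The sketch below assumes that form; the function name `dictionary_incoherence` and the toy dictionaries are hypothetical.

```python
import numpy as np

def dictionary_incoherence(dictionaries):
    """Sum over class pairs i < j of ||D_i^T D_j||_F^2.
    ASSUMPTION: this specific penalty is illustrative and is not
    confirmed to be the term used in the paper."""
    total = 0.0
    for i in range(len(dictionaries)):
        for j in range(i + 1, len(dictionaries)):
            cross = dictionaries[i].T @ dictionaries[j]
            total += np.linalg.norm(cross, "fro") ** 2
    return total

# Hypothetical class dictionaries in R^4 (atoms are columns).
D1 = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
D2 = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
D3 = D1.copy()  # spans the same subspace as D1

independent = dictionary_incoherence([D1, D2])
overlapping = dictionary_incoherence([D1, D3])
```

Under this measure, dictionaries spanning orthogonal subspaces score zero, while dictionaries that share atoms are penalized, which matches the stated goal of reconstructing different classes with independent dictionaries.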

Third, we introduce a classification loss function for the class-specific dictionary learning. The dictionaries are trained by minimizing the classification loss function. The test sample is classified into the class whose dictionary generates the minimum reconstruction error. The learned dictionaries are remarkably more discriminative.
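The decision rule described here, assigning a test sample to the class whose dictionary reconstructs it with the smallest error, can be sketched as follows. The sparse codes are computed with plain ISTA (iterative soft-thresholding) on an L1-regularized least-squares objective, which is a stand-in chosen for brevity and not the modified sparse model of this paper; the two toy dictionaries, the class order ("walk", "run") and the value of `lam` are hypothetical.

```python
import numpy as np

def ista(D, x, lam=0.1, n_iter=200):
    """Minimise 0.5*||x - D a||_2^2 + lam*||a||_1 by iterative
    soft-thresholding (a generic L1 solver, used here as a stand-in)."""
    step = 1.0 / (np.linalg.norm(D, 2) ** 2)  # 1 / Lipschitz constant
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = a - step * (D.T @ (D @ a - x))     # gradient step
        a = np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)
    return a

def classify(x, dictionaries, lam=0.1):
    """Return the index of the class dictionary with the minimum
    reconstruction error, together with all per-class errors."""
    errors = [float(np.linalg.norm(x - D @ ista(D, x, lam)))
              for D in dictionaries]
    return int(np.argmin(errors)), errors

# Hypothetical class-specific dictionaries for two classes in R^4.
D_walk = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
D_run = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

x = np.array([1.0, 0.5, 0.05, 0.0])  # lies mostly in the "walk" subspace
label, errors = classify(x, [D_walk, D_run])
```

In this toy case the sample is assigned to the first class, because the second dictionary cannot represent its dominant components and therefore leaves a much larger residual.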

The remainder of this paper is organized as follows. Section 2 gives a review of related approaches for action representation and sparse representation. Section 3 introduces the compound appearance and motion feature and the continuous motion segment descriptor. Section 4 presents a modified sparse model and a supervised class-specific dictionary learning method for classification. Section 5 demonstrates experimental results. Section 6 concludes this paper.

Related work

Over the last few years, many methods for action recognition have been presented, and impressive progress has been made. Approaches can be categorized on the basis of action representation: appearance-based representation [13], [14], [15], shape-based representation [16], [17], [18], optical-flow-based representation [19], [20], [21], volume-based representation [22], [23], [24] and interest-point-based representation [1], [2], [3]. A number of approaches adopt the bag of space–time interest

Action representation

Traditional interest-point-based representation only describes features of a single interest point, and fails to capture adequate spatial or temporal information. So it is easily influenced by noise. A novel hierarchical descriptor for action representation is introduced in this section. We present a compound appearance and motion feature, and then design a continuous motion segment descriptor based on our compound features. The compound feature considers the relationship between neighboring

Sparse representation and dictionary learning

Sparse representation has been successfully applied to solve some problems in computer vision, such as image restoration, image denoising, texture synthesis and texture classification. In this section, we propose a modified sparse representation for action recognition.

Sparse representation means to represent a signal as a linear combination of a few bases of a given dictionary. Mathematically, given a signal $x \in \mathbb{R}^{n}$ and a dictionary $D \in \mathbb{R}^{n \times k}$, the sparse representation problem is stated as $\min_{a} \|a\|_{0}$

Experiments

We evaluate our approach on three benchmark datasets for human action recognition: the KTH actions dataset [49], the Weizmann action recognition dataset [50], and the UCF Sports dataset [43]. All the video clips contain primarily a single action of interest. Examples of the datasets are shown in Fig. 6.

Conclusions

In this paper, we have presented a novel method for action representation based on compound features and continuous motion segments. Our descriptor incorporates spatial and temporal information which represents actions more accurately. We have also introduced a supervised classification based on class specific sparse representation and dictionary learning. We have proposed a classification loss function for the class specific dictionary learning as well. Our framework achieves comparable

Acknowledgment

This work was carried out at the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China. It was supported in part by the NSFC (Grant nos. 60825204, 60935002, 61100099 and 61005008) and the National 863 High-Tech R&D Program of China (Grant no. 2009AA01Z318).

References (53)

  • H. Lee et al., Efficient sparse coding algorithms, Advances in Neural Information Processing Systems (2007)
  • J. Yang, K. Yu, Y. Gong, T. Huang, Linear spatial pyramid matching using sparse coding for image classification, in:...
  • Ignacio Ramirez, Pablo Sprechmann, Guillermo Sapiro, Classification and clustering via dictionary learning with...
  • J. Mairal et al., Supervised dictionary learning, Advances in Neural Information Processing Systems (2008)
  • D.M. Bradley et al., Differentiable sparse coding, Advances in Neural Information Processing Systems (2008)
  • Jinjun Wang, Jianchao Yang, Kai Yu, Fengjun Lv, Thomas Huang, Yihong Gong, Locality-constrained linear coding for image...
  • H. Jiang, M. Crew, Z. Li, Successive convex matching for action detection, in: Proceedings of the IEEE Conference on...
  • T. Starner, A. Pentland, Visual recognition of American sign language using hidden Markov model, in: International...
  • L. Gorelick et al., Actions as space–time shapes, IEEE Transactions on Pattern Analysis and Machine Intelligence (2007)
  • A.F. Bobick et al., The recognition of human movement using temporal templates, IEEE Transactions on Pattern Analysis and Machine Intelligence (2001)
  • S. Ali et al., Human action recognition in videos using kinematic features and multiple instance learning, IEEE Transactions on Pattern Analysis and Machine Intelligence (2010)
  • T. Mahmood, A. Vasilescu, S. Sethi, Recognition of action events from multiple video points, in: Proceedings of the...
  • J. Liu, S. Ali, M. Shah, Recognizing human actions using multiple features, in: Proceedings of the IEEE Conference on...
  • P. Scovanner, S. Ali, M. Shah, A 3-dimensional SIFT descriptor and its application to action recognition, in:...
  • M. Marszalek, I. Laptev, C. Schmid, Actions in context, in: Proceedings of the IEEE Conference on Computer Vision and...
  • Y. Wang et al., Human action recognition by semilatent topic models, IEEE Transactions on Pattern Analysis and Machine Intelligence (2009)

    Haoran Wang received the B.S. degree from the Department of Information Science and Technology, Northeast University, Shenyang, China, in 2008. Now, he is a Ph.D. student in School of Automation at the Southeast University, Nanjing, China.

    Chunfeng Yuan received the B.S. and M.S. degrees in information science and technology from the Qingdao University of Science and Technology, China, in 2004 and 2007, respectively, and the Ph.D. degree in 2010 from the National Laboratory of Pattern Recognition at Institute of Automation, Chinese Academy of Sciences. She is currently working as an assistant professor at Institute of Automation, Chinese Academy of Sciences. Her main research interests include activity analysis and pattern recognition.

    Weiming Hu received the Ph.D. degree from the Department of Computer Science and Engineering, Zhejiang University. From April 1998 to March 2000, he was a Postdoctoral Research Fellow with the Institute of Computer Science and Technology, Founder Research and Design Center, Peking University. Since April 2000, he has been with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. He is now a Professor and a Ph.D. student supervisor in the laboratory.

    Changyin Sun is a professor in the School of Automation at Southeast University, China. He received the M.S. and Ph.D. degrees in Electrical Engineering from Southeast University, Nanjing, China, in 2001 and 2003, respectively. His research interests include Intelligent Control, Neural Networks, SVM, Pattern Recognition, Optimal Theory, etc. He has received the First Prize of Nature Science of Ministry of Education, China. He has published more than 40 papers. Professor Sun is a member of the IEEE and an Associate Editor of IEEE Transactions on Neural Networks, Neural Processing Letters, International Journal of Swarm Intelligence Research, and Recent Patents on Computer Science.