Supervised class-specific dictionary learning for sparse modeling in action recognition

doi:10.1016/j.patcog.2012.04.024

Pattern Recognition

Volume 45, Issue 11, November 2012, Pages 3902-3911

https://doi.org/10.1016/j.patcog.2012.04.024

Abstract

In this paper, we propose a new supervised classification method based on a modified sparse model for action recognition. The main contributions are three-fold. First, a novel hierarchical descriptor is presented for action representation. To capture spatial information about neighboring interest points, a compound motion and appearance feature is proposed for each interest point at the low level. Furthermore, at the high level, a continuous motion segment descriptor is presented to incorporate the temporal ordering information of motion. Second, we propose a modified sparse model which incorporates a similarity-constrained term and a dictionary incoherence term for classification. Our sparse model not only captures the correlations between similar samples by sharing a dictionary, but also encourages dictionaries associated with different classes to be independent through the dictionary incoherence term. The proposed sparse model targets classification rather than pure reconstruction. Third, in the sparse model, we adopt a specific dictionary for each action class. Moreover, a classification loss function is proposed to optimize the class-specific dictionaries. Experiments validate that the proposed framework achieves performance comparable to the state of the art.

Highlights

► We propose a novel hierarchical descriptor for action representation.
► Our descriptor captures the spatial and temporal information of actions.
► We present a modified sparse model (SDSM) for classification.
► In SDSM, we adopt class-specific dictionaries to best reconstruct samples.
► A classification loss function is proposed to optimize class-specific dictionaries.

Introduction

In recent years, a large number of approaches have been proposed for action recognition. Among them, bag of visual words approaches are especially popular because they are simple to implement and reliable. A video clip is summarized by the histogram of its local features. By fully exploiting local space–time features, bag of visual words approaches are robust to noise, occlusion and geometric variation, and do not require reliable tracking of a particular subject. Recent work has shown promising results using local space–time features together with bag of visual words models. The methods in Refs. [1], [2], [3] are classical interest-point-based methods for action recognition. These approaches extract a local feature from each single interest point and achieve good results. However, because conventional interest-point-based methods describe each interest point in isolation, they rely on the individual discriminative power of the interest points and do not consider the spatio-temporal relationships between them. As an improvement, Gilbert et al. [4] perform dense interest point detection and compute the distribution of interest points in a small area. Although this approach utilizes some spatial information, it does not capture the temporal ordering information in actions. A key limitation of interest-point-based representation is therefore its failure to capture adequate spatial or temporal information.
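As a rough illustration of how a clip can be summarized by the histogram of its local features, the following minimal sketch assigns each local descriptor to its nearest codeword and counts the assignments. The function name `bow_histogram`, the toy codebook, and the toy descriptors are assumptions made for illustration; they are not taken from the paper, which may build its codebook and histogram differently.

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Quantize local descriptors against a codebook and return a normalized histogram.

    descriptors: (num_points, dim) array of local features (hypothetical input).
    codebook:    (num_words, dim) array of visual words (hypothetical input).
    """
    # Squared Euclidean distance between every descriptor and every codeword.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    # Index of the nearest codeword for each descriptor.
    nearest = d2.argmin(axis=1)
    # Count how often each codeword was chosen.
    hist = np.bincount(nearest, minlength=codebook.shape[0]).astype(float)
    total = hist.sum()
    return hist / total if total > 0 else hist

# Toy data (assumed values): three visual words and four local descriptors.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
descriptors = np.array([[0.1, -0.1], [0.9, 1.2], [1.1, 0.8], [4.8, 5.1]])
hist = bow_histogram(descriptors, codebook)
```

Because the histogram only counts codeword occurrences, it discards where and when each interest point occurred, which is the limitation discussed above.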

Sparse representation has received a lot of attention from the signal processing community, due in part to the fact that various signals such as audio and natural images can be well approximated by a linear combination of a few elements of some redundant basis, usually called a dictionary. Recent publications on sparse representation have shown that this approach is very effective, leading to state-of-the-art results in, e.g., image restoration, image denoising, texture classification and texture synthesis. In supervised or weakly supervised methods, algorithms use the sparse codes of signals as features for classification [5], [6], [7], [8], [9]. However, these sparse models mainly aim to minimize the reconstruction error, and little attention is paid to improving classification.

Recent research on dictionary learning for sparse coding has focused on learning discriminative sparse models instead of purely reconstructive ones. Mairal et al. [10] generalize the reconstructive sparse dictionary learning process by optimizing the sparse reconstruction jointly with a linear prediction model. Bradley and Bagnell [11] propose a novel differentiable sparse prior in place of the conventional L₁ norm, and employ a backpropagation procedure to train the dictionary for sparse coding in order to minimize the training error. These approaches need to explicitly associate each sample with a label in order to perform the supervised training. How to learn a discriminative dictionary for both sparse data representation and classification is still an open problem.

In this paper, we address the three problems mentioned above and present a novel framework for action recognition. Fig. 1 shows the flowchart of our framework. We make the following three contributions.

First, traditional interest-point-based representation only utilizes features of a single interest point, and fails to capture adequate spatial or temporal information. It is therefore sensitive to noise. Because we consider spatial and temporal information very important for action representation, we present a novel hierarchical descriptor. The proposed compound appearance and motion feature captures spatial information of neighboring interest points. Furthermore, we propose a continuous motion segment descriptor to represent human action by capturing the temporal ordering information in actions. In this way, both spatial and temporal context information is utilized for action representation.

Second, we propose a modified sparse model for classification. Unlike traditional sparse representation, whose only task is to minimize the reconstruction error, the proposed sparse model targets classification. Given K action classes, we learn K class-specific dictionaries for representing the data, and then classify the test sample into the class whose dictionary generates the minimum reconstruction error. The similarity-constrained term is utilized to project each descriptor into its local coordinate system, which captures the correlations between similar descriptors by sharing bases. The dictionary incoherence term ensures that samples from different classes are reconstructed by independent dictionaries. Our proposed sparse model ensures that samples are best reconstructed by their own class-specific dictionary.
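The exact form of the dictionary incoherence term is not given in this excerpt. One measure that is common in the dictionary learning literature is the squared Frobenius norm of the cross-Gram matrix between two dictionaries, which is zero when the two dictionaries span mutually orthogonal subspaces. The sketch below assumes that form; the function name `incoherence_penalty` and the toy dictionaries are hypothetical.

```python
import numpy as np

def incoherence_penalty(dictionaries):
    """Sum of ||D_i^T D_j||_F^2 over all pairs i < j.

    Assumed form of an incoherence measure; the paper's own term may differ.
    Each dictionary is an (n, k_i) array whose columns are atoms.
    """
    total = 0.0
    for i in range(len(dictionaries)):
        for j in range(i + 1, len(dictionaries)):
            cross_gram = dictionaries[i].T @ dictionaries[j]
            total += np.linalg.norm(cross_gram, "fro") ** 2
    return total

# Toy dictionaries in R^4 (assumed values): D_a uses the first two
# coordinates, D_b the last two, so their atoms are mutually orthogonal.
D_a = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
D_b = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

penalty_orthogonal = incoherence_penalty([D_a, D_b])  # fully incoherent pair
penalty_identical = incoherence_penalty([D_a, D_a])   # fully coherent pair
```

Adding such a penalty to a dictionary learning objective discourages atoms of one class from being useful for reconstructing samples of another class.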

Third, we introduce a classification loss function for the class-specific dictionary learning. The dictionaries are trained by minimizing the classification loss function. The test sample is classified into the class whose dictionary generates the minimum reconstruction error. The learned dictionaries are remarkably more discriminative.
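The decision rule described above (assign the test sample to the class whose dictionary gives the minimum reconstruction error) can be sketched as follows. This is a minimal illustration under stated assumptions: the sparse codes are computed with plain ISTA on an L₁-regularized least-squares problem, which stands in for the paper's modified sparse model, and the names `ista`, `classify`, `lam`, and the toy dictionaries are hypothetical.

```python
import numpy as np

def ista(x, D, lam=0.1, n_iter=200):
    """Approximately solve min_a 0.5*||x - D a||_2^2 + lam*||a||_1 with ISTA.

    Stand-in solver (an assumption); not the paper's modified sparse model.
    """
    step_const = np.linalg.norm(D, 2) ** 2  # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)
        z = a - grad / step_const
        # Soft-thresholding: proximal operator of the L1 norm.
        a = np.sign(z) * np.maximum(np.abs(z) - lam / step_const, 0.0)
    return a

def classify(x, dictionaries, lam=0.1):
    """Return the index of the dictionary with minimum reconstruction error."""
    errors = [float(np.linalg.norm(x - D @ ista(x, D, lam))) for D in dictionaries]
    return int(np.argmin(errors)), errors

# Toy class-specific dictionaries in R^4 (assumed values).
D_class0 = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
D_class1 = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

x_test = np.array([1.0, 0.5, 0.0, 0.0])  # lies in the span of D_class0
label, errors = classify(x_test, [D_class0, D_class1])
```

In this toy case the sample lies in the span of the first dictionary, so the first reconstruction error is small while the second dictionary cannot represent the sample at all.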

The remainder of this paper is organized as follows. Section 2 gives a review of related approaches for action representation and sparse representation. Section 3 introduces the compound appearance and motion feature and the continuous motion segment descriptor. Section 4 presents a modified sparse model and a supervised class-specific dictionary learning method for classification. Section 5 demonstrates experimental results. Section 6 concludes this paper.

Section snippets

Related work

Over the last few years, many methods for action recognition have been presented and made impressive progress. Approaches can be categorized on the basis of action representation. There are appearance-based representation [13], [14], [15], shape-based representation [16], [17], [18], optical-flow-based representation [19], [20], [21], volume-based representation [22], [23], [24] and interest-point-based representation [1], [2], [3]. A number of approaches adopt the bag of space–time interest

Action representation

Traditional interest-point-based representation only describes features of a single interest point, and fails to capture adequate spatial or temporal information. So it is easily influenced by noise. A novel hierarchical descriptor for action representation is introduced in this section. We present a compound appearance and motion feature, and then design a continuous motion segment descriptor based on our compound features. The compound feature considers the relationship between neighboring

Sparse representation and dictionary learning

Sparse representation has been successfully applied to solve some problems in computer vision, such as image restoration, image denoising, texture synthesis and texture classification. In this section, we propose a modified sparse representation for action recognition.

Sparse representation means representing a signal as a linear combination of a few bases of a given dictionary. Mathematically, given a signal $x \in \mathbb{R}^{n}$ and a dictionary $D \in \mathbb{R}^{n \times k}$, the sparse representation problem is stated as $\min_{a} \|a\|_{0}$
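The snippet above is truncated before the constraint of the problem is stated. A standard reading, assumed here, is that the ℓ₀ norm of the code is minimized subject to the code reconstructing the signal (exactly or within a tolerance). Since that problem is combinatorial, greedy methods such as orthogonal matching pursuit (OMP) are often used to approximate it. The sketch below is a generic OMP, not the solver used in the paper; the function name `omp`, the sparsity budget `n_nonzero`, and the toy dictionary are illustrative assumptions.

```python
import numpy as np

def omp(x, D, n_nonzero):
    """Greedy orthogonal matching pursuit (generic sketch, not the paper's solver).

    Approximates x with at most n_nonzero columns (atoms) of D.
    """
    residual = x.astype(float).copy()
    support = []                 # indices of selected atoms
    coef = np.zeros(0)
    a = np.zeros(D.shape[1])
    for _ in range(n_nonzero):
        # Select the atom most correlated with the current residual.
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k in support:
            break                # no new atom improves the fit
        support.append(k)
        # Re-fit all selected coefficients jointly by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
        if np.linalg.norm(residual) < 1e-12:
            break                # signal already reconstructed
    if support:
        a[support] = coef
    return a

# Toy dictionary in R^3 with four unit-norm atoms (assumed values).
D = np.array([
    [1.0, 0.0, 0.0, 1.0 / np.sqrt(2.0)],
    [0.0, 1.0, 0.0, 1.0 / np.sqrt(2.0)],
    [0.0, 0.0, 1.0, 0.0],
])
x = np.array([1.0, 0.0, 3.0])    # equals 1 * atom 0 + 3 * atom 2
a = omp(x, D, n_nonzero=2)
```

Here the dictionary has more atoms than dimensions, and OMP recovers a code that uses only two of the four atoms.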

Experiments

We evaluate our approach on three benchmark datasets for human action recognition: the KTH actions dataset [49], the Weizmann action recognition dataset [50], and the UCF Sports dataset [43]. All the video clips contain primarily a single action of interest. Examples of the datasets are shown in Fig. 6.

Conclusions

In this paper, we have presented a novel method for action representation based on compound features and continuous motion segments. Our descriptor incorporates spatial and temporal information which represents actions more accurately. We have also introduced a supervised classification based on class specific sparse representation and dictionary learning. We have proposed a classification loss function for the class specific dictionary learning as well. Our framework achieves comparable

Acknowledgment

This work was carried out at the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China. It was supported in part by the NSFC (Grant nos. 60825204, 60935002, 61100099 and 61005008) and the National 863 High-Tech R&D Program of China (Grant no. 2009AA01Z318).

Haoran Wang received the B.S. degree from the Department of Information Science and Technology, Northeast University, Shenyang, China, in 2008. He is now a Ph.D. student in the School of Automation at Southeast University, Nanjing, China.

References (53)

L. Qiao et al., Sparsity preserving projections with applications to face recognition, Pattern Recognition (2010)
M. Fan et al., Sparse regularization for semi-supervised classification, Pattern Recognition (2011)
I. Kotsia et al., Texture and shape information fusion for facial expression and facial action unit recognition, Pattern Recognition (2008)
J. Zhang et al., Action categorization with modified hidden conditional random field, Pattern Recognition (2010)
S. Xiang et al., Contour graph based human tracking and action sequence recognition, Pattern Recognition (2008)
M. Ahmad et al., Human action recognition using shape and CLG-motion flow from multi-view image sequences, Pattern Recognition (2008)
I. Laptev, M. Marszalek, C. Schmid, B. Rozenfeld, Learning realistic human actions from movies, in: Proceedings of the...
I. Laptev, On space–time interest points, International Journal of Computer Vision (2005)
J.C. Niebles et al., Unsupervised learning of human action categories using spatial–temporal words, International Journal of Computer Vision (2008)
A. Gilbert, J. Illingworth, R. Bowden, Fast realistic multi-action recognition using mined dense spatio-temporal...
H. Lee et al., Efficient sparse coding algorithms, Advances in Neural Information Processing Systems (2007)
J. Yang, K. Yu, Y. Gong, T. Huang, Linear spatial pyramid matching using sparse coding for image classification, in:...
Ignacio Ramirez, Pablo Sprechmann, Guillermo Sapiro, Classification and clustering via dictionary learning with...
J. Mairal et al., Supervised dictionary learning, Advances in Neural Information Processing Systems (2008)
D.M. Bradley et al., Differential sparse coding, Advances in Neural Information Processing Systems (2008)
Jinjun Wang, Jianchao Yang, Kai Yu, Fengjun Lv, Thomas Huang, Yihong Gong, Locality-constrained linear coding for image...
H. Jiang, M. Crew, Z. Li, Successive convex matching for action detection, in: Proceedings of the IEEE Conference on...
T. Starner, A. Pentland, Visual recognition of American sign language using hidden Markov model, in: International...
L. Gorelick et al., Actions as space–time shapes, IEEE Transactions on Pattern Analysis and Machine Intelligence (2007)
A.F. Bobick et al., The recognition of human movement using temporal templates, IEEE Transactions on Pattern Analysis and Machine Intelligence (2001)
S. Ali et al., Human action recognition in videos using kinematic features and multiple instance learning, IEEE Transactions on Pattern Analysis and Machine Intelligence (2010)
T. Mahmood, A. Vasilescu, S. Sethi, Recognition of action events from multiple video points, in: Proceedings of the...
J. Liu, S. Ali, M. Shah, Recognizing human actions using multiple features, in: Proceedings of the IEEE Conference on...
P. Scovanner, S. Ali, M. Shah, A 3-dimensional SIFT descriptor and its application to action recognition, in:...
M. Marszalek, I. Laptev, C. Schmid, Actions in context, in: Proceedings of the IEEE Conference on Computer Vision and...
Y. Wang et al., Human action recognition by semilatent topic models, IEEE Transactions on Pattern Analysis and Machine Intelligence (2009)


Chunfeng Yuan received the B.S. and M.S. degrees in information science and technology from the Qingdao University of Science and Technology, China, in 2004 and 2007, respectively, and the Ph.D. degree in 2010 from the National Laboratory of Pattern Recognition at the Institute of Automation, Chinese Academy of Sciences. She is currently working as an assistant professor at the Institute of Automation, Chinese Academy of Sciences. Her main research interests include activity analysis and pattern recognition.

Weiming Hu received the Ph.D. degree from the Department of Computer Science and Engineering, Zhejiang University. From April 1998 to March 2000, he was a Postdoctoral Research Fellow with the Institute of Computer Science and Technology, Founder Research and Design Center, Peking University. Since April 2000, he has been with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. He is now a Professor and Ph.D. student supervisor in the laboratory.

Changyin Sun is a professor in the School of Automation at Southeast University, China. He received the M.S. and Ph.D. degrees in Electrical Engineering from Southeast University, Nanjing, China, in 2001 and 2003, respectively. His research interests include Intelligent Control, Neural Networks, SVM, Pattern Recognition, Optimal Theory, etc. He has received the First Prize of Nature Science of Ministry of Education, China. He has published more than 40 papers. Professor Sun is a member of the IEEE and an Associate Editor of IEEE Transactions on Neural Networks, Neural Processing Letters, International Journal of Swarm Intelligence Research, and Recent Patents on Computer Science.
