Supervised learning probabilistic Latent Semantic Analysis for human motion analysis
Introduction
With the advancement of modern imaging technology and the reduction in electronic hardware cost, more surveillance systems are being installed in public places. Automatic analysis of human motion in a large number of video sequences is a challenging task. When human figures are recorded in a crowded scene, the resulting video sequences often contain occlusions. Moreover, there may exist substantial variations within the same class of motions performed by different subjects. Even the same subject may perform the same activity at different speeds, giving rise to temporal variations within a class. Due to these challenges, human motion analysis algorithms that can model complex scenarios and simultaneously be robust to viewpoints, noise and occlusion are highly desirable. A general framework for an automatic human behavior understanding system may consist of video acquisition, human detection, motion representation, motion recognition, and motion semantic description [1], [2], [3]. Two important questions are associated with human motion analysis. The first is how to effectively encode human figures in video sequences. The second is how to model temporally dynamic motion sequences so that the variations and similarities between test and reference sequences can be exploited in the training and recognition algorithms.
Our approach is inspired by the success of the bag-of-words method in computer vision fields including image segmentation, object categorization and activity recognition [4], [5], [6], [7], [8]. In this paper, a general framework is proposed for the analysis of human motion in videos based on the bag-of-words representation and the probabilistic Latent Semantic Analysis (pLSA) model (see Fig. 1). This framework consists of detecting human subjects in videos, extracting pyramid Histogram of Oriented Gradients (HoG) descriptors, constructing a visual codebook by k-means clustering, and supervised learning of the pLSA model for recognition.
Once interest regions containing human figures are extracted by tracking or detection algorithms, such as particle filters [9] and background subtraction [10], we characterize the human figures in each frame at different spatial scales using the pyramid HoG descriptor [11]. The pyramid HoG descriptor can encode a human figure in a compact way without extracting human silhouettes. Moreover, the descriptor is, to some extent, invariant to rotation and translation [11].
Each frame described by the pyramid HoG descriptor is treated as a word in the bag-of-words representation. All the unordered groups of these words from the training video sequences become a bag of words. Although the temporal order information between image frames is lost, the bag-of-words representation remains effective and discriminative due to the characteristics of the supervised learning of the pLSA model. In order to construct the codebook, we cluster all the pyramid HoG features extracted from the training frames using the k-means algorithm with the Euclidean distance metric. The center of each cluster is defined as a codeword, and the set of cluster centers forms the codebook.
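The codebook construction described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the number of codewords, iteration count and initialization scheme are illustrative assumptions, and a plain Lloyd-style k-means with Euclidean distance stands in for whatever k-means variant the authors used.

```python
import numpy as np

def build_codebook(features, k, iters=20, seed=0):
    """Cluster frame descriptors with plain k-means (Euclidean distance).

    Each cluster center becomes one codeword; the k centers form the codebook.
    `features` is an (N, D) array of per-frame pyramid HoG descriptors.
    """
    rng = np.random.default_rng(seed)
    # initialize centers from k randomly chosen descriptors
    centers = features[rng.choice(len(features), size=k, replace=False)].astype(np.float64)
    for _ in range(iters):
        # assign each descriptor to its nearest center
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of its assigned descriptors
        for c in range(k):
            members = features[labels == c]
            if len(members):
                centers[c] = members.mean(axis=0)
    return centers

def quantize(frames, codebook):
    """Map each frame descriptor to the index of its nearest codeword."""
    dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
    return dists.argmin(axis=1)
```

A video is then represented by the sequence (or histogram) of codeword indices returned by `quantize`, discarding temporal order as in the bag-of-words model.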
The pLSA [12] was originally proposed to model text collections in an unsupervised way. It assumes that words are generated from a mixture of latent aspects into which a document can be decomposed. Here, we regard each aspect in the pLSA as one particular motion class. In other words, the number of aspects is equal to the number of motion classes in the videos. As such, the class label of a new video can be determined by the distribution of the aspects in the pLSA. We note the importance of the class label information in the training data for the classification task. Exploiting this information, we propose to learn the pLSA model in a supervised manner, which not only simplifies the learning process of the pLSA, but also improves its recognition accuracy.
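Under our reading of this supervised setting, tying one aspect to each class means the word-given-aspect distributions can be estimated by direct counting over each class's training frames, with no EM iterations. The sketch below illustrates that idea under stated assumptions; the smoothing constant and function names are ours, not the paper's, and the classification rule shown (maximum bag-of-words log-likelihood over classes) is one simple way to use the counted distributions.

```python
import numpy as np

def train_supervised_plsa(word_seqs, labels, vocab_size, num_classes, eps=1e-6):
    """Estimate P(word | class) by direct counting, one aspect per class.

    With class labels available, each aspect z is tied to one motion class,
    so P(w|z) is simply the (smoothed) relative frequency of codeword w in
    that class's training frames -- no iterative EM training is needed.
    `word_seqs` is a list of codeword-index sequences, one per training video.
    """
    p_w_given_c = np.full((num_classes, vocab_size), eps)
    for words, c in zip(word_seqs, labels):
        for w in words:
            p_w_given_c[c, w] += 1.0
    # normalize each class's counts into a probability distribution
    p_w_given_c /= p_w_given_c.sum(axis=1, keepdims=True)
    return p_w_given_c

def classify(words, p_w_given_c):
    """Label a new video by the class maximizing its bag-of-words log-likelihood."""
    log_p = np.log(p_w_given_c)
    return int(log_p[:, words].sum(axis=1).argmax())
```

Because the parameters are obtained in a single counting pass, training cost grows linearly with the total number of frames, which is the simplification over unsupervised EM that the supervised formulation buys.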
The main contributions of this paper are threefold. Firstly, we propose to encode human figures in videos with the pyramid HoG descriptor for motion analysis, which does not require extraction of human silhouettes or contours. Secondly, we extend the standard pLSA model to make use of the class label information in the training data, and propose to train the pLSA in a supervised fashion. Specifically, the parameters are directly counted from the training videos and no further iteration is required during training. Thirdly, experiments on two public activity datasets are conducted; we achieve comparable or higher recognition accuracy compared to other state-of-the-art methods in the literature. The remainder of the paper is organized as follows. Section 2 briefly reviews related work on human motion analysis and topic models in computer vision. Section 3 gives the details of the proposed approach, including motion representation, codebook formulation and the supervised pLSA model. Section 4 analyzes the experimental results of our method on two public datasets. Finally, concluding remarks are given in Section 5.
Motion representations
Various features and classification methods have been proposed to recognize human motions in video sequences. Appearance-based features focus on the shapes of silhouettes or the contours of the human body, and have the advantage of low computational cost. Wang and Suter [13] divided raw silhouette sequences into many sub-blocks as visual features; although relatively simple, these features are effective and efficient. Later, they [14] applied manifold learning to the Distance Transform
The proposed method
The pyramid HoG was originally used for object retrieval in [11]. It divides a tracked and localized region of interest into a number of cells at several pyramid levels. The gradient orientations of all pixels within each cell are accumulated into a histogram, and all the histograms are then concatenated to construct the final descriptor.
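The cell-wise accumulation and concatenation above can be sketched as follows. This is a minimal illustration of the pyramid HoG idea, not the descriptor of [11] itself: the number of levels, the number of orientation bins, the unsigned-orientation convention and the global L2 normalization are all illustrative assumptions.

```python
import numpy as np

def pyramid_hog(patch, levels=3, bins=8):
    """Concatenate gradient-orientation histograms over a spatial pyramid.

    At level l the patch is divided into 2**l x 2**l cells; each cell
    contributes an orientation histogram weighted by gradient magnitude,
    and all histograms are concatenated into one descriptor.
    """
    patch = patch.astype(np.float64)
    gy, gx = np.gradient(patch)
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned orientation in [0, pi)

    h, w = patch.shape
    feature = []
    for level in range(levels):
        n = 2 ** level
        ys = np.linspace(0, h, n + 1).astype(int)
        xs = np.linspace(0, w, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                m = mag[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].ravel()
                a = ang[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].ravel()
                # magnitude-weighted orientation histogram for this cell
                hist, _ = np.histogram(a, bins=bins, range=(0, np.pi), weights=m)
                feature.append(hist)
    f = np.concatenate(feature)
    norm = np.linalg.norm(f)
    return f / norm if norm > 0 else f
```

With these choices the descriptor length is (1 + 4 + 16) * 8 = 168; coarse levels capture the overall body shape while fine levels capture limb configuration.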
The basic assumption is that subjects in a video can be tracked and stabilized. The size of the detected region of interest varies from frame to frame due to different subjects
Datasets
We evaluated the proposed method on two publicly available standard datasets used in [15] and [36] referred to as the Weizmann dataset and the UMD dataset, respectively. We directly used the tracked interest regions provided by the datasets, as human tracking is not our main concern in this work.
Weizmann dataset: There are 10 actions (bend, jack, jump, pjump, run, side, skip, walk, one-hand wave, and two-hands wave) performed by nine persons. The subjects are captured in a simple background
Conclusion
In this paper, we extended the pLSA model to recognize human activity in video sequences based on the bag-of-words representation. Each frame in a video is encoded using the pyramid HoG descriptor, which requires no extraction of silhouettes or contours. The pyramid HoG descriptor can encode human bodies at different degrees of detail according to the pyramid levels. We treat each frame in a video as a word, and cluster the frames in the training video sequences to construct the codebook.
References (41)
A survey on vision-based human action recognition, Image Vis. Comput. (2010)
Recent developments in human motion analysis, Pattern Recognition (2003)
Human motion analysis: a review, Comput. Vis. Image Understanding (1999)
Visual learning and recognition of sequential data manifolds with applications to human movement analysis, Comput. Vis. Image Understanding (2008)
Conditional models for contextual human motion recognition, Comput. Vis. Image Understanding (2006)
Action categorization with modified hidden conditional random field, Pattern Recognition (2010)
Action categorization by structural probabilistic latent semantic analysis, Comput. Vis. Image Understanding (2010)
Human action recognition by feature-reduced Gaussian process classification, Pattern Recognition Lett. (2009)
L. Fei-Fei, P. Perona, A Bayesian hierarchical model for learning natural scene categories, in: IEEE Conference on...
J. Sivic, B. Russell, A. Efros, A. Zisserman, W. Freeman, Discovering objects and their location in images, in: Tenth...
Unsupervised learning of human action categories using spatial–temporal words, Int. J. Comput. Vis.
Human action recognition by semilatent topic models, IEEE Trans. Pattern Anal. Mach. Intell.
A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking, IEEE Trans. Signal Process.
Unsupervised learning by probabilistic latent semantic analysis, Mach. Learn.
Actions as space–time shapes, IEEE Trans. Pattern Anal. Mach. Intell.
Cited by (12)
A human action recognition approach with a novel reduced feature set based on the natural domain knowledge of the human figure, Signal Processing: Image Communication (2015)
Biomedical time series clustering based on non-negative sparse coding and probabilistic topic model, Computer Methods and Programs in Biomedicine (2013)
Enhancing focus topic findings of discussion forum through corpus classifier algorithm, International Journal of Recent Technology and Engineering (2019)
ML-HDP: A Hierarchical Bayesian Nonparametric Model for Recognizing Human Actions in Video, IEEE Transactions on Circuits and Systems for Video Technology (2019)
Improving Human Action Recognition through Hierarchical Neural Network Classifiers, Proceedings of the International Joint Conference on Neural Networks (2018)
Automatic Acquisition of Appropriate Codewords Number in BoVW Model and the Corresponding Scene Classification Performance, Chinese Control Conference, CCC (2018)
Jin Wang received his M.S. in Pattern Recognition and Artificial Intelligence from Huazhong University of Science and Technology (HUST) in 2009. Currently, he is pursuing the Ph.D. degree in the Institute for Technology Research and Innovation (ITRI) at Deakin University, Australia. His major research interests are computer vision, pattern recognition and intelligent system, including automatic video analysis, biomedical time series analysis and intelligent wearable systems.
Ping Liu received his bachelor degree (EE) from Wuhan University of Technology in 2005 and his master degree (CS) from Huazhong University of Science and Technology in 2008. Since 2011, he has been studying at the University of South Carolina. His interests include applying modern computer vision and machine learning techniques to human motion analysis, human facial recognition, and image/video retrieval.
Mary F.H. She received her B.Sc. and M.Sc. degrees in Engineering from Donghua University, Shanghai, China, and her Ph.D. from Deakin University, Victoria, Australia. After graduation, she was awarded an Australian Postdoctoral Fellowship by the Australian Research Council in 2002 and worked on image analysis and artificial intelligence technologies for materials characterization and animal monitoring at the University of South Australia for 4.5 years. She currently holds a position as a research fellow in the Institute of Technology and Research Innovation at Deakin University. Her major research interests include image processing and analysis, pattern recognition, artificial intelligence and intelligent wearable systems.
Abbas Kouzani received his B.Sc. degree in computer engineering from Sharif University of Technology, Iran, in 1990, his M.Eng. degree in electrical and electronics engineering from the University of Adelaide, Australia, in 1995, and his Ph.D. degree in electrical and electronics engineering from Flinders University, Australia, in 1999. He was a lecturer with the School of Engineering, Deakin University, and then a Senior Lecturer with the School of Electrical Engineering and Computer Science, University of Newcastle, Australia. Currently, he is an Associate Professor with the School of Engineering, Deakin University. He has been involved in several ARC, industry, and university research grants, and has more than 150 publications. His research interests include intelligent micro-electromechanical systems.
Saeid Nahavandi (SM07) received the B.Sc. (hons.), M.Sc., and Ph.D. degrees in automation and control from Durham University, Durham, UK. He is the Alfred Deakin Professor, Chair of Engineering, and the Director for the Center for Intelligent Systems Research (CISR), Deakin University, Geelong, VIC, Australia. He has published over 420 peer reviewed papers in various international journals and conferences. His research interests include modeling of complex systems, simulation-based optimization, robotics, haptics and augmented reality. Dr. Nahavandi is the Associate Editor of the IEEE Systems Journal, an Editorial Consultant Board member for the International Journal of Advanced Robotic Systems, an Editor (South Pacific Region) of the International Journal of Intelligent Automation and Soft Computing. He is a Fellow of Engineers Australia (FIEAust), IET (FIET) and Senior member of IEEE (SMIEEE).