Elsevier

Neurocomputing

Volume 218, 19 December 2016, Pages 185-196
A novel mid-level distinctive feature learning for action recognition via diffusion map

https://doi.org/10.1016/j.neucom.2016.08.057

Abstract

Recent work has shown that mid-level features are superior to low-level features: they not only improve discriminative power but also enhance descriptive capability. In this paper, the classical STIP, a spatial star-graph and a temporal star-graph are first extracted to represent human action from multiple perspectives. A principled feature learning algorithm is then proposed to embed these multiple cues into a unified space and enhance all low-level features using a diffusion map. Unlike approaches that treat spatio-temporal patches as mid-level primitives, we use a graph to model the different types of primitives and then apply graph partitioning to co-cluster them into visual-word clusters, called mid-level distinctive features, which bridge the semantic gap across low-level features. Experimental results show that our approach classifies human activities with higher accuracy on both single-person actions (KTH and UCF) and complex interactional activities (UT-Interaction and HMDB51).

Introduction

How should a video be represented? This question is as old as the computer vision discipline itself, and it remains a key issue in human action recognition. A wide spectrum of visual features has been proposed, ranging from very local descriptors to global volumes and from low-level measurements to high-level concepts.

In terms of spatio-temporal resolution, one extreme is to treat the whole video volume [1], [2] as a primitive. However, this holistic approach suffers from two limitations in real-world applications. First, it requires motion tracking and foreground segmentation, each of which is still a challenge in itself. Second, such a representation is sensitive to occlusion, noise, viewpoint and illumination. At the other extreme, the space-time interest point (STIP) [3] can serve as a primitive, but there is generally not enough information at the local level to form a representative feature. For instance, local models that feed STIPs into a "bag-of-features" (BOF) representation [4] describe an action as a collection of spatio-temporal interest points whose spatio-temporal relations are completely neglected. As a result, most researchers have turned to features at an intermediate scale: mid-level features.

The goal of this paper is to propose a mid-level representation and learning framework for action recognition that strikes a balance between local features and global templates. Our proposed mid-level feature is expected to have the following properties: 1) motion saliency: it captures the parts of the video with strong motion cues; 2) spatio-temporal representation: it models contextual structure in both the spatial and the temporal domain; 3) representativeness and discriminativeness: "representative" means it occurs frequently enough in action videos, and "discriminative" means it differs enough from the rest of the features.

However, even with the resolution of the primitive fixed, there is still a wide range of choices for obtaining mid-level primitives. Two classical routes are hand-design [5], [6], [7] and feature learning [8], [10]. On the one hand, some hand-designed mid-level features integrate contextual information into BOF to improve STIP-based representations. The drawback of such hand-designed relations is that they are representative but not discriminative. On the other hand, some feature learning algorithms employ discriminative part detectors as semantic features to describe and reason about actions. But such semantic, discriminative entities are not representative enough to act as good features. Moreover, there is a significant practical barrier: these supervised approaches require large amounts of hand-labeled training data for each semantic entity.

In this paper, we propose a simple, efficient and effective method to learn a mid-level distinctive feature (MLDF) that satisfies all three properties simultaneously. First, we estimate motion saliency using spatio-temporal interest points, since they often correspond to regions of high motion. To satisfy the second property, a spatial star-graph and a temporal star-graph are extracted to capture sufficient spatio-temporal information around these local interest points. Finally, we learn a representative and discriminative primitive from these three low-level features. To be representative, the standard solution is some form of unsupervised clustering, which latches onto frequently occurring primitives. However, clustering with a generic distance metric does not yield good clusters: it lacks discriminating ability and does not work well for data from heterogeneous sources. Hence, the key insight of this paper is to pose the task as unsupervised clustering with a discriminative similarity measure over multiple but homologous features. We apply the diffusion map framework [26], [31] to fuse these cues and enhance all low-level features by co-clustering them into a unified set of visual-word clusters called the mid-level distinctive feature. Diffusion distance is adopted as the similarity measure for three reasons. First, the diffusion map captures intrinsic geometric and semantic relationships between the different low-level features on the manifold. Second, a graph-based distance measure is a natural fit when clustering graph-based features. Third, the diffusion map provides a multi-scale analysis of the data, and all paths between two points contribute to the distance, making it more robust to noise than the geodesic (shortest-path) distance.
The underlying idea is to embed low-level features into a unified semantic space, so as to construct a compact yet discriminative mid-level vocabulary.
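As an illustration of such an embedding, a minimal diffusion map can be sketched as follows. This is not the paper's implementation: the function name, the Gaussian kernel width `sigma`, the diffusion time `t` and the number of retained coordinates are illustrative assumptions.

```python
import numpy as np

def diffusion_map(X, sigma=1.0, t=1, n_components=2):
    """Embed descriptors X (n_samples x n_dims) with a diffusion map.

    Illustrative sketch: sigma (kernel width) and t (diffusion time)
    are hypothetical defaults, not values from the paper.
    """
    # Pairwise squared Euclidean distances (clipped against round-off)
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    # Gaussian affinity matrix
    W = np.exp(-d2 / (2.0 * sigma**2))
    # Row-normalize to a Markov transition matrix P
    P = W / W.sum(axis=1, keepdims=True)
    # Eigendecomposition of P, sorted by descending eigenvalue
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)
    vals, vecs = vals.real[order], vecs.real[:, order]
    # Drop the trivial first eigenpair (eigenvalue 1, constant vector);
    # scale the remaining eigenvectors by lambda^t to get
    # diffusion coordinates, where Euclidean distance approximates
    # the diffusion distance at time t
    return vecs[:, 1:n_components + 1] * (vals[1:n_components + 1] ** t)
```

In this embedded space, points connected by many short high-affinity paths end up close together, which is why the Euclidean distance between diffusion coordinates is more noise-robust than a shortest-path distance.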

The proposed algorithm can model both single-person actions and multi-person activities, since mid-level features often yield representations with adequate discriminative power for more complex datasets. We conduct human action recognition experiments on four public datasets: KTH, UCF, HMDB51 and UT-Interaction. In our experiments, MLDF outperforms almost all related state-of-the-art methods by taking all three properties into account.

The rest of this paper is organized as follows: Section 2 reviews related work. Section 3 details how to construct the Mid-Level Distinctive Feature (MLDF) for video representation. Experiments and discussions are presented in Section 4. Finally, Section 5 concludes the paper.

Section snippets

Related work

Existing approaches that use mid-level features for action recognition can be coarsely grouped into three classes: part-based, spatio-temporal patch based (ST patch-based), and graph-based.

  • 1)

    Part-based. The concept of “part” has been successfully and widely used in object detection and even activity recognition, where these parts often do have a semantic connotation. This part-based method employs discriminative part detectors as semantic features to describe and reason about objects or actions.

The mid-level distinctive feature

In this section, we describe how to construct Mid-Level Distinctive Feature (MLDF) for video representation. In our work, we choose three types of low-level features: spatio-temporal interest point (STIP), spatial star-graph (SSG) and temporal star-graph (TSG). Once these features are extracted, diffusion map is adopted to discover the compatibilities among multi-cues and learn the semantic MLDF by embedding them into a common compact space. The main steps of our proposal are: 1) extracting
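Once every STIP, SSG and TSG descriptor has been embedded into the common diffusion space, the co-clustering step amounts to partitioning the embedded points into visual words and describing a video by its word histogram. The sketch below uses plain k-means (with farthest-point initialization) as a simplified stand-in for the paper's graph-partitioning step; all function names and parameters are illustrative assumptions.

```python
import numpy as np

def build_vocabulary(coords, k=4, n_iter=50):
    """Cluster diffusion-space coordinates into k visual words.

    Lloyd's k-means here is a simplified stand-in for the paper's
    graph-partitioning co-clustering; k and n_iter are illustrative.
    """
    # Farthest-point initialization: deterministic and well spread out
    centers = [coords[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(coords - c, axis=1) for c in centers], axis=0)
        centers.append(coords[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        # Assign each embedded feature to its nearest center
        d = np.linalg.norm(coords[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster empties
        for j in range(k):
            if (labels == j).any():
                centers[j] = coords[labels == j].mean(axis=0)
    return labels, centers

def mldf_histogram(labels, k):
    """Represent a video as a normalized histogram over mid-level words."""
    h = np.bincount(labels, minlength=k).astype(float)
    return h / h.sum()
```

The resulting histogram plays the role that the BOF histogram plays for low-level STIPs, but over the fused, discriminatively clustered mid-level vocabulary.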

Experiments and discussions

The MLDF can be used to classify both single-person actions and multi-person activities. We evaluate the performance of our proposal on four benchmark human activity datasets: the KTH action dataset [35], the UCF Sport dataset [36], the HMDB51 dataset [39] and the UT-Interaction dataset [6]. Firstly, we compare our action recognition scheme with other related recognition schemes. Secondly, we evaluate and analyze the parameters involved in MLDF learning. Finally, we mainly evaluate the

Conclusion

In this paper, we propose a novel and compact mid-level feature, the mid-level distinctive feature (MLDF), for human activity recognition. The MLDF shares the advantages of both ST patch-based and graph-based representations, satisfying the properties of motion saliency, spatio-temporal representation, representativeness and discrimination. We first extract MLDF candidates from the classical STIP, a spatial star-graph (SSG) and a temporal star-graph (TSG). We then use a diffusion map to embed them into a common

Acknowledgment

This work is supported by NSFC grants 61273274 and 61572064, PXM2016_014219_000025, 973 Program grant 2011CB302203, and National Key Technology R&D Program of China grant 2012BAH01F03.

Wanru Xu received the B.E. degree from Beijing Jiaotong University, China, in 2011, and she is currently pursuing the Ph.D. degree at Beijing Jiaotong University, China. Her research interests include pattern recognition, machine learning and computer vision. Her current research mainly focuses on human action recognition for video surveillance and content-based video retrieval.

References (43)

  • L. Shao et al.

    Human action segmentation and recognition via motion and shape analysis

    Pattern Recognit. Lett.

    (2012)
  • J. Liu et al.

    Learning semantic features for action recognition via diffusion maps

    Comput. Vis. Image Underst.

    (2012)
  • A.F. Bobick et al.

    The recognition of human movement using temporal templates

IEEE Trans. Pattern Anal. Mach. Intell.

    (2001)
  • I. Laptev

    On space-time interest points

    Int. J. Comput. Vis.

    (2005)
  • P. Dollár, V. Rabaud, G. Cottrell, et al., Behavior recognition via sparse spatio-temporal features, in: Proceedings of...
  • M. Bregonzio, S. Gong, T. Xiang, Recognising action as clouds of space-time interest points, in: Proceedings of...
  • M.S. Ryoo, J.K. Aggarwal, Spatio-temporal relationship match: video structure comparison for recognition of complex...
  • J. Sun, X. Wu, S. Yan, et al., Hierarchical spatio-temporal context modeling for action recognition, in: Proceedings of...
  • M.A. Sadeghi, A. Farhadi, Recognition using visual phrases, in: Proceedings of Computer Vision and Pattern Recognition...
  • L. Bourdev, J. Malik, Poselets: body part detectors trained using 3d human pose annotations, in: Proceedings of...
  • P.F. Felzenszwalb et al.

    Object detection with discriminatively trained part-based models

IEEE Trans. Pattern Anal. Mach. Intell.

    (2010)
  • S. Sadanand, J.J. Corso, Action bank: a high-level representation of activity in video, in: Proceedings of Computer...
  • M. Weber, M. Welling, P. Perona, Towards automatic discovery of object categories, in: Proceedings of Computer Vision...
  • Q.V. Le, W.Y. Zou, S.Y. Yeung, et al., Learning hierarchical invariant spatio-temporal features for action recognition...
  • L.M. Wang, Y. Qiao, X. Tang, Motionlets: mid-level 3D parts for human motion recognition, in: Proceedings of Computer...
  • A. Jain, A. Gupta, M. Rodriguez, et al., Representing videos using mid-level discriminative patches, in: Proceedings of...
  • Y. Tian, R. Sukthankar, M. Shah, Spatiotemporal deformable part models for action detection, in: Proceedings of...
  • S. Ma, J. Zhang, N. Ikizler-Cinbis, et al., Action recognition and localization by hierarchical space-time segments,...
  • C. Gan, N. Wang, Y. Yang, et al., DevNet: a deep event network for multimedia event detection and evidence recounting,...
  • G. Gkioxari, J. Malik, Finding action tubes, in: Proceedings of the IEEE Conference on Computer Vision and Pattern...
  • L. Wang, Y. Qiao, X. Tang, Action recognition with trajectory-pooled deep-convolutional descriptors, in: Proceedings of...

Zhenjiang Miao (M'11) received the B.E. degree from Tsinghua University, Beijing, China, in 1987, and the M.E. and Ph.D. degrees from Northern Jiaotong University, Beijing, in 1990 and 1994, respectively. From 1995 to 1998, he was a Post-Doctoral Fellow with the École Nationale Supérieure d′Electrotechnique, d′Electronique, d′Informatique, d′Hydraulique et des Télécommunications, Institut National Polytechnique de Toulouse, Toulouse, France, and was a Researcher with the Institut National de la Recherche Agronomique, Sophia Antipolis, France. From 1998 to 2004, he was with the Institute of Information Technology, National Research Council Canada, Nortel Networks, Ottawa, Canada. He joined Beijing Jiaotong University, Beijing, in 2004. He is currently a Professor, Director of the Media Computing Center, Beijing Jiaotong University, and Director of the Institute for Digital Culture Research, Center for Ethnic & Folk Literature & Art Development, Ministry of Culture, P.R. China. His current research interests include image and video processing, multimedia processing, and intelligent human–machine interaction.

Yi Tian received the B.E. degree from Beijing Jiaotong University, China, in 2011, and she is currently pursuing the Ph.D. degree at Beijing Jiaotong University, China. Her research interests include pattern recognition, machine learning and computer vision. Her current research mainly focuses on human action recognition for video surveillance and human identification.
