Action recognition with motion map 3D network
Introduction
Human action recognition aims to automatically classify the action performed in a video. It is a fundamental topic in computer vision with many important applications, such as video surveillance and video retrieval. As revealed by [1], the quality of action representations strongly influences recognition performance, so learning a powerful and compact representation of an action is a central issue in action recognition.
In recent years, many approaches have been proposed to learn deep features of videos for action recognition. The slow fusion method [2] extends the connectivity of the network along the temporal dimension to learn video features. In [3], a two-stream network is proposed to learn spatio-temporal features by processing optical flow and the original images simultaneously. The C3D method [4] exploits 3-dimensional convolution kernels to directly extend the convolution operation from a single image to a frame sequence. However, these methods can only learn features of fixed-length video clips. Since the lengths of real videos vary, these existing works must resort to pooling methods [4] or feature aggregation methods [5], [6] to generate a final representation of the entire action video.
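To make the idea behind 3D convolution concrete, the sketch below shows how a single Conv3d layer spans time as well as space. The 16-frame 112×112 clip shape follows the C3D setting [4], but the single layer shown here is only illustrative (PyTorch is assumed):

```python
import torch
import torch.nn as nn

# A single 3D convolution in the style of C3D [4]: the kernel spans
# time as well as space, so one layer mixes information across frames.
conv3d = nn.Conv3d(in_channels=3, out_channels=64,
                   kernel_size=(3, 3, 3), padding=1)

# A clip of 16 RGB frames at 112x112 (the input size used by C3D).
clip = torch.randn(1, 3, 16, 112, 112)  # (batch, channels, time, H, W)

out = conv3d(clip)
print(out.shape)  # torch.Size([1, 64, 16, 112, 112]); the temporal axis is preserved
```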
To address the problem of representing variable-length videos, Bilen et al. [7] proposed the dynamic image, learned by a dynamic image network, to represent an action video; it takes the temporal order of video frames as the supervisory signal without considering the category information of actions. Consequently, the dynamic image cannot capture the discriminative information of videos, which degrades recognition accuracy.
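For intuition, a dynamic image can be approximated by a fixed weighted sum over the frames ("approximate rank pooling"). The sketch below uses the closed-form frame weights reported by Bilen et al. [7]; the published dynamic image network learns this pooling end-to-end, so treat this as an approximation rather than the exact method:

```python
import numpy as np

def approximate_dynamic_image(frames):
    """Collapse a clip into a single map by approximate rank pooling,
    in the spirit of Bilen et al. [7] (a sketch, not the learned network)."""
    T = len(frames)
    # Harmonic numbers H_t = 1 + 1/2 + ... + 1/t, with H_0 = 0.
    H = np.concatenate([[0.0], np.cumsum(1.0 / np.arange(1, T + 1))])
    # Closed-form frame weights of approximate rank pooling:
    # alpha_t = 2(T - t + 1) - (T + 1)(H_T - H_{t-1}).
    alpha = np.array([2 * (T - t + 1) - (T + 1) * (H[T] - H[t - 1])
                      for t in range(1, T + 1)])
    return np.tensordot(alpha, np.stack(frames), axes=1)

# Example: 16 random grayscale frames of size 112x112.
clip = [np.random.rand(112, 112) for _ in range(16)]
di = approximate_dynamic_image(clip)  # a single 112x112 map
```

Early frames receive negative weights and late frames positive ones, so the pooled map emphasizes how appearance evolves over time rather than any single frame.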
In this paper, we propose a novel Motion Map 3D ConvNet (MM3D) to learn a motion map for representing an action video clip. By removing a large amount of redundant information from an action video, the motion map provides a powerful, compact and discriminative representation of the video. As shown in Fig. 1, the motion maps learned by our MM3D model can capture distinguishable trajectories around the human body.
The proposed MM3D model consists of two networks: a generation network and a discrimination network. The framework of our MM3D is illustrated in Fig. 2. The generation network learns the motion maps of variable-length video clips by integrating the temporal information into a single map without losing the discriminative information of the clips. Specifically, it integrates the motion map of the previous frames with the current frame to generate a new motion map. By repeating this integration over all frames, a final motion map that captures the motion information of the entire video clip is generated. Moreover, the action class labels are used as the supervisory information to train the generation network, so the learned motion map also exploits the discriminative information of action videos.
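The recurrence described above can be sketched as follows. The two-layer fusion block is a placeholder of our own choosing rather than the actual architecture of the generation network, and the classification head that supplies the label supervision is omitted:

```python
import torch
import torch.nn as nn

class MotionMapGenerator(nn.Module):
    """Minimal sketch of the generation network's recurrence: the running
    motion map and the current frame are fused into a new motion map.
    The fusion layers here are illustrative placeholders."""
    def __init__(self):
        super().__init__()
        # Fuse the 3-channel motion map with the 3-channel current frame.
        self.fuse = nn.Sequential(
            nn.Conv2d(6, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, kernel_size=3, padding=1),
        )

    def forward(self, frames):
        # frames: (T, 3, H, W); initialize the motion map with frame 0.
        motion_map = frames[0]
        for frame in frames[1:]:
            x = torch.cat([motion_map, frame], dim=0).unsqueeze(0)
            motion_map = self.fuse(x).squeeze(0)
        return motion_map  # one 3xHxW map summarizing the whole clip

# During training, a classifier head on the final motion map supplies
# the action-label supervision described above (not shown here).
```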
Despite the good performance of the motion map in capturing the local temporal information of video clips, a single motion map is not sufficient to capture the complex dynamics of an entire action. To learn the long-term dynamics of the whole video and demonstrate the power of the motion maps for action recognition, a discrimination network is proposed. The architecture of this network is based on the 3D-CNN model [4], which has shown strong performance on action recognition. The input of the discrimination network is a sequence of motion maps, and the discriminative action feature based on the motion maps is extracted from the pool5 layer of the network.
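A hedged sketch of such a discrimination network is given below; the layer sizes and the adaptive pooling standing in for pool5 are illustrative stand-ins, not the exact C3D configuration:

```python
import torch
import torch.nn as nn

class DiscriminationNet(nn.Module):
    """C3D-style [4] sketch of the discrimination network: it takes a
    sequence of motion maps and exposes the last pooling activations
    (standing in for pool5) as the action feature. Sizes are illustrative."""
    def __init__(self, num_classes=101):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(64, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(128, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool3d((1, 4, 4)),  # stands in for pool5
        )
        self.classifier = nn.Linear(256 * 16, num_classes)

    def forward(self, motion_maps):
        # motion_maps: (batch, 3, T, H, W), a sequence of motion maps.
        feat = self.features(motion_maps).flatten(1)  # pooled action feature
        return feat, self.classifier(feat)

# Example: a batch of 2 sequences of 8 motion maps at 112x112.
feat, logits = DiscriminationNet()(torch.randn(2, 3, 8, 112, 112))
```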
The contributions of this work are two-fold. (1) We propose a new network to generate motion maps for action recognition in videos. The generated motion maps contain both the temporal information and the discriminative information of an action video of arbitrary length. (2) We propose a discrimination network based on the motion maps to handle complex and long-term action videos. The network learns discriminative features from the sequence of motion maps, which boosts the accuracy of action recognition.
This paper is an extended version of our prior conference publication [8]. The main differences are as follows. (1) This paper proposes a novel discrimination network that takes a sequence of motion maps as input. The discrimination network can learn the long-term dynamics of the whole video and demonstrate the power of the motion maps for action recognition. (2) To show the effect of our discrimination network, an extended experiment compares the single-image-per-video setting with the sequence-of-images-per-video setting. The results show that the discrimination network improves the accuracy of action recognition for our motion map by up to 17.4% on UCF101 and 18.7% on HMDB51. A further experiment compares our method using the discrimination network with the state-of-the-art methods. (3) This paper gives a more extensive overview and comparison of the related literature.
Related work
Action recognition has been studied by computer vision researchers for decades. Various methods have been proposed to address this problem, most of which concern action representations. These representations can be broadly grouped into two categories: hand-crafted features and deep learning-based features.
Hand-crafted features: Since a video can be regarded as a stream of video frames, many video representations are derived from the image domain. Laptev and Lindeberg [9] proposed space-time interest points, which extend interest point detection from the image plane to the spatio-temporal domain.
Method
In this section, we first introduce the concept of the motion map, which is used to represent video clips. Then, we present the architecture of the proposed network for learning motion maps. Finally, we explain in detail the training and prediction procedures of our network.
Experiments
In this section, we validate the proposed network architecture on two standard action classification benchmarks, i.e., the UCF101 and HMDB51 datasets. Our method is first compared with two baseline methods, i.e., the single frame method and the dynamic image method. Then, we compare our method with the state-of-the-art methods on the two datasets.
Conclusion
In this paper, we have introduced the concept of a motion map. A motion map is a powerful representation of a video of arbitrary length that contains both static and dynamic information. We have also proposed a Motion Map 3D ConvNet, which generates a motion map for a video clip, and an iterative training method that integrates the discriminative information into a single motion map.
In the future, we would like to extend our method to other tasks, such as temporal action localization, where the action duration of an untrimmed video also needs to be determined.
Acknowledgments
This work was supported in part by the Natural Science Foundation of China (NSFC) under Grants 61673062 and 61472038.
References (33)
- D. Weinland, R. Ronfard, E. Boyer, A survey of vision-based methods for action representation, segmentation and recognition, Comput. Vis. Image Underst. (2011)
- A. Karpathy et al., Large-scale video classification with convolutional neural networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014)
- K. Simonyan, A. Zisserman, Two-stream convolutional networks for action recognition in videos, Advances in Neural Information Processing Systems (2014)
- D. Tran et al., Learning spatiotemporal features with 3D convolutional networks, Proceedings of the IEEE International Conference on Computer Vision (2015)
- H. Jégou et al., Aggregating local descriptors into a compact image representation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2010)
- F. Perronnin, C. Dance, Fisher kernels on visual vocabularies for image categorization, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2007)
- H. Bilen et al., Dynamic image networks for action recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
- Y. Sun et al., Representing discrimination of video by a motion map, Proceedings of the Pacific-Rim Conference on Multimedia (2017)
- I. Laptev, On space-time interest points, Int. J. Comput. Vis. (2005)
- P. Scovanner et al., A 3-dimensional SIFT descriptor and its application to action recognition, Proceedings of the 15th ACM International Conference on Multimedia (2007)
- A. Kläser et al., A spatio-temporal descriptor based on 3D-gradients, Proceedings of the 19th British Machine Vision Conference (2008)
- H. Wang, C. Schmid, Action recognition with improved trajectories, Proceedings of the IEEE International Conference on Computer Vision (2013)
- S. Ali, M. Shah, Human action recognition in videos using kinematic features and multiple instance learning, IEEE Trans. Pattern Anal. Mach. Intell. (2010)
- V. Kellokumpu et al., Human activity recognition using a dynamic texture based method, Proceedings of the British Machine Vision Conference (2008)
- B. Fernando et al., Rank pooling for action recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2017)
- A. Krizhevsky et al., ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems (2012)
Yuchao Sun received the B.S. degree from the Beijing Institute of Technology, Beijing, China, in 2016. He is currently pursuing the M.S. degree at Beijing Laboratory of Intelligent Information Technology, School of Computer Science, Beijing Institute of Technology. His research interests include action recognition and computer vision.
Xinxiao Wu received the B.S. degree from the Nanjing University of Information Science and Technology, in 2005, and the Ph.D. degree from the Beijing Institute of Technology, China, in 2010. She is currently an Associate Professor with the Beijing Institute of Technology. Her research interests include machine learning, computer vision and image/video content analysis.
Wennan Yu received the B.S. degree from the Beijing Institute of Technology (BIT), Beijing, China, in 2015. He is currently pursuing the M.S. degree at Beijing Laboratory of Intelligent Information Technology, School of Computer Science, Beijing Institute of Technology, under the supervision of Associate Professor Xinxiao Wu. His research interests include computer vision and machine learning.
Feiwu Yu received the B.S. degree from the Beijing Institute of Technology, Beijing, China, in 2016. She is currently pursuing the M.S. degree at Beijing Laboratory of Intelligent Information Technology, School of Computer Science, Beijing Institute of Technology. Her research interests include action recognition and transfer learning.