Action recognition with motion map 3D network
Introduction
Human action recognition aims to automatically classify the action performed in a video. It is a fundamental topic in computer vision with many important applications, such as video surveillance and video retrieval. As revealed by [1], the quality of action representations strongly influences recognition performance, so learning a powerful and compact representation of an action is a central issue in action recognition.
In recent years, many approaches have been proposed to learn deep features of videos for action recognition. The slow fusion method [2] extends the connectivity of the network along the temporal dimension to learn video features. In [3], a two-stream network is proposed to learn spatio-temporal features by processing optical flow and the original images simultaneously. The C3D method [4] exploits 3-dimensional convolution kernels to directly extend the convolution operation from a single image to a frame sequence. However, these methods can only learn features of fixed-length video clips. Since the lengths of real videos vary, these existing works must resort to pooling methods [4] or feature aggregation methods [5], [6] to generate a final representation of the entire action video.
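To make the idea behind 3D convolution concrete, the sketch below shows how a single Conv3d layer spans time as well as space. The 16-frame 112×112 clip shape follows the C3D setting [4], but the single layer shown here is only illustrative (PyTorch is assumed):

```python
import torch
import torch.nn as nn

# A single 3D convolution in the style of C3D [4]: the kernel spans
# time as well as space, so one layer mixes information across frames.
conv3d = nn.Conv3d(in_channels=3, out_channels=64,
                   kernel_size=(3, 3, 3), padding=1)

# A clip of 16 RGB frames at 112x112 (the input size used by C3D).
clip = torch.randn(1, 3, 16, 112, 112)  # (batch, channels, time, H, W)

out = conv3d(clip)
print(out.shape)  # torch.Size([1, 64, 16, 112, 112]); the temporal axis is preserved
```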
To address the problem of representing variable-length videos, Bilen et al. [7] proposed the dynamic image, learned by a dynamic image network, to represent an action video; it takes the temporal order of video frames as the supervisory signal without considering the category information of actions. Consequently, the dynamic image cannot capture the discriminative information of videos, which degrades recognition accuracy.
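For intuition, a dynamic image can be approximated by a fixed weighted sum over the frames ("approximate rank pooling"). The sketch below uses the closed-form frame weights reported by Bilen et al. [7]; the published dynamic image network learns this pooling end-to-end, so treat this as an approximation rather than the exact method:

```python
import numpy as np

def approximate_dynamic_image(frames):
    """Collapse a clip into a single map by approximate rank pooling,
    in the spirit of Bilen et al. [7] (a sketch, not the learned network)."""
    T = len(frames)
    # Harmonic numbers H_t = 1 + 1/2 + ... + 1/t, with H_0 = 0.
    H = np.concatenate([[0.0], np.cumsum(1.0 / np.arange(1, T + 1))])
    # Closed-form frame weights of approximate rank pooling:
    # alpha_t = 2(T - t + 1) - (T + 1)(H_T - H_{t-1}).
    alpha = np.array([2 * (T - t + 1) - (T + 1) * (H[T] - H[t - 1])
                      for t in range(1, T + 1)])
    return np.tensordot(alpha, np.stack(frames), axes=1)

# Example: 16 random grayscale frames of size 112x112.
clip = [np.random.rand(112, 112) for _ in range(16)]
di = approximate_dynamic_image(clip)  # a single 112x112 map
```

Early frames receive negative weights and late frames positive ones, so the pooled map emphasizes how appearance evolves over time rather than any single frame.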
In this paper, we propose a novel Motion Map 3D ConvNet (MM3D) to learn a motion map for representing an action video clip. By removing a large amount of redundant information from an action video, the motion map provides a powerful, compact and discriminative representation of the video. As shown in Fig. 1, the motion maps learned by our MM3D model can capture distinguishable trajectories around the human body.
The proposed MM3D model consists of two networks: a generation network and a discrimination network. The framework of our MM3D is illustrated in Fig. 2. The generation network learns the motion maps of variable-length video clips by integrating the temporal information into a single map without losing the discriminative information of the clips. Specifically, it integrates the motion map of the previous frames with the current frame to generate a new motion map. By repeating this integration over all frames, a final motion map that captures the motion information of the entire video clip is generated. Moreover, the action class labels are used as the supervisory information to train the generation network, so the learned motion map also exploits the discriminative information of action videos.
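The recurrence described above can be sketched as follows. The two-layer fusion block is a placeholder of our own choosing rather than the actual architecture of the generation network, and the classification head that supplies the label supervision is omitted:

```python
import torch
import torch.nn as nn

class MotionMapGenerator(nn.Module):
    """Minimal sketch of the generation network's recurrence: the running
    motion map and the current frame are fused into a new motion map.
    The fusion layers here are illustrative placeholders."""
    def __init__(self):
        super().__init__()
        # Fuse the 3-channel motion map with the 3-channel current frame.
        self.fuse = nn.Sequential(
            nn.Conv2d(6, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, kernel_size=3, padding=1),
        )

    def forward(self, frames):
        # frames: (T, 3, H, W); initialize the motion map with frame 0.
        motion_map = frames[0]
        for frame in frames[1:]:
            x = torch.cat([motion_map, frame], dim=0).unsqueeze(0)
            motion_map = self.fuse(x).squeeze(0)
        return motion_map  # one 3xHxW map summarizing the whole clip

# During training, a classifier head on the final motion map supplies
# the action-label supervision described above (not shown here).
```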
Despite the good performance of the motion map in capturing the local temporal information of video clips, a single motion map is not sufficient to capture the complex dynamics of an entire action. To learn the long-term dynamics of the whole video and demonstrate the power of the motion maps for action recognition, a discrimination network is proposed. The architecture of this network is based on the 3D-CNN model [4], which has shown strong performance on action recognition. The input of the discrimination network is a sequence of motion maps, and the discriminative action feature based on the motion maps is extracted from the pool5 layer of the network.
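A hedged sketch of such a discrimination network is given below; the layer sizes and the adaptive pooling standing in for pool5 are illustrative stand-ins, not the exact C3D configuration:

```python
import torch
import torch.nn as nn

class DiscriminationNet(nn.Module):
    """C3D-style [4] sketch of the discrimination network: it takes a
    sequence of motion maps and exposes the last pooling activations
    (standing in for pool5) as the action feature. Sizes are illustrative."""
    def __init__(self, num_classes=101):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(64, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(128, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool3d((1, 4, 4)),  # stands in for pool5
        )
        self.classifier = nn.Linear(256 * 16, num_classes)

    def forward(self, motion_maps):
        # motion_maps: (batch, 3, T, H, W), a sequence of motion maps.
        feat = self.features(motion_maps).flatten(1)  # pooled action feature
        return feat, self.classifier(feat)

# Example: a batch of 2 sequences of 8 motion maps at 112x112.
feat, logits = DiscriminationNet()(torch.randn(2, 3, 8, 112, 112))
```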
The contributions of this work are two-fold. (1) We propose a new network to generate motion maps for action recognition in videos. The generated motion maps contain both the temporal information and the discriminative information of an action video of arbitrary length. (2) We propose a discrimination network based on the motion maps to handle complex and long-term action videos. The network learns discriminative features from the sequence of motion maps, which boosts the accuracy of action recognition.
This paper is an extended version of our prior conference publication [8]. The main differences are as follows. (1) This paper proposes a novel discrimination network that takes a sequence of motion maps as input. The discrimination network can learn the long-term dynamics of the whole video and demonstrate the power of the motion maps for action recognition. (2) To show the effect of our discrimination network, an extended experiment compares the single-image-per-video setting with the sequence-of-images-per-video setting. The results show that the discrimination network improves the accuracy of action recognition for our motion map by up to 17.4% on UCF101 and 18.7% on HMDB51. A further experiment compares our method using the discrimination network with the state-of-the-art methods. (3) This paper gives a more extensive overview and comparison of the related literature.
Related work
Action recognition has been studied by computer vision researchers for decades. Various methods have been proposed to address this problem, most of which concern action representations. These representations can be broadly grouped into two categories: hand-crafted features and deep learning-based features.
Hand-crafted features: Since a video can be regarded as a stream of video frames, many video representations are derived from the image domain. Laptev and Lindeberg [9] proposed space-time interest points, which extend interest point detection from the image plane to the spatio-temporal domain.
Method
In this section, we first introduce the concept of the motion map, which is used to represent video clips. Then, we present the architecture of the proposed network for learning motion maps. Finally, we explain in detail the training and prediction procedures of our network.
Experiments
In this section, we validate the proposed network architecture on two standard action classification benchmarks, i.e., the UCF101 and HMDB51 datasets. Our method is first compared with two baseline methods, i.e., the single frame method and the dynamic image method. Then, we compare our method with the state-of-the-art methods on the two datasets.
Conclusion
In this paper, we have introduced the concept of a motion map. A motion map is a powerful representation of a video of arbitrary length that contains both static and dynamic information. We have also proposed a Motion Map 3D ConvNet, which generates a motion map for a video clip, and an iterative training method that integrates the discriminative information into a single motion map.
In the future, we would like to extend our method to other tasks, such as temporal action localization, where the action duration of an untrimmed video also needs to be determined.
Acknowledgments
This work was supported in part by the Natural Science Foundation of China (NSFC) under Grants 61673062 and 61472038.
References (33)
- D. Weinland, R. Ronfard, E. Boyer, A survey of vision-based methods for action representation, segmentation and recognition, Comput. Vis. Image Underst. (2011)
- A. Karpathy et al., Large-scale video classification with convolutional neural networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014)
- K. Simonyan, A. Zisserman, Two-stream convolutional networks for action recognition in videos, Advances in Neural Information Processing Systems (2014)
- D. Tran et al., Learning spatiotemporal features with 3D convolutional networks, Proceedings of the IEEE International Conference on Computer Vision (2015)
- H. Jégou et al., Aggregating local descriptors into a compact image representation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2010)
- F. Perronnin, C. Dance, Fisher kernels on visual vocabularies for image categorization, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2007)
- H. Bilen et al., Dynamic image networks for action recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
- Y. Sun et al., Representing discrimination of video by a motion map, Proceedings of the Pacific-Rim Conference on Multimedia (2017)
- I. Laptev, On space-time interest points, Int. J. Comput. Vis. (2005)
- P. Scovanner et al., A 3-dimensional SIFT descriptor and its application to action recognition, Proceedings of the 15th ACM International Conference on Multimedia (2007)
- A. Kläser et al., A spatio-temporal descriptor based on 3D-gradients, Proceedings of the 19th British Machine Vision Conference (2008)
- H. Wang, C. Schmid, Action recognition with improved trajectories, Proceedings of the IEEE International Conference on Computer Vision (2013)
- S. Ali, M. Shah, Human action recognition in videos using kinematic features and multiple instance learning, IEEE Trans. Pattern Anal. Mach. Intell. (2010)
- V. Kellokumpu et al., Human activity recognition using a dynamic texture based method, Proceedings of the British Machine Vision Conference (2008)
- B. Fernando et al., Rank pooling for action recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2017)
- A. Krizhevsky et al., ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems (2012)
Yuchao Sun received the B.S. degree from the Beijing Institute of Technology, Beijing, China, in 2016. He is currently pursuing the M.S. degree at Beijing Laboratory of Intelligent Information Technology, School of Computer Science, Beijing Institute of Technology. His research interests include action recognition and computer vision.
Xinxiao Wu received the B.S. degree from the Nanjing University of Information Science and Technology, in 2005, and the Ph.D. degree from the Beijing Institute of Technology, China, in 2010. She is currently an Associate Professor with the Beijing Institute of Technology. Her research interests include machine learning, computer vision and image/video content analysis.
Wennan Yu received the B.S. degree from the Beijing Institute of Technology (BIT), Beijing, China, in 2015. He is currently pursuing the M.S. degree at Beijing Laboratory of Intelligent Information Technology, School of Computer Science, Beijing Institute of Technology, under the supervision of Associate Professor Xinxiao Wu. His research interests include computer vision and machine learning.
Feiwu Yu received the B.S. degree from the Beijing Institute of Technology, Beijing, China, in 2016. She is currently pursuing the M.S. degree at Beijing Laboratory of Intelligent Information Technology, School of Computer Science, Beijing Institute of Technology. Her research interests include action recognition and transfer learning.