Actionness-pooled Deep-convolutional Descriptor for fine-grained action recognition
Introduction
Over the past decades, great efforts have been made on the general action recognition problem. Many successful representations have been proposed, including hand-crafted features such as Space Time Interest Points (STIPs) [1] and Improved Trajectories [2]. It has also become popular to extract action representations by leveraging deep learning techniques. Tran et al. [3] propose a generic spatio-temporal descriptor that performs 3D convolutions and pooling to preserve both the spatial and temporal information of the input signals. Simonyan and Zisserman design a two-stream architecture [4] consisting of RGB and optical flow streams to capture appearance and motion information, respectively. These ConvNet-based methods achieve state-of-the-art performance.
However, the actions in traditional datasets, such as KTH [5] and UCF101 [6], are mostly well-defined with significant appearance and motion differences, and are easy to distinguish compared with realistic actions, which are often vague and ambiguous. Specifically, the intra-class difference of fine-grained actions is likely to be very large due to variations in actors, backgrounds, etc., while the similarity between different actions substantially reduces the inter-class difference. Recognizing fine-grained actions therefore raises more challenges. First, an effective representation should be able to distinguish between actions that share similar appearance and motion styles but have subtle differences. Take the Mongolian dance and the Uygur dance shown in Fig. 1(a) as an example: the background scenes and costumes are very similar, the actors are performing very similar "spinning" actions, and only the poses and movements of the arms differ slightly. These tiny differences are very hard to capture with existing ConvNets designed for general action recognition, since they sample and pool globally over entire frames (see Fig. 1(b)). Second, the lack of training data is another barrier. Deep learning models work well only with sufficient and diverse training samples, but there are no large-scale datasets available for fine-grained action recognition; even the most widely used datasets for general action recognition are not large enough.
To circumvent the above problems encountered by current approaches, we propose a novel feature sampling and pooling method and a corresponding representation, the Actionness-pooled Deep-convolutional Descriptor (ADD), inspired by the human visual attention mechanism that has been widely used in fine-grained image classification [7]. In this work, we treat actionness maps as guidance for visual saliency and extract features from the more discriminative patches (see Fig. 1(c) and (d)). On the one hand, this endows the proposed descriptor with stronger sensitivity to subtle differences in local appearance and motion patterns between fine-grained actions. On the other hand, the actionness-constrained sampling and pooling serves as a feasible data augmentation strategy, since it builds a representation complementary to traditional end-to-end ConvNet representations.
We evaluate ADD on the HIT Dances dataset [8], which was originally constructed for dance video recommendation. The results demonstrate that ADD significantly outperforms the 3D convolutional network (C3D) representation [3] and the traditional two-stream model [9] by 6.9% and 5.8%, respectively, on the task of fine-grained dance recognition. Besides, through extensive experiments on two traditional benchmarks, JHMDB [10] and UCF101 [6], we show that combining different sampling and pooling strategies can further improve performance, which indicates the complementarity between ADD and general end-to-end ConvNets. Furthermore, we take advantage of ADD to derive segment-level representations of action videos and analyze the contributions of tiny action clips to fine-grained action classification.
In summary, the contributions of this paper are threefold:
- We propose an effective model for fine-grained action recognition built on a novel discriminative descriptor named ADD, which embeds the attention mechanism by aggregating features following actionness cues.
- Experiments demonstrate the superior performance of our method on fine-grained action recognition and the complementarity between ADD and traditional end-to-end ConvNet-based representations.
- We explore the sparsity characteristic of action data by reasoning about the temporal importance of clips within long-range actions for classification, and point out a potential direction for future action analysis tasks.
A preliminary version of this work was accepted by the IEEE International Conference on Multimedia and Expo (ICME 2018). In this paper, we improve on it mainly in two aspects: (1) several additional experiments have been conducted to support our argument that ADD captures information complementary to traditional end-to-end ConvNet representations and that combining them further improves recognition performance; (2) we carry out an extra exploratory experiment that takes advantage of our ADD representation. Specifically, we investigate the temporal importance of clips within long-range actions when recognizing an action. By this means, we reveal that only a small subset of the video content contributes to recognition, and we find that these action patterns are not only representative but also discriminative. This discovery can facilitate other computer vision tasks, such as action compression, summarization, and compact action representation.
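The clip-importance analysis can be made concrete with a small sketch. Below is a minimal Python illustration of one plausible scoring scheme, leave-one-segment-out confidence drop; the array shapes, the `predict_proba` interface, and the simple averaging of segment descriptors are our assumptions for illustration, not the exact procedure reported in the paper.

```python
import numpy as np

def segment_importance(segment_adds, classifier, true_label):
    """Score each segment's contribution to recognizing the true class.

    segment_adds: (T, D) array, one ADD per temporal segment.
    classifier:   any model exposing predict_proba(X) -> (N, num_classes),
                  e.g. a scikit-learn classifier trained on video-level ADDs.
    Returns a length-T array: confidence drop when a segment is left out.
    """
    video_desc = segment_adds.mean(axis=0)               # video-level descriptor
    base = classifier.predict_proba(video_desc[None])[0, true_label]
    drops = np.empty(len(segment_adds))
    for t in range(len(segment_adds)):
        rest = np.delete(segment_adds, t, axis=0).mean(axis=0)
        conf = classifier.predict_proba(rest[None])[0, true_label]
        drops[t] = base - conf        # larger drop => more important segment
    return drops
```

Under this scheme, segments with the largest confidence drops would correspond to the small subset of representative and discriminative clips mentioned above.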
The remainder of the paper is organized as follows. Section 2 reviews the related work. Section 3 introduces the proposed Actionness-pooled Deep-convolutional Descriptor (ADD). Sections 4 and 5 present the experimental settings and results, respectively. Finally, Section 6 concludes the paper.
Section snippets
Related work
Many excellent works have contributed to this topic. In this section, we review the related work from three aspects: action feature descriptors, sampling and pooling strategies used to generate feature representations, and fine-grained activity analysis methods.
Our approach
We seek to build a representation that is sensitive to subtle differences between actions. Motivated by the visual attention mechanism, we propose to sample and pool features from the more discriminative spatio-temporal sub-regions. To this end, we introduce a novel Actionness-pooled Deep-convolutional Descriptor (ADD), as outlined in Fig. 2. Specifically, we first generate a group of shift-invariant trajectories under the constraint of actionness maps and then conduct spatio-temporal pooling on the convolutional feature maps along these trajectories.
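As a rough, hedged illustration of actionness-guided pooling on a single frame, consider the NumPy sketch below; the function name, the top-k location selection, and the actionness weighting are our own simplifications (the full method pools along actionness-constrained trajectories rather than independent spatial locations).

```python
import numpy as np

def actionness_pooled_descriptor(feature_maps, actionness_map, top_k=64):
    """Pool conv features from the most action-salient locations.

    feature_maps:   (C, H, W) convolutional activations for one frame.
    actionness_map: (H, W) actionness scores, resized to the feature-map
                    resolution.
    Returns a C-dim descriptor pooled from the top_k most action-salient
    locations, weighted by their actionness.
    """
    C = feature_maps.shape[0]
    scores = actionness_map.ravel()             # (H*W,)
    feats = feature_maps.reshape(C, -1)         # (C, H*W)
    idx = np.argsort(scores)[::-1][:top_k]      # most action-salient positions
    weights = scores[idx] / (scores[idx].sum() + 1e-8)
    return feats[:, idx] @ weights              # (C,)
```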
Experimental settings
In this section, we first introduce the datasets and the evaluation protocol. Then we describe the implementation details of our method. Furthermore, we introduce an integrated action recognition model that effectively combines the benefits of our ADD and traditional end-to-end ConvNets.
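The snippet leaves the combination scheme unspecified; one standard option is weighted late fusion of per-class scores, sketched below under our own assumptions (the softmax normalization and the single fusion weight `alpha` are hypothetical choices, not necessarily those used in the paper).

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse_scores(add_scores, convnet_scores, alpha=0.5):
    """Weighted late fusion of per-class scores from the ADD-based
    classifier and an end-to-end ConvNet; alpha trades off the two."""
    return alpha * softmax(add_scores) + (1.0 - alpha) * softmax(convnet_scores)

# Predicted label for one video:
# label = fuse_scores(s_add, s_cnn).argmax()
```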
Experimental results
We quantitatively evaluate the fine-grained and general action recognition performance of ADD, traditional end-to-end ConvNets, and the integrated recognition models. Several additional experiments explore the effects of different experimental settings. We also design an experiment to explore how individual action clips contribute to the classification of actions.
Conclusion
In this work, we investigated the problem of fine-grained action recognition. Motivated by the visual attention mechanism, we proposed a new feature descriptor, named the Actionness-pooled Deep-convolutional Descriptor (ADD), which leverages actionness as a discrimination and importance cue to sample and pool features. ADD is capable of capturing subtle differences in local regions between actions that are similar in overall appearance and motion. Experiments on the HIT Dances dataset demonstrate that ADD significantly outperforms traditional end-to-end ConvNet representations on fine-grained dance recognition.
Declaration of Competing Interest
The authors declare that they have no competing interests.
Acknowledgments
This work was supported by the National Natural Science Foundation of China under Project Nos. 61772158, 61702136, U1711265 and 61472103.
References (74)
- H. Choi, K. Cho, Y. Bengio, Fine-grained attention mechanism for neural machine translation, Neurocomputing (2018)
- S. Saha, G. Singh, M. Sapienza, P.H. Torr, F. Cuzzolin, Deep learning for detecting multiple space-time action tubes in...
- I. Laptev, M. Marszalek, C. Schmid, B. Rozenfeld, Learning realistic human actions from movies, Proceedings of the 2008 IEEE Computer Vision and Pattern Recognition, CVPR (2008)
- H. Wang, C. Schmid, Action recognition with improved trajectories, Proceedings of the 2013 IEEE International Conference on Computer Vision, ICCV (2013)
- D. Tran, L. Bourdev, R. Fergus, L. Torresani, M. Paluri, Learning spatiotemporal features with 3D convolutional networks, Proceedings of the 2015 IEEE International Conference on Computer Vision, ICCV (2015)
- K. Simonyan, A. Zisserman, Two-stream convolutional networks for action recognition in videos, Proceedings of the 2014 Neural Information Processing Systems, NIPS (2014)
- C. Schuldt, I. Laptev, B. Caputo, Recognizing human actions: a local SVM approach, Proceedings of the 2004 IEEE International Conference on Pattern Recognition, ICPR (2004)
- K. Soomro, A.R. Zamir, M. Shah, UCF101: a dataset of 101 human actions classes from videos in the wild, (2012) arXiv:...
- Y. Peng, X. He, J. Zhao, Object-part attention model for fine-grained image classification, (2017) arXiv:...
- T. Han et al., Dancelets mining for video recommendation based on dance styles, IEEE Trans. Multimed. (2017)
- H. Jhuang, J. Gall, S. Zuffi, C. Schmid, M.J. Black, Towards understanding action recognition, Proceedings of the 2013 IEEE International Conference on Computer Vision, ICCV (2013)
- H. Wang, A. Kläser, C. Schmid, C.-L. Liu, Dense trajectories and motion boundary descriptors for action recognition, Int. J. Comput. Vis. (2013)
- Multi-task autoencoder model for recovering human poses, IEEE Trans. Ind. Electron.
- Multi-modal face pose estimation with multi-task manifold deep learning, IEEE Trans. Ind. Inf.
- Efficient pose-based action recognition, Proceedings of the 2014 Asian Conference on Computer Vision, ACCV (2014)
- p-Laplacian regularized sparse coding for human activity recognition, IEEE Trans. Ind. Electron.
- A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, Proceedings of the 2012 Neural Information Processing Systems, NIPS (2012)
- L. Wang et al., Temporal segment networks: towards good practices for deep action recognition, Proceedings of the 2016 European Conference on Computer Vision, ECCV (2016)
- Spatiotemporal pyramid network for video action recognition, Proceedings of the 2017 IEEE Computer Vision and Pattern Recognition, CVPR (2017)
- J. Carreira, A. Zisserman, Quo vadis, action recognition? A new model and the Kinetics dataset, Proceedings of the 2017 IEEE Computer Vision and Pattern Recognition, CVPR (2017)
- K. Schindler, L. Van Gool, Action snippets: how many frames does human action recognition require?, Proceedings of the 2008 IEEE Computer Vision and Pattern Recognition, CVPR (2008)
- G. Gkioxari, R. Girshick, J. Malik, Contextual action recognition with R*CNN, Proceedings of the 2015 IEEE International Conference on Computer Vision, ICCV (2015)
- L. Wang, Y. Qiao, X. Tang, Action recognition with trajectory-pooled deep-convolutional descriptors, Proceedings of the 2015 IEEE Computer Vision and Pattern Recognition, CVPR (2015)
- A key volume mining deep framework for action recognition, Proceedings of the 2016 IEEE Computer Vision and Pattern Recognition, CVPR (2016)
- R. Girdhar, D. Ramanan, Attentional pooling for action recognition, Proceedings of the 2017 Neural Information Processing Systems, NIPS (2017)
- Fine-grained image recognition via weakly supervised click data guided bilinear CNN model, Proceedings of the 2017 IEEE International Conference on Multimedia and Expo, ICME (2017)
- User-click-data-based fine-grained image recognition via weakly supervised metric learning, ACM Trans. Multimed. Comput. Commun. Appl.
- Annotation modification for fine-grained visual recognition, Neurocomputing
- Matryoshka peek: towards learning fine-grained, robust, discriminative features for product search, IEEE Trans. Multimed.
Tingting Han is currently a Ph.D. candidate at Harbin Institute of Technology, Harbin, China. She received the B.S. and M.S. degrees in computer science from the School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China, in 2012 and 2014, respectively. She studied at the University of Michigan, Ann Arbor, USA, as a visiting student from 2015 to 2016. Her research interests include computer vision, multimedia, and machine learning, especially focusing on the analysis of human activities in video understanding.
Hongxun Yao received the B.S. and M.S. degrees in computer science from the Harbin Shipbuilding Engineering Institute, Harbin, China, in 1987 and 1990, respectively, and the Ph.D. degree in computer science from Harbin Institute of Technology in 2003. She is currently a professor with the School of Computer Science and Technology, Harbin Institute of Technology. Her research interests include computer vision, pattern recognition, multimedia computing, and human-computer interaction technology. She has published 6 books and over 200 scientific papers, was honored as a New Century Excellent Talent in China, and is an expert enjoying special government allowances in Heilongjiang Province, China.
Xiaoshuai Sun received the B.S. degree in computer science from Harbin Engineering University in 2007, and the M.S. and Ph.D. degrees in computer science and technology from Harbin Institute of Technology in 2009 and 2015, respectively. He is currently a lecturer at Harbin Institute of Technology and a postdoctoral researcher at the University of Queensland. He was a research intern with Microsoft Research Asia (2012–2013) and a winner of the Microsoft Research Asia Fellowship in 2011. He holds 2 authorized patents and has authored over 60 refereed journal and conference papers in venues including IEEE Transactions on Image Processing, Pattern Recognition, ACM Multimedia, and IEEE CVPR.
Wenlong Xie is currently a Ph.D. candidate at Harbin Institute of Technology, Harbin, China. He received the B.S. and M.S. degrees in computer science from the School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China, in 2013 and 2015, respectively. His research interests include computer vision and machine learning, especially focusing on video understanding and analysis.
Sicheng Zhao received the Ph.D. degree from Harbin Institute of Technology in 2016. He is now a postdoctoral research fellow in the Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, USA. He was a research assistant at the School of Computing, National University of Singapore from 2013 to 2014, advised by Prof. Tat-Seng Chua and Prof. Yue Gao. His research interests include multimedia affective computing, social media analysis, and computer vision, especially focusing on image emotion analysis and related applications. His research works have been published in top-tier journals such as IEEE TAFFC, TMM, TCSVT, TITS, and ACM TOMM, and top-tier conferences such as ACM MM, AAAI, and CVPR.
Wei Yu received the B.S. and M.S. degrees from Harbin Institute of Technology, Harbin, China, in 2009 and 2012, respectively, and the Ph.D. degree in computer science and technology from Harbin Institute of Technology, Harbin, China, in 2017. He was a research intern with the Web Search and Mining Group, Microsoft Research, Beijing, China, from 2013 to 2015. His research interests include computer vision, multimedia, and machine learning.