
Neurocomputing

Volume 398, 20 July 2020, Pages 442-452

Actionness-pooled Deep-convolutional Descriptor for fine-grained action recognition

https://doi.org/10.1016/j.neucom.2019.03.099

Abstract

Recognition of general actions has witnessed great success in recent years. However, existing general action representations do not work well for recognizing fine-grained actions, which usually share high similarities in both appearance and motion pattern. To address this problem, we introduce the visual attention mechanism into the proposed descriptor, termed Actionness-pooled Deep-convolutional Descriptor (ADD). Instead of pooling features uniformly from the entire video, we aggregate features in sub-regions that are more likely to contain actions according to actionness maps. This endows ADD with a superior capability to capture the subtle differences between fine-grained actions. We conduct experiments on the HIT Dances dataset, one of the few existing datasets for fine-grained action analysis. Quantitative results demonstrate that ADD remarkably outperforms traditional CNN-based representations. Extensive experiments on two general action benchmarks, JHMDB and UCF101, further show that combining ADD with an end-to-end ConvNet can boost the recognition performance. In addition, taking advantage of ADD, we reveal the sparsity characteristic of actions and point out a potential direction for designing more effective action analysis models by extracting both representative and discriminative action patterns.

Introduction

In the past decades, great efforts have been made on the general action recognition problem. Many successful representations have been proposed, including hand-crafted features such as Space Time Interest Points (STIPs) [1] and Improved Trajectories [2]. It has also become popular to extract action representations by leveraging deep learning techniques. Tran et al. [3] propose a generic spatio-temporal descriptor by performing 3D convolutions and pooling to preserve both spatial and temporal information of the input signals. Simonyan and Zisserman design a two-stream architecture [4] consisting of RGB and optical flow streams to capture appearance and motion information, respectively. These ConvNet-based methods achieve state-of-the-art performance.

However, the actions in traditional datasets, such as KTH [5] and UCF101 [6], are mostly well defined, with significant appearance and motion differences that make them easy to distinguish, compared with realistic actions that are often vague and uncertain. Specifically, the intra-class difference of fine-grained actions is likely to be very large due to variations of actors, backgrounds, etc. In contrast, the similarity between different actions substantially reduces the inter-class difference. Recognizing fine-grained actions therefore raises more challenges. First, an effective representation should be able to distinguish between actions that share similar appearance and motion styles but have subtle differences. Take the Mongolian dance and Uygur dance shown in Fig. 1(a) as an example: the background scenes and costumes are very similar, and the actors are performing very similar “Spinning” actions; only the poses and movements of the arms differ slightly. These tiny differences are very hard to capture with existing ConvNets designed for general action recognition, since they adopt global sampling and pooling on entire frames (see Fig. 1(b)). Second, the lack of training data is another barrier. Deep learning models work well only with sufficient and diverse training samples, yet there are no large-scale datasets available for fine-grained action recognition, and even the most widely used datasets for general action recognition are not large enough.

In order to circumvent the above-mentioned problems encountered with current approaches, we propose a novel feature sampling and pooling method and a novel representation, namely Actionness-pooled Deep-convolutional Descriptor (ADD), inspired by the human visual attention mechanism that has been widely used in fine-grained image classification [7]. In this work, we treat actionness maps as guidance for visual saliency and extract features from more discriminative patches (see Fig. 1(c) and (d)). On the one hand, this endows the proposed feature descriptor with stronger sensitivity to subtle differences in local appearance and motion patterns between fine-grained actions. On the other hand, the proposed actionness-constrained sampling and pooling serves as a feasible data augmentation strategy, since it builds a representation complementary to traditional end-to-end ConvNet representations.
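To make the actionness-guided pooling idea concrete, the following minimal NumPy sketch shows how convolutional feature maps could be aggregated with weights taken from an actionness map instead of uniform global pooling. It is an illustration only; the tensor shapes, variable names, and the simple weighted average are illustrative assumptions rather than the exact implementation described in Section 3.

```python
# Minimal sketch of actionness-weighted pooling (illustrative, not the authors'
# exact implementation): conv feature maps are aggregated with weights taken
# from an actionness map instead of uniform global average pooling.
import numpy as np

def actionness_pooling(feature_maps, actionness, eps=1e-8):
    """feature_maps: (C, H, W) conv activations for one frame.
    actionness: (H, W) saliency-like map, higher = more likely to contain action."""
    weights = actionness / (actionness.sum() + eps)               # normalize to a spatial distribution
    return (feature_maps * weights[None, :, :]).sum(axis=(1, 2))  # (C,) frame-level descriptor

# Toy usage with random data standing in for real CNN features.
feats = np.random.rand(512, 14, 14).astype(np.float32)
act = np.random.rand(14, 14).astype(np.float32)
frame_descriptor = actionness_pooling(feats, act)                 # 512-d vector
```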

We evaluate ADD on the HIT Dances dataset [8], which was originally constructed for dance video recommendation. The results demonstrate that ADD significantly outperforms the 3D convolutional network (C3D) representation [3] and the traditional two-stream model [9] by 6.9% and 5.8%, respectively, on the task of fine-grained dance recognition. Besides, extensive experiments on two traditional benchmarks, JHMDB [10] and UCF101 [6], show that combining different sampling and pooling strategies can further improve the performance, which indicates the complementary property between ADD and general end-to-end ConvNets. Furthermore, we take advantage of ADD to derive segment-level representations of action videos and analyze the contributions of short action clips to fine-grained action classification.

In summary, the contributions of this paper are threefold:

  • We propose an effective model for fine-grained action recognition which integrates a novel discriminative descriptor named ADD. We embed the attention mechanism by aggregating features following the actionness cues.

  • Experiments demonstrate the superior performance of our method on fine-grained action recognition and the complementary properties between ADD and traditional end-to-end ConvNets-based representations.

  • We explore the sparsity characteristic of action data by analyzing the temporal importance of long-range action segments for classification and point out a potential direction to promote future action analysis tasks.

A preliminary version of this work was accepted by the IEEE International Conference on Multimedia and Expo (ICME 2018). In this paper, we extend it mainly in two aspects: (1) several additional experiments have been conducted to support our argument that ADD captures information complementary to traditional end-to-end ConvNet representations and that combining them further improves the recognition performance; (2) we carry out an additional exploration experiment taking advantage of our ADD representation. Specifically, we investigate the temporal importance of long-range actions when recognizing an action. In this way, we reveal that only a small subset of the video content contributes to the recognition, and we find that these action patterns are not only representative but also discriminative. This discovery can facilitate other computer vision tasks, such as action compression, summarization, and compact action representation.
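As a rough, hypothetical sketch of the kind of segment-level analysis described above, the snippet below scores each short clip with a linear classifier and checks how few clips are needed to recover the video-level prediction; all names, shapes, and the linear scorer are assumptions for illustration, not the paper's exact procedure.

```python
# Hypothetical segment-level contribution analysis: rank clips by how strongly
# they support the true class, then predict using only the top-k clips.
import numpy as np

def segment_contributions(segment_descriptors, weight, bias, true_label):
    """segment_descriptors: (T, D), one ADD-style vector per clip.
    weight: (K, D), bias: (K,) of a linear classifier over K classes."""
    scores = segment_descriptors @ weight.T + bias   # (T, K) per-clip class scores
    contrib = scores[:, true_label]                  # contribution to the true class
    order = np.argsort(-contrib)                     # most informative clips first
    return order, contrib

def top_k_prediction(segment_descriptors, weight, bias, order, k):
    """Predict using only the k highest-contributing clips (average pooling)."""
    pooled = segment_descriptors[order[:k]].mean(axis=0)
    return int(np.argmax(pooled @ weight.T + bias))
```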

The remainder of the paper is organized as follows. Section 2 reviews the related work. Section 3 introduces the proposed Actionness-pooled Deep-convolutional Descriptor (ADD). Sections 4 and 5 present the experimental settings and experimental results, respectively. Finally, Section 6 concludes the paper.

Section snippets

Related work

There are many excellent works contributing to this topic. In this section, we review the related work from the following three aspects: action feature descriptors, strategies of sampling and pooling utilized to generate feature representations, and fine-grained activity analysis methods.

Our approach

We seek to build a representation that is sensitive to subtle differences between actions. Motivated by the visual attention mechanism, we propose to sample and pool features from more discriminative spatio-temporal sub-regions. To this end, we introduce a novel Actionness-pooled Deep-convolutional Descriptor (ADD) as outlined in Fig. 2. Specifically, we first generate a group of shift-invariant trajectories under the constraint of actionness maps and then conduct spatio-temporal pooling on the
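A simplified sketch of actionness-constrained sampling followed by spatio-temporal pooling is given below. It only illustrates the general idea under assumed inputs (per-frame conv features and actionness maps); the trajectory generation and pooling scheme of the actual method outlined in Fig. 2 are more elaborate than this.

```python
# Simplified sketch (assumptions, not the paper's exact pipeline): sample spatial
# positions with probability proportional to actionness, then pool conv features
# at those positions, frame by frame, followed by temporal max pooling.
import numpy as np

def sample_positions(actionness, num_samples, rng):
    """Draw spatial positions with probability proportional to actionness."""
    h, w = actionness.shape
    probs = (actionness / actionness.sum()).ravel()
    idx = rng.choice(h * w, size=num_samples, replace=True, p=probs)
    return np.stack(np.unravel_index(idx, (h, w)), axis=1)          # (num_samples, 2) as (y, x)

def pool_video_descriptor(video_feats, video_actionness, num_samples=64, seed=0):
    """video_feats: list of (C, H, W) arrays; video_actionness: list of (H, W) arrays."""
    rng = np.random.default_rng(seed)
    pooled = []
    for feats, act in zip(video_feats, video_actionness):
        pts = sample_positions(act, num_samples, rng)
        pooled.append(feats[:, pts[:, 0], pts[:, 1]].mean(axis=1))  # (C,) per frame
    return np.max(np.stack(pooled), axis=0)                         # temporal max pooling -> (C,)
```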

Experimental settings

In this section, we first introduce the datasets and the evaluation protocol. Then we describe the implementation details of our method. Finally, we introduce an integrated action recognition model that effectively combines the benefits of our ADD and traditional end-to-end ConvNets.
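For the integrated model, a natural baseline is late fusion of the class scores from the two branches. The following sketch shows such a weighted score fusion; the fusion weight alpha and the assumption that both branches output normalized class scores are illustrative choices, not necessarily the paper's exact combination scheme.

```python
# Minimal late-fusion sketch: combine class scores from an ADD-based classifier
# with the softmax output of an end-to-end ConvNet for the same video.
import numpy as np

def fuse_scores(add_scores, convnet_scores, alpha=0.5):
    """add_scores, convnet_scores: (K,) class scores for one video, assumed to be
    normalized (e.g., softmax outputs); alpha weights the ADD branch and would
    typically be tuned on validation data."""
    fused = alpha * add_scores + (1.0 - alpha) * convnet_scores
    return int(np.argmax(fused)), fused
```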

Experimental results

We quantitatively evaluate the fine-grained and general action recognition performance of ADD, traditional end-to-end ConvNets, and the integrated recognition models. Several additional experiments are conducted to explore the effects of different experimental settings. We also conduct an experiment to explore how individual action clips contribute to the classification of actions.

Conclusion

In this work, we investigated the problem of fine-grained action recognition. Motivated by the visual attention mechanism, we proposed a new feature descriptor, named Actionness-pooled Deep-convolutional Descriptor (ADD), leveraging actionness as the discrimination and importance cues to sample and pool features. The ADD is capable of capturing subtle differences in local regions between actions similar in overall appearance and motion. The experiments on HIT Dances dataset demonstrate that ADD

Declaration of Competing Interest

The authors declare that they have no competing interests.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Project Nos. 61772158, 61702136, U1711265 and 61472103.


References (74)

  • H. Choi et al., Fine-grained attention mechanism for neural machine translation, Neurocomputing (2018)
  • S. Saha, G. Singh, M. Sapienza, P.H. Torr, F. Cuzzolin, Deep Learning for detecting Multiple Space-Time Action Tubes in...
  • I. Laptev et al., Learning realistic human actions from movies, Proceedings of the 2008 IEEE Computer Vision and Pattern Recognition, CVPR (2008)
  • H. Wang et al., Action recognition with improved trajectories, Proceedings of the 2013 IEEE International Conference on Computer Vision, ICCV (2013)
  • D. Tran et al., Learning spatiotemporal features with 3D convolutional networks, Proceedings of the 2015 IEEE International Conference on Computer Vision, ICCV (2015)
  • K. Simonyan et al., Two-stream convolutional networks for action recognition in videos, Proceedings of the 2014 Neural Information Processing Systems, NIPS (2014)
  • C. Schuldt et al., Recognizing human actions: a local SVM approach, Proceedings of the 2004 IEEE International Conference on Pattern Recognition, ICPR (2004)
  • K. Soomro, A.R. Zamir, M. Shah, UCF101: A Dataset of 101 Human Actions Classes From Videos in the Wild, (2012) arXiv:...
  • Y. Peng, X. He, J. Zhao, Object-Part Attention Model for Fine-Grained Image Classification, (2017) arXiv:...
  • T. Han et al., Dancelets mining for video recommendation based on dance styles, IEEE Trans. Multimed. (2017)
  • L. Wang, Y. Xiong, Z. Wang, Y. Qiao, Towards Good Practices for Very Deep Two-Stream ConvNets, (2015) arXiv:...
  • H. Jhuang et al., Towards understanding action recognition, Proceedings of the 2013 IEEE International Conference on Computer Vision, ICCV (2013)
  • H. Wang et al., Dense trajectories and motion boundary descriptors for action recognition, Int. J. Comput. Vis. (2013)
  • J. Yu et al., Multi-task autoencoder model for recovering human poses, IEEE Trans. Ind. Electron. (2018)
  • C. Hong et al., Multi-modal face pose estimation with multi-task manifold deep learning, IEEE Trans. Ind. Inf. (2018)
  • A. Eweiwi et al., Efficient pose-based action recognition, Proceedings of the 2014 Asian Conference on Computer Vision, ACCV (2014)
  • W. Liu et al., p-Laplacian regularized sparse coding for human activity recognition, IEEE Trans. Ind. Electron. (2016)
  • A. Krizhevsky et al., ImageNet classification with deep convolutional neural networks, Proceedings of the 2012 Neural Information Processing Systems, NIPS (2012)
  • L. Wang et al., Temporal segment networks: towards good practices for deep action recognition, Proceedings of the 2016 European Conference on Computer Vision, ECCV (2016)
  • A. Diba, V. Sharma, L. Van Gool, Deep Temporal Linear Encoding Networks, (2016) arXiv:...
  • Y. Wang et al., Spatiotemporal pyramid network for video action recognition, Proceedings of the 2017 IEEE Computer Vision and Pattern Recognition, CVPR (2017)
  • J. Carreira et al., Quo vadis, action recognition? A new model and the kinetics dataset, Proceedings of the 2017 IEEE Computer Vision and Pattern Recognition, CVPR (2017)
  • K. Schindler et al., Action snippets: how many frames does human action recognition require?, Proceedings of the 2008 IEEE Computer Vision and Pattern Recognition, CVPR (2008)
  • J. Wang, A. Cherian, F. Porikli, S. Gould, Action Representation Using Classifier Decision Boundaries, (2017) arXiv:...
  • A. Kar, N. Rai, K. Sikka, G. Sharma, AdaScan: Adaptive Scan Pooling in Deep Convolutional Neural Networks for Human...
  • A. Piergiovanni, C. Fan, M.S. Ryoo, Learning Latent Sub-Events in Activity Videos Using Temporal Attention Filters,...
  • Y. Shi, Y. Tian, Y. Wang, T. Huang, Joint Network Based Attention for Action Recognition, (2016) arXiv:...
  • G. Gkioxari et al., Contextual action recognition with R*CNN, Proceedings of the 2015 IEEE International Conference on Computer Vision, ICCV (2015)
  • A. Miech, I. Laptev, J. Sivic, Learnable pooling with context gating for video classification, (2017) arXiv:...
  • J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, Y. Wei, Deformable Convolutional Networks, (2017) arXiv:...
  • L. Wang et al., Action recognition with trajectory-pooled deep-convolutional descriptors, Proceedings of the 2015 IEEE Computer Vision and Pattern Recognition, CVPR (2015)
  • W. Zhu et al., A key volume mining deep framework for action recognition, Proceedings of the 2016 IEEE Computer Vision and Pattern Recognition, CVPR (2016)
  • R. Girdhar et al., Attentional pooling for action recognition, Proceedings of the 2017 Neural Information Processing Systems, NIPS (2017)
  • G. Zheng et al., Fine-grained image recognition via weakly supervised click data guided bilinear CNN model, Proceedings of the 2017 IEEE International Conference on Multimedia and Expo, ICME (2017)
  • M. Tan et al., User-click-data-based fine-grained image recognition via weakly supervised metric learning, ACM Trans. Multimed. Comput. Commun. Appl. (2018)
  • C. Luo et al., Annotation modification for fine-grained visual recognition, Neurocomputing (2016)
  • Z. Kyaw et al., Matryoshka peek: towards learning fine-grained, robust, discriminative features for product search, IEEE Trans. Multimed. (2017)

Tingting Han is currently a Ph.D. candidate at Harbin Institute of Technology, Harbin, China. She received the B.S. and M.S. degrees in computer science from the School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China, in 2012 and 2014, respectively. She studied at the University of Michigan, Ann Arbor, USA, as a visiting student from 2015 to 2016. Her research interests include computer vision, multimedia, and machine learning, especially focusing on the analysis of human activities in video understanding.

Hongxun Yao received the B.S. and M.S. degrees in computer science from the Harbin Shipbuilding Engineering Institute, Harbin, China, in 1987 and 1990, respectively, and the Ph.D. degree in computer science from Harbin Institute of Technology in 2003. She is currently a professor with the School of Computer Science and Technology, Harbin Institute of Technology. Her research interests include computer vision, pattern recognition, multimedia computing, and human-computer interaction technology. She has published 6 books and over 200 scientific papers, was awarded the honorary title of New Century Excellent Talent in China, and is an expert enjoying special government allowances in Heilongjiang Province, China.

Xiaoshuai Sun received the B.S. degree in computer science from Harbin Engineering University in 2007, and the M.S. and Ph.D. degrees in computer science and technology from Harbin Institute of Technology in 2009 and 2015, respectively. He is currently a lecturer at Harbin Institute of Technology and a postdoctoral researcher at the University of Queensland. He was a research intern with Microsoft Research Asia (2012–2013) and a winner of the Microsoft Research Asia Fellowship in 2011. He holds 2 authorized patents and has authored over 60 refereed journal and conference papers in IEEE Transactions on Image Processing, Pattern Recognition, ACM Multimedia, and IEEE CVPR.

Wenlong Xie is currently a Ph.D. candidate at Harbin Institute of Technology, Harbin, China. He received the B.S. and M.S. degrees in computer science from the School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China, in 2013 and 2015, respectively. His research interests include computer vision and machine learning, especially focusing on video understanding and analysis.

Sicheng Zhao received the Ph.D. degree from Harbin Institute of Technology in 2016. He is now a postdoctoral research fellow in the Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, USA. He was a research assistant at the School of Computing, National University of Singapore, from 2013 to 2014, advised by Prof. Tat-Seng Chua and Prof. Yue Gao. His research interests include multimedia affective computing, social media analysis, and computer vision, especially focusing on image emotion analysis and related applications. His research has been published in top-tier journals such as IEEE TAFFC, TMM, TCSVT, TITS, and ACM TOMM, and top-tier conferences such as ACM MM, AAAI, and CVPR.

Wei Yu received the B.S. and M.S. degrees from Harbin Institute of Technology, Harbin, China, in 2009 and 2012, respectively, and the Ph.D. degree in computer science and technology from Harbin Institute of Technology, Harbin, China, in 2017. He was a research intern with the Web Search and Mining Group, Microsoft Research, Beijing, China, from 2013 to 2015. His research interests include computer vision, multimedia, and machine learning.
