Action Recognition from Still Images Based on Deep VLAD Spatial Pyramids

https://doi.org/10.1016/j.image.2017.03.010

Highlights

  • A novel method of spatial pyramid VLAD encoding using patches of CNN features is proposed for action recognition from still images.

  • Adding a spatial pyramid to VLAD encoding significantly boosts system performance, while even VLAD encoding on its own improves results compared with raw CNN features.

  • The method has been validated on four widely used datasets with competitive results, demonstrating the scheme's applicability to action recognition and attribute classification.

Abstract

The recognition of human actions in still images is a challenging task in computer vision. In many applications, actions can be exploited as mid-level semantic features for high-level tasks, and they often involve fine-grained categorization, where the differences between two categories are small. Recently, deep learning approaches have achieved great success in many vision tasks, e.g., image classification, object detection, and attribute and action recognition. The Bag-of-Visual-Words (BoVW) model and its extensions, e.g., the Vector of Locally Aggregated Descriptors (VLAD) encoding, have also proved powerful in capturing global contextual information. In this paper, we propose a new action recognition scheme that combines the powerful feature representation capabilities of Convolutional Neural Networks (CNNs) with the VLAD encoding scheme. Specifically, we encode the CNN features of image patches generated by a region proposal algorithm with VLAD and then represent an image by the resulting compact code, which not only captures the fine-grained properties of the image but also contains global contextual information. To capture spatial information, we exploit the spatial pyramid representation and encode CNN features within each pyramid cell. Experiments verify that the proposed scheme is not only suitable for action recognition but is also applicable to more general recognition tasks such as attribute classification. The scheme is validated on four benchmark datasets, with competitive mAP results of 88.5% on the Stanford 40 Action dataset, 81.3% on the People Playing Musical Instrument dataset, 90.4% on the Berkeley Attributes of People dataset and 74.2% on the 27 Human Attributes dataset.

Introduction

In computer vision, many human actions, such as ‘using a mobile phone’, ‘riding a bike’ or ‘reading a book’, provide a natural description for still images and could supply significant meta-data to applications such as automatic scene description and the indexing and searching of very large image repositories. Compared with the more well-established video-based action recognition, these tasks are more difficult, as a number of obstacles stand in the way of satisfactory solutions, e.g., large variations in illumination, viewpoint and human pose, and, more importantly, the lack of motion.

Unlike video-based action recognition, which relies heavily on spatial–temporal features, solutions to human action classification from still images hinge on the acquisition of local and global contextual information. To be more specific, local information associated with discriminative parts provides detailed appearance features that are particularly pertinent to fine-grained recognition, because human actions are often localized in space, e.g., the facial region for expressions and the wrist and hand regions for many common actions. Additionally, global contextual information about the configuration of objects and scenes is also instrumental: the articulation of body parts, the pose, the objects a person interacts with and the scene in which the action is performed all contain useful information. This is well illustrated by actions in sports. For the action ‘playing football’, for example, the football itself and the playground are both strong evidence for the action category.

To represent the contextual information of images, many methods have been proposed. Yao et al. [1] proposed to use probabilistic graphical models, e.g., conditional random fields, to model mutual contextual information. In this approach, objects and humans (or human body parts) are represented as nodes in a conditional random field. By modeling the conditional probabilities, the system can assign labels by discriminating not only on the input features but also on the relationships between them.

Compared to holistic contextual features, local features or patches have the advantage of being more robust to misalignment and occlusion, and have been widely used for generic image classification. Popular local feature or patch encoding strategies include the Bag of Visual Words (BoVW) [2], Fisher Vectors (FV) [3] and the Vector of Locally Aggregated Descriptors (VLAD) [4]. Among these, the FV often performs best on a number of benchmark image datasets. VLAD aggregates local features such as the Scale-Invariant Feature Transform (SIFT) into a compact, fixed-length descriptor; it can be regarded as a simplified, non-probabilistic version of the FV and shows comparable performance [5]. Another advantage of VLAD is its computational efficiency, as it mainly involves primitive operations [6]. Recently, VLAD has been widely applied in computer vision, demonstrating excellent performance in many tasks including object detection, scene recognition and action recognition [7], [8], [9], [10].
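For reference, the standard VLAD construction can be written compactly (textbook material, added here for clarity rather than taken from this paper):

```latex
% VLAD over a k-means codebook {c_1, ..., c_K} and local descriptors
% {x_1, ..., x_N} in R^d: accumulate residuals per nearest centroid.
\[
  v_k = \sum_{i \,:\, \operatorname{NN}(x_i) = k} \left( x_i - c_k \right),
  \qquad
  V = \left[ v_1^{\top}, v_2^{\top}, \ldots, v_K^{\top} \right] \in \mathbb{R}^{Kd},
\]
% typically followed by signed-square-root (power) normalization and a
% global L2 normalization of V.
```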

While the dominant patch encoding strategies are all based on hand-crafted features, deep neural networks, and Convolutional Neural Networks (CNNs) in particular, emphasize learning robust feature representations from raw data. Krizhevsky et al. [11] showed that CNNs trained with large amounts of labeled data outperform the FV. Since then, CNNs have consistently led the classification task in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [12]. Much of the published work has considered incorporating contextual information into the CNN framework. For example, recurrent neural networks (RNNs) have been used to embed contextual information into CNNs: Bell et al. [13] proposed a deep CNN structure that plugs in RNNs to integrate contextual information for object detection, and in [14] a conditional random field was formulated as an RNN and plugged into a CNN model, optimized with mean-field inference for semantic image segmentation.

To date, CNNs have achieved considerable success in many vision tasks [11], [15], [16]. Despite these achievements, deep CNN architectures face new challenges, including the requirement for large amounts of training data and a high computational cost, with solutions relying on GPUs and other hardware acceleration techniques. CNNs also retain some limitations, e.g., a lack of geometric invariance and an inability to convey information about local elements. A promising direction for improvement is to combine the CNN with traditional encoding approaches such as VLAD to better express the local information of images [17], [18], [19]. For example, Gong et al. [5] extracted CNN activations at multiple scale levels and performed orderless VLAD pooling at each scale separately; the resulting codes were concatenated into a high-dimensional feature vector that is more robust to global deformations.

In this paper, we follow this direction and further explore the potential of augmenting CNNs with VLAD in the context of human action classification in still images. To take advantage of both the CNN and the patch feature encoding strategy, we encode the CNN features of sub-regions of the image into a compact representation. Our approach shares similarities with [19], in which the FV encoding scheme was applied to CNN features and each image was represented as a bag of windows. Our method can likewise be regarded as a bag of patches or windows: image patches are extracted with a region proposal algorithm such as EdgeBoxes [20] and subsequently encoded by VLAD for image representation.
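As a rough, minimal sketch of this bag-of-patches VLAD encoding (our own illustration, not the authors' code), the following assumes the patch-level CNN features of one image are already stacked in a NumPy array; the region proposal step and the CNN extractor are abstracted away, and the codebook is a scikit-learn k-means model fitted offline on training-set patch features. The codebook size of 64 in the comment is arbitrary, not the paper's setting.

```python
# Minimal VLAD sketch over patch-level CNN features (illustrative only).
import numpy as np
from sklearn.cluster import KMeans

def vlad_encode(descriptors, codebook):
    """Aggregate patch descriptors (N x d) into a single K*d VLAD code."""
    centers = codebook.cluster_centers_
    K, d = centers.shape
    assign = codebook.predict(descriptors)      # nearest centroid per patch
    v = np.zeros((K, d))
    for k in range(K):
        members = descriptors[assign == k]
        if len(members):                        # accumulate residuals to centroid k
            v[k] = (members - centers[k]).sum(axis=0)
    v = np.sign(v) * np.sqrt(np.abs(v))         # signed-sqrt (power) normalization
    v = v.ravel()
    return v / (np.linalg.norm(v) + 1e-12)      # global L2 normalization

# Offline: fit the codebook on patch features pooled over the training set,
# e.g. codebook = KMeans(n_clusters=64).fit(train_patch_features)
```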

Aiming to preserve crucial local features and identify contextual information from neighboring objects and scenes, the proposed approach is more likely to capture the fine-grained properties of an image than conventional approaches. To account for the spatial information that is absent in VLAD [17], spatial pyramids of the image are generated and matched to region-level CNN features. VLAD encoding is then applied within each pyramid cell, and the resulting VLAD codes are concatenated and forwarded to a classifier for final classification. In extensive experiments, we achieved state-of-the-art results on the Stanford 40 Action dataset [1] and the People Playing Musical Instrument (PPMI) dataset [21].
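A sketch of the spatial pyramid step under the same assumptions, reusing `vlad_encode` from the previous snippet: each patch is routed to one grid cell per pyramid level according to its centre, each cell is VLAD-encoded separately, and the per-cell codes are concatenated. The pyramid levels here are illustrative, not the paper's exact configuration.

```python
# Spatial Pyramid VLAD sketch: one VLAD code per grid cell, concatenated.
def spatial_pyramid_vlad(feats, centres, img_w, img_h, codebook, levels=(1, 2)):
    """feats: N x d patch features; centres: N (x, y) patch centres."""
    K, d = codebook.cluster_centers_.shape
    codes = []
    for n in levels:                                  # an n x n grid per level
        cells = [(min(int(cx * n / img_w), n - 1),    # column of each patch
                  min(int(cy * n / img_h), n - 1))    # row of each patch
                 for cx, cy in centres]
        for row in range(n):
            for col in range(n):
                mask = np.array([c == (col, row) for c in cells])
                codes.append(vlad_encode(feats[mask], codebook) if mask.any()
                             else np.zeros(K * d))    # empty cell -> zero code
    return np.concatenate(codes)                      # input to the classifier
```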

For many tasks in computer vision, such as video surveillance, image search and human–computer interaction, objects can often be conveniently identified by a set of mid-level, nameable descriptions termed semantic attributes, or simply attributes [22]. For example, a person can be described by hair length, eye color, clothing style, gender, ethnicity and age, so the recognition of visual attributes often feeds directly into many high-level tasks. To show that our proposed approach also generalizes to attribute classification, we conducted experiments on the Berkeley Attributes of People dataset [22] and the 27 Human Attributes (HAT) dataset [23], with promising results.

The rest of the paper is organized as follows. Section 2 briefly reviews previous research on action classification; Section 3 explains the proposed approach; Section 4 describes our experimental procedure and presents results demonstrating the effectiveness of the proposed approach on action and attribute classification; and Section 5 presents the conclusions.

Section snippets

Action recognition

Still image-based human action recognition has received much attention in recent years [24], [16], [25], owing to its potential to provide useful meta-data to applications such as image understanding, human–computer interaction and the indexing and searching of large-scale image archives.

The most popular conventional approach to the task is BoVW [26], [18], [27], which provides a global representation of an image. Delaitre et al. [28] applied a BoVW for image representation…

Methods

In this section, the main components of the proposed method will be described, which include patch generation, deep feature extraction and Spatial Pyramid VLAD encoding. The system pipeline is illustrated in Fig. 1.
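To make the pipeline concrete, here is a hedged end-to-end sketch combining the earlier snippets; `get_proposals`, `cnn_features` and `crop` are hypothetical placeholders for the region proposal algorithm, the pretrained CNN feature extractor and image cropping, and do not name real APIs.

```python
# End-to-end sketch of the pipeline in Fig. 1 (illustrative placeholders).
def describe_image(img, codebook, levels=(1, 2)):
    boxes = get_proposals(img)                     # patch generation (EdgeBoxes-like)
    feats = np.vstack([cnn_features(crop(img, b))  # deep feature extraction
                       for b in boxes])
    centres = [((x0 + x1) / 2.0, (y0 + y1) / 2.0)  # patch centres for pyramid cells
               for (x0, y0, x1, y1) in boxes]
    h, w = img.shape[:2]
    return spatial_pyramid_vlad(feats, centres, w, h, codebook, levels)
```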

Experiments and results

In this section, the experimental setup will be briefly described, followed by the details of the experiments on the four benchmark datasets: the Stanford 40 Action dataset and the People Playing Musical Instrument (PPMI) dataset for action recognition, and the Berkeley Attributes of People dataset and the 27 Human Attributes (HAT) dataset for attribute classification.

Conclusion

Action recognition in static images is a challenging task, partly due to its fine-grained nature and the absence of motion information. Our study indicates that information from local patches and global contextual information are critically important factors for improving action recognition performance. This is validated by our re-implementation of the Vector of Locally Aggregated Descriptors (VLAD) on top of a spatial pyramid for CNN features to identify local information…

References (71)

  • Z. Zhao et al., Semantic parts based top-down pyramid for action recognition, Pattern Recognit. Lett. (2016)
  • B. Yao, X. Jiang, A. Khosla, A.L. Lin, L. Guibas, L. Fei-Fei, Human action recognition by learning bases of action...
  • L. Fei-Fei, P. Perona, A Bayesian hierarchical model for learning natural scene categories, in: IEEE Computer Society...
  • G. Csurka, F. Perronnin, Fisher vectors: beyond bag-of-visual-words image representations, in: Computer Vision, Imaging...
  • H. Jégou, M. Douze, C. Schmid, P. Pérez, Aggregating local descriptors into a compact image representation, in: 2010...
  • Y. Gong, L. Wang, R. Guo, S. Lazebnik, Multi-scale orderless pooling of deep convolutional activation features, in:...
  • M. Harandi, M. Salzmann, F. Porikli, When vlad met Hilbert, arXiv preprint...
  • G. Sharma, F. Jurie, C. Schmid, Discriminative spatial saliency for image classification, in: 2012 IEEE Conference on...
  • V. Delaitre, I. Laptev, J. Sivic, Recognizing human actions in still images: a study of bag-of-features and part-based...
  • H. Wang et al., Dense trajectories and motion boundary descriptors for action recognition, Int. J. Comput. Vis. (2013)
  • X. Peng, C. Zou, Y. Qiao, Q. Peng, Action recognition with stacked fisher vectors, in: Computer Vision—ECCV 2014,...
  • A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances...
  • O. Russakovsky et al., Imagenet large scale visual recognition challenge, Int. J. Comput. Vis. (2015)
  • S. Bell, C.L. Zitnick, K. Bala, R. Girshick, Inside–outside net: detecting objects in context with skip pooling and...
  • S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, P.H. Torr, Conditional random fields as...
  • R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic...
  • G. Gkioxari, R. Girshick, J. Malik, Contextual action recognition with r* cnn, in: Proceedings of the IEEE...
  • A. Shin, M. Yamaguchi, K. Ohnishi, T. Harada, Dense image representation with spatial pyramid vlad coding of cnn for...
  • D. Oneata, J. Verbeek, C. Schmid, Action and event recognition with fisher vectors on a compact feature set, in:...
  • T. Uricchio, M. Bertini, L. Seidenari, A.D. Bimbo, Fisher encoded convolutional bag-of-windows for efficient image...
  • C.L. Zitnick, P. Dollár, Edge boxes: locating object proposals from edges, in: Computer Vision—ECCV 2014, Springer,...
  • B. Yao, L. Fei-Fei, Grouplet: a structured image representation for recognizing human and object interactions, in: 2010...
  • L. Bourdev, S. Maji, J. Malik, Describing people: a poselet-based approach to attribute classification, in: 2011...
  • G. Sharma, F. Jurie, Learning discriminative spatial representation for image classification, in: BMVC 2011—British...
  • F.S. Khan et al., Recognizing actions through action-specific person detection, IEEE Trans. Image Process. (2015)
  • G. Gkioxari, R. Girshick, J. Malik, Actions and attributes from wholes and parts, in: Proceedings of the IEEE...
  • X. Peng, L. Wang, X. Wang, Y. Qiao, Bag of visual words and fusion methods for action recognition: comprehensive study...
  • M.M. Ullah, S.N. Parizi, I. Laptev, Improving bag-of-features action recognition with non-local cues, in: BMVC, vol....
  • C. Sun, R. Nevatia, Large-scale web video event classification by use of fisher vectors, in: 2013 IEEE Workshop on...
  • M. Jain, H. Jégou, P. Bouthemy, Better exploiting motion for better action recognition, in: Proceedings of the IEEE...
  • S. Savarese, J. Winn, A. Criminisi, Discriminative object class models of appearance and shape by correlations, in:...
  • F.S. Khan, R.M. Anwer, J. van de Weijer, M. Felsberg, J. Laaksonen, Deep semantic pyramids for human attributes and...
  • B. Yao, L. Fei-Fei, Modeling mutual context of object and human pose in human-object interaction activities, in: 2010...
  • A. Prest et al., Weakly supervised learning of interactions between humans and objects, IEEE Trans. Pattern Anal. Mach. Intell. (2012)