Robust geometric p-norm feature pooling for image classification and action recognition*

https://doi.org/10.1016/j.imavis.2016.04.002

Highlights

  • Consider the spatial distribution information in feature pooling

  • Handle the misalignment problems in image classification and action recognition

  • Improve the discrimination of feature pooling in visual recognition tasks

  • Conduct extensive experiments on visual recognition tasks

Abstract

Feature pooling is a key component of modern visual classification systems. However, the two prevailing pooling techniques, average and max pooling, are not theoretically optimal: they irrecoverably lose spatial information during statistical summarization and rest on an over-simplified assumption about the feature distribution. To address these issues, this paper generalizes previous pooling methods to a weighted p-norm spatial pooling function tailored to the class-specific spatial distribution of features. Optimizing this pooling function for discriminative class separability, subject to a spatial smoothness constraint, yields the geometric p-norm pooling (GLP) method. Furthermore, variation in object scale and position affects both the learning of discriminative pooling weights and the applicability of the learned weights. To handle it, we propose a simple yet effective self-alignment step, applied during both learning and testing, that adaptively adjusts the pooling weights for individual images. Image segmentation and a visual saliency map are used to construct a directed pixel adjacency graph. The discriminative pooling weights are then diffused by a random walk on this graph, so that they propagate onto the salient foreground region. This leads to a robust version of GLP (RGLP) that copes with misalignment of object position and scale in images. Comprehensive experiments validate the effectiveness of the proposed GLP feature pooling framework. The random walk based self-alignment step effectively alleviates the image misalignment issue and further boosts classification accuracy. State-of-the-art image classification and action recognition performance is attained on several benchmarks.

Introduction

Driven by the increasing amount of image and video data from the internet and surveillance cameras, computer vision areas such as image classification [59], [63], image re-ranking [58], [60], and action recognition [45], [61] have made significant progress in recent years. As an important step in many practical visual recognition tasks, feature selection is of great interest to many researchers [32], [33], [34], [44]. With the prevalence of the bag-of-words (BoW) model [31] for image classification and image-based action recognition [46], feature pooling has become common practice for image/video feature representation and selection. In a typical image classification task, local image features are first extracted and quantized according to a visual dictionary. Then, the quantization indices of all the local features are summarized to form the global feature representation. The most common summarization method is the histogram, i.e. summing all the occurrences of each index over the entire image in an orderless manner. From the viewpoint of feature pooling [12], [28], the histogram representation is equivalent to average pooling. Despite its simplicity and compactness, average pooling is not immune to local feature noise.

To overcome this limitation, max pooling has been proposed [39], [40]. Instead of averaging, max pooling takes the element-wise maximum of the feature vectors over the whole image or the region of interest as the pooled feature. Max pooling has proved to be more robust against local feature noise and can achieve better classification performance [55].
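As a minimal sketch of the two baseline operators discussed above, the following Python snippet pools a small set of encoded local features. The function names and the toy data are illustrative assumptions, not taken from the paper. With hard (one-hot) assignment to visual words, average pooling reproduces the normalized BoW histogram.

```python
from typing import List


def average_pool(codes: List[List[float]]) -> List[float]:
    """Element-wise mean of the coded local features."""
    n = len(codes)
    return [sum(column) / n for column in zip(*codes)]


def max_pool(codes: List[List[float]]) -> List[float]:
    """Element-wise maximum of the coded local features."""
    return [max(column) for column in zip(*codes)]


# Toy example (hypothetical): four local features, hard-assigned to a
# dictionary of three visual words.
indices = [0, 2, 2, 1]
dictionary_size = 3
one_hot = [[1.0 if k == i else 0.0 for k in range(dictionary_size)]
           for i in indices]

# Normalized histogram of quantization indices.
histogram = [indices.count(k) / len(indices) for k in range(dictionary_size)]

avg = average_pool(one_hot)   # matches the normalized histogram
mx = max_pool(one_hot)        # 1.0 for every word that occurs at least once
```

Note that max pooling here only records whether a word occurs at all, while average pooling records how often it occurs; neither records where.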

Both average and max pooling implicitly assume that the spatial distribution of each visual feature is uniform across classes, which causes severe information loss, because the spatial distribution of features can be important for visual recognition. In the image classification task, if we assume the objects/regions in the images are roughly aligned, the local image features do possess class-specific discriminative geometric information, i.e. spatial distribution patterns. Fig. 1 illustrates this issue for the average and max pooling methods. For images from a specific class, the visual features indexed by the same visual word often share a similar spatial distribution. Moreover, such class-specific spatial distributions are quite distinct from each other and encode discriminative information. However, as shown in this figure, neither average nor max pooling can capture the underlying difference and produce discriminative features, because spatial information is lost in the pooling process.
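The loss of spatial information can be made concrete with a toy case. In the sketch below, which uses made-up response values on a one-dimensional strip of spatial cells, two hypothetical classes have the same responses in mirrored positions. Because average and max pooling are invariant to any reordering of the cells, both classes receive identical pooled values.

```python
from typing import List


def average_pool(values: List[float]) -> float:
    """Mean response over all spatial cells."""
    return sum(values) / len(values)


def max_pool(values: List[float]) -> float:
    """Largest response over all spatial cells."""
    return max(values)


# Responses of one visual word on six spatial cells (hypothetical values).
class_a = [0.75, 0.5, 0.25, 0.0, 0.0, 0.125]   # word fires mostly on the left
class_b = list(reversed(class_a))               # same values, mirrored layout

same_avg = average_pool(class_a) == average_pool(class_b)
same_max = max_pool(class_a) == max_pool(class_b)
```

Both flags evaluate to `True`, so a classifier that only sees the pooled values cannot tell the two layouts apart.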

Moreover, these two deterministic pooling methods either treat all the local features uniformly or only select the most salient one, and they both assume local features are distributed independently. By comparison, a discriminative pooling scheme is expected to be more flexible and able to capture the spatial correlation of features.

Motivated by the above considerations, we propose a so-called geometric p-norm pooling method. Overall, the proposed method aims to learn a pooling function that implicitly encodes the class-specific geometric information of the feature distribution in the form of a weighted norm. This function is optimized for the best class separability while taking into account the following prior knowledge: nearby image pixels often present similar characteristics, so a regularization term is employed that encodes the correlation of local features.
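This excerpt does not reproduce the paper's exact pooling formula, so the following sketch assumes one common form of a weighted p-norm: f(v; w, p) = (sum_i w_i * v_i^p)^(1/p), with non-negative responses v_i, non-negative spatial weights w_i that sum to one, and p >= 1. The function name and the toy numbers are hypothetical. Under this assumed form, uniform weights with p = 1 give average pooling, and a large p approaches max pooling.

```python
from typing import Sequence


def weighted_p_norm_pool(values: Sequence[float],
                         weights: Sequence[float],
                         p: float) -> float:
    """Assumed form: (sum_i w_i * v_i**p) ** (1/p), for v_i, w_i >= 0 and p >= 1."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if p < 1:
        raise ValueError("p must be at least 1")
    peak = max(values)
    if peak == 0:
        return 0.0
    # Factor out the largest response so that large p does not underflow.
    total = sum(w * (v / peak) ** p for v, w in zip(values, weights))
    return peak * total ** (1.0 / p)


responses = [0.75, 0.5, 0.25, 0.0]
uniform = [0.25] * 4

as_average = weighted_p_norm_pool(responses, uniform, p=1)     # about 0.375
near_max = weighted_p_norm_pool(responses, uniform, p=200)     # close to 0.75

# Hypothetical spatial weights that emphasize the left cells separate a
# layout from its mirror image, which average and max pooling cannot do.
left_focused = [0.4, 0.4, 0.1, 0.1]
pooled_a = weighted_p_norm_pool(responses, left_focused, p=2)
pooled_b = weighted_p_norm_pool(list(reversed(responses)), left_focused, p=2)
```

In the paper the weights and p are learned from data; here they are fixed by hand only to show how the two parameters interpolate between the classical operators and add position sensitivity.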

Another inevitable problem for image classification or action recognition is the misalignment of the image foreground, which is caused by large variation in object position and scale across images. Misalignment of the foreground regions/objects in the training images degrades the effectiveness of the learned discriminative feature pooling function. Moreover, if the object position and scale of a testing image are not aligned with those of the training images, the learned common pooling function cannot capture the discriminative features for classification.

In this work, we propose a simple yet effective self-alignment method using side information from visual saliency [21] and image segmentation [2], which can not only adaptively adjust the discriminative pooling weights for individual images during training, but also tailor the learned pooling function to individual testing images. A basic observation is that within a visually consistent (e.g., homogeneous in color) local image region, pixels convey similar discriminative information, so the pooling weights for pixels within the same local region should be similar. Motivated by this observation, we construct an adjacency graph where nodes represent pixels and edges encode the spatial and color adjacency between pixels. A simple random walk algorithm can effectively and efficiently diffuse and adapt the learned common pooling function onto individual images based on the constructed adjacency graph. Further, a visual saliency map is used to convert the adjacency graph into a directed graph, which steers the propagation of pooling weights toward the object (foreground) region of the given image. This random walk based self-alignment step results in an image-specific adaptive feature pooling scheme that is robust to image foreground misalignment.
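The excerpt does not give the exact update rule, so the sketch below assumes a random walk with restart, w <- alpha * P^T w + (1 - alpha) * w0, on a tiny hand-made graph. The restart factor `alpha`, the way saliency scales the transition probabilities, and all numbers are assumptions for illustration. Each node stands for a pixel (or region), `affinity` encodes spatial and color adjacency, and multiplying each edge by the saliency of its target node makes the graph directed.

```python
from typing import List


def transition_matrix(affinity: List[List[float]],
                      saliency: List[float]) -> List[List[float]]:
    """Row-stochastic P with P[i][j] proportional to affinity[i][j] * saliency[j]."""
    n = len(affinity)
    matrix = []
    for i in range(n):
        row = [affinity[i][j] * saliency[j] for j in range(n)]
        total = sum(row)
        if total == 0:
            # An isolated node keeps its own mass.
            row = [1.0 if j == i else 0.0 for j in range(n)]
        else:
            row = [x / total for x in row]
        matrix.append(row)
    return matrix


def diffuse(w0: List[float], P: List[List[float]],
            alpha: float = 0.8, steps: int = 50) -> List[float]:
    """Iterate w <- alpha * P^T w + (1 - alpha) * w0 for a fixed number of steps."""
    n = len(w0)
    w = list(w0)
    for _ in range(steps):
        w = [alpha * sum(P[i][j] * w[i] for i in range(n)) + (1 - alpha) * w0[j]
             for j in range(n)]
    return w


# Four nodes in a chain 0-1-2-3 (hypothetical); the object sits on the right.
affinity = [[0, 1, 0, 0],
            [1, 0, 1, 0],
            [0, 1, 0, 1],
            [0, 0, 1, 0]]
saliency = [0.1, 0.1, 0.9, 0.9]
learned = [0.4, 0.4, 0.1, 0.1]     # common weights, concentrated on the left

adapted = diffuse(learned, transition_matrix(affinity, saliency))
foreground_mass = adapted[2] + adapted[3]
```

Because every row of `P` sums to one, the total weight is preserved, while most of it moves from the left nodes to the salient nodes on the right. This is the intended effect of the self-alignment step: the same learned weights are re-targeted to where the object lies in a given image.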

Based on the GLP framework originally developed for image classification [10], we further address the misalignment problem and propose the RGLP algorithm, which can then be applied to several tasks, including image classification and action recognition. To this end, the introduction, the experiments, and other related parts are extended accordingly. Our experimental results show that the proposed robust geometric p-norm pooling scheme is insensitive to moderate image foreground misalignment. To sum up, the proposed robust geometric p-norm pooling framework has the following advantages:

  • As the pooling function is learned by directly maximizing the class separability, it is designed to bear good discriminating capability.

  • The pooling function exactly corresponds to the class specific spatial pattern of each visual word, thus the spatial distribution information of visual words is properly utilized.

  • It models the correlations among local features and makes a more reasonable assumption about feature distribution. Also it can naturally unify the average and max pooling in a more flexible framework.

  • Using the simple random walk based self-alignment module, the learned pooling weights can be tailored to individual images according to the object (foreground) position and scale, as well as the image segmentation results. The object (foreground) misalignment problem is thereby alleviated, the resulting image representation is more robust, and the classification performance is further boosted.

The remainder of this paper is organized as follows. The related literature is discussed in Section 2. Section 3 then elaborates on the geometric p-norm feature pooling method and provides the theoretical comparison with the max and average pooling methods. An iterative optimization procedure for learning the discriminative pooling weights is also presented. In Section 5 we introduce the random walk based self-alignment method to alleviate the image misalignment problem, which results in an image-specific adaptive pooling scheme. In Section 6 extensive experimental results on benchmarks are presented and conclusions are drawn in Section 7.

Section snippets

Related work

The idea of feature pooling originates in research on complex cells in the striate cortex [20]. The authors of [20] proposed a model in which the responses of simple cells are fed into higher complex cells through pooling operations, thereby endowing the complex cells with phase invariance. Inspired by this seminal work, several extensions of the pooling mechanism have since been proposed and widely applied in recent computer recognition systems. In the neocognitron model [12]

Geometric p-norm feature pooling

The pipeline of a popular image classification procedure is shown in Fig. 2. As can be seen from the figure, a multi-stage image classification architecture generally comprises four components. After local features are extracted from the input image, many methods can be used to encode the feature vectors.

The first two building blocks are feature extraction and encoding. We assume that there are n_c image classes, and the class index set is denoted as C = {1, 2, ..., j, ..., n_c}. Additionally, we denote the

Class separability

To determine the parameters in the GLP, we adopt the class separability as the objective function and optimize it with respect to both w and p. A practical choice of the class separability criterion is the marginal Fisher analysis (MFA) developed in [53]. MFA can well characterize the class separability of the data with more general distributions beyond the Gaussian distribution. More specifically, the objective function is to maximize the inter-class separability scaled by the within-class

Robust adaptive pooling for misaligned image

Object position and scale vary across images. However, the discriminative pooling function derived above assumes roughly aligned foreground regions and does not adapt to changes of object position/scale in testing images. It is therefore preferable to have an adaptive feature pooling scheme in which the discriminative pooling function can be tailored to individual images, so that the pooled image representation is robust to foreground misalignment. In

Experiments

In this section, we evaluate the performance of the proposed GLP method and of RGLP, its enhanced version that handles misalignment, and compare them with the widely used average and max pooling methods. First, we investigate the separability of the pooling results produced by GLP and the other two methods on a synthesized dataset, which possesses distinctive spatial distribution patterns for different classes. Then we evaluate GLP and RGLP along with the average and max pooling on real-world

Conclusion

In this work, we first proposed a geometric p-norm pooling (GLP) method to perform feature pooling. Different from traditional feature pooling methods, e.g. the average and max pooling, the GLP method can utilize the geometric information of the feature spatial distributions and thus provide more discriminative pooling results. Second, we proposed a simple yet effective random walk based image self-alignment step to alleviate the foreground misalignment issue in geometric p-norm feature

Acknowledgments

This work is supported by the National Natural Science Foundation (NSF) of China (No. 61572029, No. 61300056), and the Science and Technology Project of Anhui Province (No. 1501b042207).

References (63)

  • K. Fukushima et al. Neocognitron: a new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recogn. (1982)

  • R. Achanta et al. Frequency-tuned salient region detection.

  • R. Achanta et al. SLIC Superpixels Compared to State-of-the-art Superpixel Methods. IEEE Trans. Pattern Anal. Mach. Intell. (2012)

  • K. Balasubramanian et al. Smooth sparse coding via marginal regression for learning sparse representations.

  • O. Boiman et al. In defense of Nearest-Neighbor based image classification.

  • A. Bosch et al. Image Classification using Random Forests and Ferns.

  • Y. Boureau et al. A Theoretical Analysis of Feature Pooling in Visual Recognition.

  • Y. Chai et al. BiCoS: a bi-level co-segmentation method for image classification.

  • Q. Chen et al. Hierarchical Matching with Side Information for Image Classification.

  • V. Delaitre et al. Recognizing human actions in still images: a study of bag-of-features and part-based representations.

  • J. Feng et al. Geometric Lp-norm feature pooling for image classification.

  • R. Fisher. The Use of Multiple Measurements in Taxonomic Problems. Ann. Eugen. (1936)

  • S. Gao et al. Local features are not lonely - Laplacian sparse coding for image classification.

  • P. Gehler et al. On feature combination for multiclass object classification.

  • J. Gemert et al. Kernel Codebooks for Scene Categorization.

  • G. Griffin et al. Caltech-256 Object Category Dataset. (2007)

  • A. Gupta et al. Observing human–object interactions: using spatial and functional compatibility for recognition. IEEE Trans. Pattern Anal. Mach. Intell. (2009)

  • J. Harel et al. Graph-Based Visual Saliency.

  • X. Hou et al. Saliency detection: a spectral residual approach.

  • D. Hubel et al. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol. (1962)

  • L. Itti et al. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell. (1998)

  • P. Jain et al. Fast image search for learned metrics.

  • K. Jarrett et al. What is the best multi-stage architecture for object recognition?

  • L.-J. Li et al. Object bank: a high-level image representation for scene classification and semantic feature sparsification.

  • C. Kanan et al. Robust classification of objects, faces, and flowers using natural image statistics.

  • S. Lazebnik et al. Supervised Learning of Quantizer Codebooks by Information Loss Minimization. IEEE Trans. Pattern Anal. Mach. Intell. (2009)

  • S. Lazebnik et al. Beyond bags of features: spatial pyramid matching for recognizing natural scene categories.

  • Y. LeCun et al. Handwritten digit recognition with a back-propagation network.

  • F. Li et al. Object recognition as ranking holistic figure-ground hypotheses.

  • F. Li et al. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories.

  • F. Li et al. A Bayesian hierarchical model for learning natural scene categories.

* This paper has been recommended for acceptance by Ling Shao.
