Robust geometric ℓp-norm feature pooling for image classification and action recognition*
Introduction
Driven by the increasing amount of image and video data from the internet and surveillance cameras, computer vision areas such as image classification [59], [63], image re-ranking [58], [60], and action recognition [45], [61] have made significant progress in recent years. As an important step in many practical visual recognition tasks, feature selection is of great interest to many researchers [32], [33], [34], [44]. With the prevalence of the bag-of-words (BoW) model [31] for image classification and image-based action recognition [46], feature pooling has become a common practice for image/video feature representation and selection. In a typical image classification task, local image features are first extracted and quantized according to a visual dictionary. The quantization indices of all the local features are then summarized to form the global feature representation. The most common summarization method is the histogram, i.e., summing up all the occurrences of each index over the entire image in an orderless manner. From the viewpoint of feature pooling [12], [28], the histogram representation is equivalent to average pooling. Despite its simplicity and compactness, average pooling is not immune to local feature noise.
To overcome this limitation, max pooling has been proposed [39], [40]. Instead of averaging, max pooling takes the element-wise maximum of the feature vectors over the whole image or the region of interest as the pooled feature. Max pooling has proved to be more robust against local feature noise and can achieve better classification performance [55].
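To make the two baseline operators concrete, the following minimal sketch quantizes synthetic descriptors against a random dictionary and pools the resulting one-hot codes. The function names (`hard_assign`, `average_pool`, `max_pool`), the dictionary size, and the random data are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def hard_assign(descriptors, dictionary):
    """Quantize each local descriptor to its nearest visual word (one-hot codes)."""
    # Squared Euclidean distance between every descriptor and every visual word.
    d2 = ((descriptors[:, None, :] - dictionary[None, :, :]) ** 2).sum(axis=2)
    codes = np.zeros((descriptors.shape[0], dictionary.shape[0]))
    codes[np.arange(descriptors.shape[0]), d2.argmin(axis=1)] = 1.0
    return codes

def average_pool(codes):
    """Average the codes over all locations."""
    return codes.mean(axis=0)

def max_pool(codes):
    """Take the element-wise maximum of the codes over all locations."""
    return codes.max(axis=0)

rng = np.random.default_rng(0)
dictionary = rng.normal(size=(4, 8))     # 4 hypothetical visual words, 8-D descriptors
descriptors = rng.normal(size=(50, 8))   # 50 synthetic local descriptors

codes = hard_assign(descriptors, dictionary)
hist = codes.sum(axis=0) / codes.shape[0]  # normalized BoW histogram
avg = average_pool(codes)
mx = max_pool(codes)
```

With one-hot codes, average pooling reproduces the normalized histogram, whereas max pooling only records whether a visual word occurs at all.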
Both average and max pooling implicitly assume that the spatial distribution of each visual feature is uniform across different classes. This assumption causes severe information loss, because the spatial distribution of features can be important for visual recognition. In the image classification task, if we assume the objects/regions in the images are roughly aligned, the local image features do possess class-specific discriminative geometric information, i.e., spatial distribution patterns. Fig. 1 illustrates this issue for the average and max pooling methods. For images from a specific class, the visual features indexed by the same visual word often share a similar spatial distribution. Moreover, such class-specific spatial distributions are quite distinct from each other and encode discriminative information. However, as shown in the figure, neither average nor max pooling can capture the underlying difference and produce discriminative features, because the spatial information is lost in the pooling process.
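The loss of spatial information can be illustrated with a small synthetic example. In the sketch below, two hypothetical classes have identical responses for one visual word, located on opposite halves of the image. The grid size, the response maps, and the hand-set location mask are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical response maps of one visual word on an 8x8 grid of locations.
class_a = np.zeros((8, 8))
class_a[:, :4] = rng.random((8, 4))   # class A: responses on the left half
class_b = class_a[:, ::-1].copy()     # class B: same responses mirrored to the right half

# Orderless pooling yields the same value for both spatial layouts.
avg_a, avg_b = class_a.mean(), class_b.mean()
max_a, max_b = class_a.max(), class_b.max()

# A location-aware weighting (hand-set here, not learned) separates the two layouts.
left_mask = np.zeros((8, 8))
left_mask[:, :4] = 1.0
loc_a = (left_mask * class_a).sum()
loc_b = (left_mask * class_b).sum()
```

Both orderless statistics coincide for the two classes, while even a crude location-dependent weighting tells them apart.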
Moreover, these two deterministic pooling methods either treat all the local features uniformly or only select the most salient one, and they both assume local features are distributed independently. By comparison, a discriminative pooling scheme is expected to be more flexible and able to capture the spatial correlation of features.
Motivated by the above considerations, we propose a geometric ℓp-norm pooling (GLP) method. The proposed method learns a pooling function that implicitly encodes the class-specific geometric information of the feature distribution in the form of a weighted norm. This function is optimized for the best class separability while taking into account the following prior knowledge: nearby local image pixels often present similar characteristics, so a regularization term that encodes the correlation of local features is employed.
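As a rough illustration of what a weighted-norm pooling function could look like, the sketch below pools non-negative responses u with location weights w as (sum_i w_i * u_i^p)^(1/p). This particular form, the function names, and the squared-difference smoothness penalty are assumptions made for illustration; the exact formulation and regularizer used by GLP are given in the method section of the paper.

```python
import numpy as np

def weighted_lp_pool(u, w, p):
    """Assumed weighted lp-norm pooling: (sum_i w_i * u_i**p) ** (1/p), for u >= 0."""
    u = np.asarray(u, dtype=float)
    w = np.asarray(w, dtype=float)
    return float(np.sum(w * u ** p) ** (1.0 / p))

def smoothness_penalty(w_grid):
    """Assumed regularizer: squared differences between weights of 4-connected neighbors."""
    dv = np.diff(w_grid, axis=0)  # vertical neighbor differences
    dh = np.diff(w_grid, axis=1)  # horizontal neighbor differences
    return float((dv ** 2).sum() + (dh ** 2).sum())

u = np.array([0.1, 0.4, 0.9, 0.2])       # responses of one visual word at 4 locations
uniform = np.full(u.size, 1.0 / u.size)  # uniform location weights

avg_like = weighted_lp_pool(u, uniform, p=1)    # p = 1 with uniform weights: the mean
max_like = weighted_lp_pool(u, uniform, p=200)  # large p: approaches the maximum
```

Under this assumed form, uniform weights with p = 1 recover average pooling, and a large p approaches max pooling, which matches the claim that the two can be unified in one framework.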
Another inevitable problem for image classification or action recognition is the misalignment of the image foreground, which is caused by large variations in object position/scale across images. Misalignment of the foreground regions/objects in the training images degrades the effectiveness of the learned discriminative feature pooling function. Moreover, if the object position and scale of a testing image are not aligned with those of the training images, the learned common pooling function cannot capture the discriminative features for classification.
In this work, we propose a simple yet effective self-alignment method using the side information from visual saliency [21] and image segmentation [2], which can not only adaptively adjust the discriminative pooling weights for individual images during the training process, but also tailor the learned pooling function to individual testing images. A basic observation is that within a visually consistent (e.g., homogeneous color) local image region, pixels convey similar discriminative information, so the pooling weights for the pixels within the same local region should be similar. Motivated by this observation, we construct an adjacency graph where nodes represent pixels and edges encode the spatial and color adjacency between pixels. A simple random walk algorithm can effectively and efficiently diffuse and adapt the learned common pooling function onto individual images based on the constructed adjacency graph. Further, a visual saliency map is used to convert the adjacency graph into a directed graph, which directs the propagation of the pooling weights toward the object (foreground) region of the given image. This random walk based self-alignment step results in an image-specific adaptive feature pooling scheme that is robust to image foreground misalignment.
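The self-alignment idea can be sketched as a random walk with restart on a pixel grid. Everything specific in the code below is an assumption for illustration: the Gaussian color affinity, the way saliency multiplies the edge weight toward the target pixel, the restart parameter `alpha`, and the synthetic image, saliency map, and initial weights. The paper's actual graph construction, saliency detector, and segmentation are not reproduced here.

```python
import numpy as np

def grid_affinity(colors, saliency, sigma=1.0):
    """Directed affinity between 4-connected pixels: color similarity times target saliency."""
    h, wd = saliency.shape
    A = np.zeros((h * wd, h * wd))
    for r in range(h):
        for c in range(wd):
            i = r * wd + c
            for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < wd:
                    j = rr * wd + cc
                    color_sim = np.exp(-((colors[r, c] - colors[rr, cc]) ** 2) / sigma)
                    # Edges pointing at salient pixels are stronger (small offset avoids zero rows).
                    A[i, j] = color_sim * (saliency[rr, cc] + 1e-6)
    return A

def propagate(w0, A, alpha=0.8, n_iter=100):
    """Random walk with restart: x <- alpha * x P + (1 - alpha) * w0."""
    P = A / A.sum(axis=1, keepdims=True)  # row-stochastic transition matrix
    x = w0.copy()
    for _ in range(n_iter):
        x = alpha * (x @ P) + (1.0 - alpha) * w0
    return x

# Synthetic 6x6 grayscale image: the object occupies columns 2..5 and is salient.
colors = np.zeros((6, 6)); colors[:, 2:] = 1.0
saliency = np.full((6, 6), 0.05); saliency[:, 2:] = 1.0

# Hypothetical learned common weights, misaligned: they cover columns 0..3.
w0_grid = np.zeros((6, 6)); w0_grid[:, :4] = 1.0
w0 = (w0_grid / w0_grid.sum()).ravel()

adapted = propagate(w0, grid_affinity(colors, saliency)).reshape(6, 6)
```

In this toy setting the total weight is preserved, and the propagation moves weight from the background columns into the salient object region, including object pixels that the misaligned initial weights did not cover.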
Building on the GLP framework originally developed for image classification [10], we further consider the misalignment problem and propose the robust GLP (RGLP) algorithm, which can then be applied to several applications including image classification and action recognition. To this end, the introduction, the experiments, and other related parts are extended accordingly. Our experimental results show that the proposed robust geometric ℓp-norm pooling scheme is insensitive to a moderate level of image foreground misalignment. To sum up, the proposed robust geometric ℓp-norm pooling framework possesses the following advantages:
- As the pooling function is learned by directly maximizing the class separability, it is designed to have good discriminating capability.
- The pooling function corresponds to the class-specific spatial pattern of each visual word, so the spatial distribution information of visual words is properly utilized.
- It models the correlations among local features and makes a more reasonable assumption about the feature distribution. It also naturally unifies average and max pooling in a more flexible framework.
- Using the simple random walk based self-alignment module, the learned pooling weights can be tailored to individual images according to the object (foreground) position and scale, as well as the image segmentation results. The object (foreground) misalignment problem is therefore alleviated, the resulting image representation is more robust, and the classification performance is further boosted.
The remainder of this paper is organized as follows. The related literature is discussed in Section 2. Section 3 then elaborates on the geometric ℓp-norm feature pooling method and provides the theoretical comparison with the max and average pooling methods. An iterative optimization procedure for learning the discriminative pooling weights is also presented. In Section 5 we introduce the random walk based self-alignment method to alleviate the image misalignment problem, which results in an image-specific adaptive pooling scheme. In Section 6 extensive experimental results on benchmarks are presented and conclusions are drawn in Section 7.
Section snippets
Related work
The idea of feature pooling originates in the research on complex cells in the striate cortex [20]. The authors of [20] proposed a model in which the responses of simple cells are fed into higher complex cells through pooling operations, thereby endowing the complex cells with phase invariance. Inspired by this seminal work, several extensions of the pooling mechanism have since been proposed and widely applied in recent visual recognition systems. In the neocognitron model [12]
Geometric ℓp-norm feature pooling
The pipeline of a popular image classification procedure is shown in Fig. 2. As can be seen from the figure, a multi-stage image classification architecture generally comprises four components. After local features are extracted from the input image, many methods can be used to encode the feature vectors.
The first two building blocks are feature extraction and encoding. We assume that there are nc image classes, and the class index set is denoted as . Additionally, we denote the
Class separability
To determine the parameters in the GLP, we adopt the class separability as the objective function and optimize it with respect to both w and p. A practical choice of the class separability criterion is the marginal Fisher analysis (MFA) developed in [53]. MFA can well characterize the class separability of the data with more general distributions beyond the Gaussian distribution. More specifically, the objective function is to maximize the inter-class separability scaled by the within-class
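A simplified, MFA-style separability score can be sketched as follows. The version below is an assumption-laden simplification and not the exact criterion of [53] or of this paper: within-class compactness sums the squared distances from each sample to its `k1` nearest same-class neighbors, between-class separability sums the squared distances from each sample to its `k2` nearest other-class neighbors, and the score is their ratio. The function name, the neighbor counts, and the synthetic data are hypothetical.

```python
import numpy as np

def mfa_style_criterion(X, y, k1=3, k2=3):
    """Ratio of between-class margin distances to within-class neighbor distances."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # pairwise squared distances
    same = y[:, None] == y[None, :]
    np.fill_diagonal(same, False)                            # a sample is not its own neighbor
    diff = y[:, None] != y[None, :]
    within, between = 0.0, 0.0
    for i in range(len(y)):
        within += np.sort(d2[i, same[i]])[:k1].sum()         # k1 nearest same-class samples
        between += np.sort(d2[i, diff[i]])[:k2].sum()        # k2 nearest other-class samples
    return between / within

rng = np.random.default_rng(0)
noise = 0.5 * rng.normal(size=(40, 2))
y = np.repeat([0, 1], 20)
shift = y[:, None] * np.array([1.0, 0.0])  # class 1 is shifted along the first axis

score_near = mfa_style_criterion(noise + 0.5 * shift, y)  # heavily overlapping classes
score_far = mfa_style_criterion(noise + 5.0 * shift, y)   # well separated classes
```

A pooling function that produces better separated pooled features would obtain a higher score under such a criterion, which is the quantity an optimizer over w and p would try to increase.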
Robust adaptive pooling for misaligned image
It is notable that there exist variations of object position/scale in images. However, the discriminative pooling function derived above assumes roughly aligned object foreground region and is not adaptive to the change of object position/scale in testing images. It is therefore preferable to have an adaptive feature pooling scheme where the discriminative pooling function can be tailored for individual images and thus the pooled image representation is robust to misalignment of foreground. In
Experiments
In this section, we evaluate the performance of the proposed GLP method as well as its enhanced version RGLP handling misalignment and compare it with the state-of-the-art average and max pooling methods. First, we investigate the separability of the pooling results produced by GLP and the other two methods on a synthesized dataset, which possesses distinctive spatial distribution patterns for different classes. Then we evaluate GLP and RGLP along with the average and max pooling on real-world
Conclusion
In this work, we first proposed a geometric ℓp-norm pooling (GLP) method to perform feature pooling. Different from traditional feature pooling methods, e.g. the average and max pooling, the GLP method can utilize the geometric information of the feature spatial distributions and thus provide more discriminative pooling results. Second, we proposed a simple yet effective random walk based image self-alignment step to alleviate the foreground misalignment issue in geometric ℓp-norm feature
Acknowledgments
This work is supported by the National Natural Science Foundation (NSF) of China (No. 61572029, No. 61300056), and the Science and Technology Project of Anhui Province (No. 1501b042207).
References (63)
- Neocognitron: a new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recogn. (1982)
- Frequency-tuned salient region detection
- SLIC Superpixels Compared to State-of-the-art Superpixel Methods. IEEE Trans. Pattern Anal. Mach. Intell. (2012)
- Smooth sparse coding via marginal regression for learning sparse representations
- In defense of Nearest-Neighbor based image classification
- Image Classification using Random Forests and Ferns
- A Theoretical Analysis of Feature Pooling in Visual Recognition
- BiCoS: a bi-level co-segmentation method for image classification
- Hierarchical Matching with Side Information for Image Classification
- Recognizing human actions in still images: a study of bag-of-features and part-based representations
- Geometric Lp-norm feature pooling for image classification
- The Use of Multiple Measurements in Taxonomic Problems. Ann. Eugen.
- Local features are not lonely - Laplacian sparse coding for image classification
- On feature combination for multiclass object classification
- Kernel Codebooks for Scene Categorization
- Caltech-256 Object Category Dataset
- Observing human–object interactions: using spatial and functional compatibility for recognition. IEEE T. Pattern Anal.
- Graph-Based Visual Saliency
- Saliency detection: a spectral residual approach
- Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol.
- A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell.
- Fast image search for learned metrics
- What is the best multi-stage architecture for object recognition?
- Object bank: a high-level image representation for scene classification and semantic feature sparsification
- Robust classification of objects, faces, and flowers using natural image statistics
- Supervised Learning of Quantizer Codebooks by Information Loss Minimization. IEEE Trans. Pattern Anal. Mach. Intell.
- Beyond bags of features: spatial pyramid matching for recognizing natural scene categories
- Handwritten digit recognition with a back-propagation network
- Object recognition as ranking holistic figure-ground hypotheses
- Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories
- A Bayesian hierarchical model for learning natural scene categories
* This paper has been recommended for acceptance by Ling Shao.