Image classification using spatial pyramid robust sparse coding

doi:10.1016/j.patrec.2013.02.013

Pattern Recognition Letters

Volume 34, Issue 9, 1 July 2013, Pages 1046-1052

https://doi.org/10.1016/j.patrec.2013.02.013

Abstract

Recently, sparse coding based codebook learning and local feature encoding have been widely used for image classification. The sparse coding model implicitly assumes that the reconstruction error follows a Gaussian or Laplacian distribution, which may not be accurate enough. Besides, ignoring spatial information during the local feature encoding process also hinders the final image classification performance. To address these obstacles, we propose a new image classification method based on spatial pyramid robust sparse coding (SP-RSC). Robust sparse coding seeks the maximum likelihood estimation solution by alternately optimizing over the codebook and the local feature coding parameters, and hence is more robust to outliers than traditional sparse coding based methods. Additionally, we adopt the robust sparse coding technique to encode visual features with the spatial constraint: local features from the same spatial sub-region of images are collected to generate the visual codebook and encode local features. In this way, we are able to generate more discriminative codebooks and encoding parameters, which eventually helps to improve the image classification performance. Experiments on the Scene 15 dataset and the Caltech 256 dataset demonstrate the effectiveness of the proposed spatial pyramid robust sparse coding method.

Highlights

► We propose a new image classification method based on spatial pyramid robust sparse coding.
► Images are spatially partitioned into sub-regions for codebook generation and local feature encoding.
► We alternately optimize over the codebook and encoding parameters by maximum likelihood.
► We achieve performance comparable to other methods on two public datasets.

Introduction

In recent years, the bag-of-visual-words (BoW) model has become popular in image classification. This model extracts appearance descriptors from local patches and quantizes them into discrete “visual words”; each image is then represented by a compact histogram of these words. The descriptive power of the BoW model is severely limited because it discards the spatial information of local descriptors. To overcome this problem, Lazebnik et al. (2006) proposed a popular extension called spatial pyramid matching (SPM), which has been shown to be effective for image classification. The SPM partitions an image into several segments at different scales, computes the BoW histogram within each segment, and concatenates all the histograms to form a high-dimensional vector representation of the image.
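The concatenation step described above can be illustrated with a minimal sketch. It assumes that every local patch has already been assigned a visual-word index, and that level `l` of the pyramid splits the image into `2**l × 2**l` cells. The function name, its parameters and the cell layout are hypothetical choices for illustration, and the per-level weighting used by Lazebnik et al. (2006) is omitted.

```python
def spatial_pyramid_histogram(points, words, width, height, vocab_size, levels=3):
    """Concatenate per-cell visual-word histograms over a spatial pyramid.

    points: list of (x, y) patch centres; words: visual-word index per patch.
    Level l splits the image into 2**l x 2**l cells (l = 0 is the whole image).
    Assumption: no per-level weighting is applied.
    """
    feature = []
    for level in range(levels):
        cells = 2 ** level
        hists = [[0] * vocab_size for _ in range(cells * cells)]
        for (x, y), w in zip(points, words):
            # Clamp so that points on the right/bottom border fall in the last cell.
            cx = min(int(x * cells / width), cells - 1)
            cy = min(int(y * cells / height), cells - 1)
            hists[cy * cells + cx][w] += 1
        for h in hists:
            feature.extend(h)
    return feature
```

With a vocabulary of size K and L levels, the resulting vector has K · (1 + 4 + … + 4^(L−1)) entries, which is why the representation becomes high-dimensional quickly.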

To obtain good performance, researchers have empirically found that the SPM should be used together with an SVM classifier using nonlinear Mercer kernels. However, training such a classifier has computational complexity $O(n^{3})$ and memory complexity $O(n^{2})$, where $n$ is the size of the training dataset. This constrains the scalability of the SPM-based nonlinear SVM method. To reduce the training complexity and improve image classification performance, sparse coding based linear spatial pyramid matching methods (Yang et al., 2009, Serre et al., 2005, Wang et al., 2010) have been proposed. However, Yang et al. (2009) and Wang et al. (2010) neglect another constraint, namely the spatial locality constraint. For example, ‘sky’ often lies on the upper side of images, while ‘beach’ often lies on the lower side of images. When we try to encode an image region showing the upper ‘sky’, it is more semantically meaningful to use the bases which are generated from the local features on the upper side of images. Similarly, it is more meaningful to encode the lower ‘beach’ with the bases generated from the local features on the lower side of images. We believe this spatial information should be combined with the codebook generation in order to encode local features more efficiently.

Besides, the sparse coding used in Yang et al. (2009) for local feature encoding minimizes the reconstruction error of local features by learning the optimal codebook and coding parameters simultaneously under sparsity constraints. After the codebook is learned, the remaining local features are encoded by minimizing the reconstruction error with the learnt codebook and sparsity constraints. To ensure the smoothness of the coding parameters and reduce information loss during encoding, Laplacian sparse coding and non-negative sparse coding were proposed by Gao et al. (2010) and Zhang et al. (2011), respectively. However, these sparse coding models assume that the reconstruction error follows a Gaussian or Laplacian distribution, which may not hold in real-world applications. It would be more effective to construct a model that is more robust than one that simply assumes a Gaussian or Laplacian distribution of the reconstruction error.
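For reference, the sparse coding objective discussed above is commonly written as follows. This is a sketch in generic notation: the symbols $x_i$ (local feature), $B$ (codebook with atoms $b_k$), $\alpha_i$ (coding parameters) and $\lambda$ (sparsity weight) are assumptions and may differ from the notation of the full paper.

$$
\min_{B,\{\alpha_i\}} \sum_{i=1}^{N} \left( \| x_i - B\alpha_i \|_2^2 + \lambda \| \alpha_i \|_1 \right) \quad \text{s.t.} \quad \| b_k \|_2 \le 1 \;\; \forall k
$$

The link to the error distribution comes from maximum likelihood. Writing $e = x - B\alpha$ and assuming independent, identically distributed entries $e_j$:

$$
p(e_j) \propto \exp\!\left(-\frac{e_j^2}{2\sigma^2}\right) \;\Rightarrow\; -\log p(e) = \frac{1}{2\sigma^2}\|e\|_2^2 + \text{const},
\qquad
p(e_j) \propto \exp\!\left(-\frac{|e_j|}{b}\right) \;\Rightarrow\; -\log p(e) = \frac{1}{b}\|e\|_1 + \text{const}.
$$

A squared-error fidelity term therefore corresponds to Gaussian residuals and an absolute-error term to Laplacian residuals. When the actual residuals follow neither distribution, for instance in the presence of outliers, the corresponding estimate is no longer the maximum likelihood solution.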

In this paper, we present a novel image classification method using spatial pyramid robust sparse coding (SP-RSC). The flowchart is given in Fig. 1. We first partition images into sub-regions on multiple scales. Then we adopt the robust sparse coding approach to generate the codebook and encode local features of images with the spatial constraint. Different from SPM (Lazebnik et al., 2006), the proposed SP-RSC generates the visual vocabulary and the encoding results separately for sub-regions that share the same spatial location and segmentation scale, and then concatenates these encoding results. For the robust sparse coding, we adopt the maximum likelihood estimation (MLE) approach and minimize a function of the coding residuals. This function is associated with the distribution of the coding residuals, so that the given local feature is robustly encoded with sparse regression coefficients. Experimental evaluations on two public datasets demonstrate the effectiveness of the proposed method.
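To make the idea of minimizing a function of the coding residuals concrete, the following is a minimal sketch of one common way to do it: iteratively reweighted sparse coding with a fixed codebook. It is not taken from the paper. The Huber-style weight function, the ISTA inner solver, and all names and parameters (`robust_sparse_code`, `lam`, `delta`, `n_outer`) are assumptions made for illustration, and the codebook update that the paper alternates with is left out.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |v|."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def residual(x, atoms, alpha):
    """x - B @ alpha, with the codebook B stored as a list of atoms (columns)."""
    return [xj - sum(a[j] * c for a, c in zip(atoms, alpha))
            for j, xj in enumerate(x)]


def weighted_sparse_code(x, atoms, w, lam, n_iter=200):
    """ISTA for min_alpha 0.5 * sum_j w_j * e_j**2 + lam * ||alpha||_1."""
    # The trace of B^T W B upper-bounds the Lipschitz constant of the smooth term.
    step = 1.0 / (sum(wj * a[j] ** 2 for a in atoms for j, wj in enumerate(w)) + 1e-12)
    alpha = [0.0] * len(atoms)
    for _ in range(n_iter):
        e = residual(x, atoms, alpha)
        # Gradient of the smooth term w.r.t. alpha_k is -sum_j w_j * e_j * B[j][k].
        grad = [-sum(wj * ej * a[j] for j, (wj, ej) in enumerate(zip(w, e)))
                for a in atoms]
        alpha = [soft_threshold(c - step * g, step * lam)
                 for c, g in zip(alpha, grad)]
    return alpha


def robust_sparse_code(x, atoms, lam=0.01, delta=0.5, n_outer=5):
    """Alternate between re-solving the code and re-estimating residual weights."""
    w = [1.0] * len(x)
    for _ in range(n_outer):
        alpha = weighted_sparse_code(x, atoms, w, lam)
        e = residual(x, atoms, alpha)
        # Huber-style weights (an assumption): large residuals are down-weighted.
        w = [1.0 if abs(ej) <= delta else delta / abs(ej) for ej in e]
    return alpha, w
```

In this sketch, a feature entry corrupted by an outlier ends up with a small weight, so it pulls the code far less than it would under a plain squared-error objective. Other weight functions can be substituted in the last step of the loop without changing the rest.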

Compared with our previous work (Zhang et al., 2010), we extend spatial pyramid coding by using robust sparse coding instead of sparse coding for both codebook construction and local feature encoding. Sparse coding assumes that the reconstruction error follows a Gaussian or Laplacian distribution, whereas robust sparse coding makes no such assumption and hence helps to encode the local features more efficiently. Besides, more experiments are added to verify the effectiveness of the proposed spatial pyramid robust sparse coding method.

The rest of the paper is organized as follows. Section 2 gives an overview of some related work. In Section 3, we present the details of the proposed spatial pyramid robust sparse coding method. Experimental results and analysis are given in Section 4. Finally, we give the conclusions in Section 5.

Section snippets

Related work

The bag-of-visual-words (BoW) model has been widely used due to its simplicity and good performance. Much work has been done over the past few years to improve the performance of the traditional bag-of-visual-words model. Some studies are devoted to learning discriminative visual vocabularies for object recognition (Perronnin et al., 2006, Jurie and Triggs, 2005, Moosmann et al., 2008). Perronnin et al. (2006) used the Gaussian Mixture Model (GMM) to perform clustering. To alleviate the drawback

Spatial pyramid robust sparse coding for image classification

In this section, we give the details of the proposed spatial pyramid robust sparse coding method for image classification. For each image, we first densely extract local image features and then utilize the spatial pyramid principle to encode local features with robust sparse coding. Then we concatenate the BoW representations of the different segments as the final image representation. Fig. 1 shows the flowchart of the proposed spatial pyramid robust sparse coding method for image classification.
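The pipeline in this section can be sketched as below, with one codebook per sub-region. The sketch makes several assumptions that are not from the paper: codebooks are stored in a dictionary keyed by a hypothetical `(level, row, col)` tuple, level `l` has `2**l × 2**l` cells, codes are sum-pooled per sub-region, and a nearest-atom one-hot encoder stands in for robust sparse coding to keep the example short.

```python
def nearest_atom_code(feature, codebook):
    """Stand-in encoder: one-hot code of the closest atom (not robust sparse coding)."""
    dists = [sum((f - a) ** 2 for f, a in zip(feature, atom)) for atom in codebook]
    code = [0.0] * len(codebook)
    code[dists.index(min(dists))] = 1.0
    return code


def region_key(x, y, width, height, level):
    """Identify the sub-region (level, row, col) that contains point (x, y)."""
    cells = 2 ** level
    col = min(int(x * cells / width), cells - 1)
    row = min(int(y * cells / height), cells - 1)
    return (level, row, col)


def encode_image(points, features, width, height, codebooks, levels=2):
    """Encode each feature with its own sub-region's codebook, pool, concatenate."""
    pooled = {key: [0.0] * len(cb) for key, cb in codebooks.items()}
    for level in range(levels):
        for (x, y), f in zip(points, features):
            key = region_key(x, y, width, height, level)
            code = nearest_atom_code(f, codebooks[key])
            # Sum pooling (an assumption) yields a per-region histogram.
            pooled[key] = [p + c for p, c in zip(pooled[key], code)]
    out = []
    for key in sorted(pooled):
        out.extend(pooled[key])
    return out
```

The point of the sketch is the lookup `codebooks[key]`: a feature in the upper part of the image is encoded only with atoms learned from the upper parts of the training images, which is the spatial constraint the method relies on.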

Experiments

We evaluate the proposed spatial pyramid robust sparse coding method on the fifteen natural scene dataset (Scene 15) provided by Lazebnik et al. (2006) and the Caltech 256 dataset by Griffin et al. (2007). All processing is performed on grayscale images, even when color images are available. As to the feature extraction, we follow Lazebnik et al. (2006) and densely compute SIFT descriptors on overlapping 16 × 16 pixel patches with an overlap of 8 pixels. Each local feature is normalized with the $L_{2}$
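The dense sampling layout and the normalization mentioned above can be sketched as follows. The sketch assumes that an overlap of 8 pixels between 16 × 16 patches means a sampling step of 8 pixels, and that the normalization scales each descriptor to unit Euclidean length. The function names are hypothetical and the SIFT computation itself is not shown.

```python
import math


def dense_patch_origins(width, height, patch=16, step=8):
    """Top-left corners of patch x patch windows sampled every `step` pixels."""
    xs = range(0, width - patch + 1, step)
    ys = range(0, height - patch + 1, step)
    return [(x, y) for y in ys for x in xs]


def l2_normalize(vec, eps=1e-12):
    """Scale a descriptor to unit Euclidean length (zero vectors stay zero)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]
```

For example, a 32 × 32 image yields a 3 × 3 grid of patches under these settings, with origins at 0, 8 and 16 along each axis.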

Conclusions

This paper proposed a novel image classification method by spatial pyramid robust sparse coding. We first partition images into sub-regions on multiple scales. Then we use robust sparse coding to generate the codebook and encode local features per sub-region. Besides, we use the robust sparse coding technique to encode visual features with the spatial constraint. Local features from the same spatial sub-region of images are collected to generate the visual codebook for this sub-region. We

Acknowledgement

This work is supported by the Open Project Program of the National Laboratory of Pattern Recognition (NLPR), China Postdoctoral Science Foundation: 2012M520434, National Basic Research Program of China (973 Program): 2012CB316400, National Natural Science Foundation of China: 61025011, 61272329, 61202325.

References (31)

Boiman, O., Shechtman, E., Irani, M., 2008. In defense of nearest-neighbor based image classification. In: Proc....
Bosch, A., et al., 2008. Scene classification using a hybrid generative/discriminative approach. IEEE Trans. Pattern Anal. Machine Intell.
Boureau, Y-Lan, Bach, Francis, LeCun, Yann, Ponce, Jean, 2010. Learning mid-level features for recognition. In: Proc....
Chatfield, K., Lempitsky, V., Vedaldi, A., Zisserman, A., 2011. The devil is in the details: an evaluation of recent...
Fei-Fei, L., Perona, P., 2005. A Bayesian hierarchical model for learning natural scene categories. In: Proc....
Fei-Fei, L., Fergus, R., Perona, P., 2004. Learning generative visual models from few training examples: an incremental...
Gao, S.H., Tsang, I.W.H., Chia, L., Zhao, P., 2010. Local features are not lonely-Laplacian sparse coding for image...
Gemert, J., et al., 2010. Visual word ambiguity. IEEE Trans. Pattern Anal. Machine Intell.
Grauman, K., Darrell, T., 2005. The pyramid match kernel: discriminative classification with sets of image features....
Griffin, G., Holub, A., Perona, P., 2007. Caltech-256 object category dataset. Technical report,...

Huang, J., Huang, X., Metaxas, D., 2008. Simultaneous image transformation and sparse representation recovery. In:...

Jurie, F., Triggs, B., 2005. Creating efficient codebooks for visual recognition. In: Proc. ICCV, pp....

Kim, B., Park, J., Gilbert, A., Savarese, S., 2011. Hierarchical classification of images by sparse approximation. In:...

Lazebnik, S., Schmid, C., Ponce, J., 2006. Beyond bags of features: spatial pyramid matching for recognizing natural...

Liu, J., Shah, M., Scene modeling using co-clustering. In: Proc....
