Elsevier

Pattern Recognition Letters

Volume 34, Issue 9, 1 July 2013, Pages 1046-1052
Image classification using spatial pyramid robust sparse coding

https://doi.org/10.1016/j.patrec.2013.02.013

Abstract

Recently, sparse coding based codebook learning and local feature encoding have been widely used for image classification. The sparse coding model implicitly assumes that the reconstruction error follows a Gaussian or Laplacian distribution, which may not be accurate enough. In addition, neglecting spatial information during local feature encoding hinders the final image classification performance. To address these obstacles, we propose a new image classification method based on spatial pyramid robust sparse coding (SP-RSC). Robust sparse coding seeks the maximum likelihood estimation solution by alternately optimizing over the codebook and the local feature coding parameters, and is therefore more robust to outliers than traditional sparse coding based methods. Additionally, we adopt the robust sparse coding technique to encode visual features under a spatial constraint: local features from the same spatial sub-region of images are collected to generate the visual codebook and to encode local features. In this way, we generate more discriminative codebooks and encoding parameters, which eventually helps to improve image classification performance. Experiments on the Scene 15 dataset and the Caltech 256 dataset demonstrate the effectiveness of the proposed spatial pyramid robust sparse coding method.

Highlights

► We propose a new image classification method based on spatial pyramid robust sparse coding.
► Images are spatially partitioned into sub-regions for codebook generation and local feature encoding.
► We alternately optimize over the codebook and the encoding parameters by maximum likelihood.
► We achieve performance comparable with other methods on two public datasets.

Introduction

In recent years, the bag-of-visual-words (BoW) model has become popular in image classification. This model extracts appearance descriptors from local patches and quantizes them into discrete “visual words”; a compact histogram of these words is then used to represent the image. The descriptive power of the BoW model is severely limited because it discards the spatial information of local descriptors. To overcome this problem, a popular extension called spatial pyramid matching (SPM) was proposed by Lazebnik et al. (2006) and has been shown to be effective for image classification. SPM partitions an image into several segments at different scales, computes the BoW histogram within each segment, and concatenates all the histograms to form a high-dimensional vector representation of the image.
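
The SPM representation described above can be sketched in a few lines. This is a minimal illustration, not the implementation of Lazebnik et al. (2006): the function name, its arguments, and the unweighted concatenation are assumptions made here, and the per-level weights of the original pyramid match kernel are omitted.

```python
import numpy as np

def spatial_pyramid_histogram(xy, words, width, height, vocab_size, levels=3):
    """Concatenate per-cell BoW histograms over 1x1, 2x2, 4x4, ... grids.

    xy:    (N, 2) patch centre coordinates (x, y) in pixels.
    words: (N,) visual-word indices in [0, vocab_size).
    Hypothetical helper for illustration; no level weighting is applied.
    """
    xy = np.asarray(xy, dtype=float)
    words = np.asarray(words, dtype=int)
    blocks = []
    for level in range(levels):
        cells = 2 ** level
        # Map each coordinate to a grid cell, clipping points on the far edge.
        cx = np.minimum((xy[:, 0] / width * cells).astype(int), cells - 1)
        cy = np.minimum((xy[:, 1] / height * cells).astype(int), cells - 1)
        cell_id = cy * cells + cx
        # Joint (cell, word) counts computed with a single bincount.
        flat = cell_id * vocab_size + words
        hist = np.bincount(flat, minlength=cells * cells * vocab_size)
        blocks.append(hist.astype(float))
    return np.concatenate(blocks)
```

With a vocabulary of size K and L levels, the output has K(4^L - 1)/3 entries, which is why the concatenated vector becomes high-dimensional quickly.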

To obtain good performance, researchers have empirically found that SPM should be used together with an SVM classifier using nonlinear Mercer kernels. However, in the training phase the computational complexity is O(n³) and the memory complexity is O(n²), where n is the size of the training dataset. This constrains the scalability of the SPM-based nonlinear SVM method. To reduce the training complexity and improve image classification performance, sparse coding based linear spatial pyramid matching methods (Yang et al., 2009, Serre et al., 2005, Wang et al., 2010) have been proposed. However, one constraint was neglected by Yang et al. (2009) and Wang et al. (2010): the spatial locality constraint. For example, ‘sky’ often lies on the upper side of images, while ‘beach’ often lies on the lower side. When we encode an image region showing the upper ‘sky’, it is more semantically meaningful to use bases generated from local features on the upper side of images. Similarly, it is more meaningful to encode the lower ‘beach’ with bases generated from local features on the lower side of images. We believe this spatial information should be combined with codebook generation in order to encode local features more efficiently.

Besides, the sparse coding used by Yang et al. (2009) for local feature encoding minimizes the reconstruction error of local features by learning the codebook and the coding parameters simultaneously under sparsity constraints. After the codebook is learned, the remaining local features are encoded by minimizing the reconstruction error with the learnt codebook and the sparsity constraints. To ensure smoothness of the coding parameters and to reduce information loss during encoding, Laplacian sparse coding and non-negative sparse coding were proposed by Gao et al. (2010) and Zhang et al. (2011), respectively. However, these sparse coding models assume that the reconstruction error follows a Gaussian or Laplacian distribution, which may not hold in real-world applications. A model that does not simply assume a Gaussian or Laplacian reconstruction error would be more robust and more effective.
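
For reference, the sparse coding objective described above is commonly written in the following form. The notation is assumed here rather than taken from the text: x_i denotes a local feature, B the codebook with columns b_k, c_i the code of x_i, and λ the sparsity weight.

```latex
\min_{B,\,\{c_i\}} \; \sum_{i=1}^{N} \left( \lVert x_i - B c_i \rVert_2^2 + \lambda \lVert c_i \rVert_1 \right)
\quad \text{s.t.} \quad \lVert b_k \rVert_2 \le 1, \; k = 1, \dots, K
```

Minimizing the squared ℓ2 residual is equivalent to maximum likelihood estimation under i.i.d. Gaussian residuals, and replacing it with an ℓ1 residual corresponds to Laplacian residuals. This is the distributional assumption that the paragraph above questions.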

In this paper, we present a novel image classification method using spatial pyramid robust sparse coding (SP-RSC). The flowchart is given in Fig. 1. We first partition images into sub-regions on multiple scales. We then adopt the robust sparse coding approach to generate the codebook and encode the local features of images under the spatial constraint. Different from SPM (Lazebnik et al., 2006), the proposed SP-RSC representation is formed by concatenating the encoding results of the sub-regions, where each sub-region is encoded with a codebook generated from sub-regions that have the same spatial locality and segmentation scale. For the robust sparse coding, we adopt the maximum likelihood estimation (MLE) approach and minimize a function of the coding residuals. This function is associated with the distribution of the coding residuals, so the given local feature is robustly encoded with sparse regression coefficients. Experimental evaluations on two public datasets demonstrate the effectiveness of the proposed method.
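
One common way to realize an MLE-style robust encoder is iteratively reweighted sparse coding: residual dimensions that fit poorly receive small weights, and a weighted ℓ1-regularized least-squares problem is re-solved. The sketch below illustrates that idea for a single feature and a fixed codebook only. The logistic weight function, the median-based threshold, the ISTA inner solver, and all names and default values are assumptions made for illustration; the text above does not specify them, and the codebook update step is not shown.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def weighted_lasso_ista(x, B, w, lam, n_iter=200):
    """Minimise 0.5 * sum_j w_j (x_j - (B c)_j)^2 + lam * ||c||_1 with ISTA."""
    Bw = B * np.sqrt(w)[:, None]          # scale rows by sqrt of the weights
    xw = x * np.sqrt(w)
    step = 1.0 / (np.linalg.norm(Bw, 2) ** 2 + 1e-12)  # 1 / Lipschitz constant
    c = np.zeros(B.shape[1])
    for _ in range(n_iter):
        grad = Bw.T @ (Bw @ c - xw)
        c = soft_threshold(c - step * grad, step * lam)
    return c

def robust_sparse_code(x, B, lam=0.01, n_outer=10, scale=8.0):
    """Iteratively reweighted sparse coding of x over a fixed codebook B (D x K)."""
    w = np.ones_like(x, dtype=float)
    for _ in range(n_outer):
        c = weighted_lasso_ista(x, B, w, lam)
        r2 = (x - B @ c) ** 2
        delta = np.median(r2) + 1e-12     # assumed threshold on squared residuals
        # Assumed logistic weights: near 1 for small residuals, near 0 for large.
        z = np.clip(scale * (r2 / delta - 1.0), -50.0, 50.0)
        w = 1.0 / (1.0 + np.exp(z))
    return c, w
```

With all weights fixed to one, the inner problem reduces to ordinary sparse coding, so the reweighting loop is the only part that departs from the Gaussian residual assumption.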

Compared with our previous work (Zhang et al., 2010), we extend spatial pyramid coding by using robust sparse coding instead of sparse coding for both codebook construction and local feature encoding. Sparse coding assumes that the reconstruction error follows a Gaussian or Laplacian distribution, whereas robust sparse coding imposes no such constraint and hence helps to encode the local features more efficiently. In addition, more experiments are included to clarify the effectiveness of the proposed spatial pyramid robust sparse coding method.

The rest of the paper is organized as follows. Section 2 gives an overview of some related work. In Section 3, we present the details of the proposed spatial pyramid robust sparse coding method. Experimental results and analysis are given in Section 4. Finally, we give the conclusions in Section 5.

Section snippets

Related work

The bag-of-visual-words (BoW) model has been widely used due to its simplicity and good performance. Much work has been done over the past few years to improve the performance of the traditional bag-of-visual-words model. Some studies are devoted to learning discriminative visual vocabularies for object recognition (Perronnin et al., 2006, Jurie and Triggs, 2005, Moosmann et al., 2008). Perronnin et al. (2006) used the Gaussian Mixture Model (GMM) to perform clustering. To alleviate the drawback

Spatial pyramid robust sparse coding for image classification

In this section, we give the details of the proposed spatial pyramid robust sparse coding method for image classification. For each image, we first densely extract local image features and then use the spatial pyramid principle to encode them with robust sparse coding. We then concatenate the BoW representations of the different segments as the final image representation. Fig. 1 shows the flowchart of the proposed method.
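
The pipeline in this paragraph (encode each local feature with the codebook of its own sub-region, then concatenate the per-region results) might look roughly as follows. This is a sketch under stated assumptions, not the authors' implementation: the `encode` callback, the layout of `codebooks`, and the use of max pooling over absolute code values are all hypothetical choices made for the example.

```python
import numpy as np

def encode_image(features, xy, size, codebooks, encode):
    """Encode features with region-specific codebooks, pool, and concatenate.

    features:  (N, D) local descriptors; xy: (N, 2) patch centres (x, y).
    size:      (width, height) of the image in pixels.
    codebooks: codebooks[l] is a list of 4**l arrays of shape (D, K),
               one per cell of the 2**l x 2**l grid (row-major order).
    encode:    callable (x, B) -> length-K code; hypothetical encoder hook.
    """
    features = np.asarray(features, dtype=float)
    xy = np.asarray(xy, dtype=float)
    width, height = size
    pooled = []
    for level, level_books in enumerate(codebooks):
        cells = 2 ** level
        cx = np.minimum((xy[:, 0] / width * cells).astype(int), cells - 1)
        cy = np.minimum((xy[:, 1] / height * cells).astype(int), cells - 1)
        cell_id = cy * cells + cx
        for cell, B in enumerate(level_books):
            members = features[cell_id == cell]
            if len(members) == 0:
                pooled.append(np.zeros(B.shape[1]))   # empty region: all zeros
                continue
            codes = np.stack([encode(x, B) for x in members])
            pooled.append(np.abs(codes).max(axis=0))  # max pooling (assumed)
    return np.concatenate(pooled)
```

Any encoder with the signature `encode(x, B)` can be plugged in, so the same loop serves for ordinary sparse coding and for a robust variant.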

Experiments

We evaluate the proposed spatial pyramid robust sparse coding method on the fifteen natural scene categories dataset provided by Lazebnik et al. (2006) and the Caltech 256 dataset by Griffin et al. (2007). We process all images in grayscale, even when color images are available. For feature extraction, we follow Lazebnik et al. (2006) and densely compute SIFT descriptors on 16 × 16 pixel patches with an overlap of 8 pixels. Each local feature is normalized with the L2
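
To make the sampling scheme concrete, 16 × 16 patches with an overlap of 8 pixels correspond to a stride of 8 pixels. The helper below only enumerates patch positions; it is a hypothetical illustration, it does not compute SIFT descriptors, and it assumes that patches extending past the image border are simply skipped.

```python
import numpy as np

def dense_patch_grid(height, width, patch=16, overlap=8):
    """Top-left (row, col) corners of all full patches on a regular grid."""
    stride = patch - overlap              # 16 - 8 = 8 pixels between patches
    ys = np.arange(0, height - patch + 1, stride)
    xs = np.arange(0, width - patch + 1, stride)
    return [(int(y), int(x)) for y in ys for x in xs]
```

For example, a 32 × 48 image yields a 3 × 5 grid of patch positions, and an image smaller than one patch yields none.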

Conclusions

This paper proposes a novel image classification method based on spatial pyramid robust sparse coding. We first partition images into sub-regions on multiple scales. Then we use robust sparse coding to generate the codebook and encode local features per sub-region. In addition, we use the robust sparse coding technique to encode visual features under the spatial constraint. Local features from the same spatial sub-region of images are collected to generate the visual codebook for this sub-region. We

Acknowledgement

This work is supported by the Open Project Program of the National Laboratory of Pattern Recognition (NLPR), China Postdoctoral Science Foundation: 2012M520434, National Basic Research Program of China (973 Program): 2012CB316400, National Natural Science Foundation of China: 61025011, 61272329, 61202325.

References (31)

  • Boiman, O., Shechtman, E., Irani, M., 2008. In defense of nearest-neighbor based image classification. In: Proc....
  • Bosch, A., et al., 2008. Scene classification using a hybrid generative/discriminative approach. IEEE Trans. Pattern Anal. Machine Intell.
  • Boureau, Y-Lan, Bach, Francis, LeCun, Yann, Ponce, Jean, 2010. Learning mid-level features for recognition. In: Proc....
  • Chatfield, K., Lempitsky, V., Vedaldi, A., Zisserman, A., 2011. The devil is in the details: an evaluation of recent...
  • Fei-Fei, L., Perona, P., 2005. A Bayesian hierarchical model for learning natural scene categories. In: Proc....
  • Fei-Fei, L., Fergus, R., Perona, P., 2004. Learning generative visual models from few training examples: an incremental...
  • Gao, S.H., Tsang, I.W.H., Chia, L., Zhao, P., 2010. Local features are not lonely-Laplacian sparse coding for image...
  • Gemert, J., et al., 2010. Visual word ambiguity. IEEE Trans. Pattern Anal. Machine Intell.
  • Grauman, K., Darrell, T., 2005. The pyramid match kernel: discriminative classification with sets of image features....
  • Griffin, G., Holub, A., Perona, P., 2007. Caltech-256 object category dataset. Technical report,...
  • Huang, J., Huang, X., Metaxas, D., 2008. Simultaneous image transformation and sparse representation recovery. In:...
  • Jurie, F., Triggs, B., 2005. Creating efficient codebooks for visual recognition. In: Proc. ICCV, pp....
  • Kim, B., Park, J., Gilbert, A., Savarese, S., 2011. Hierarchical classification of images by sparse approximation. In:...
  • Lazebnik, S., Schmid, C., Ponce, J., 2006. Beyond bags of features: spatial pyramid matching for recognizing natural...
  • Liu, J., Shah, M., Scene modeling using co-clustering. In: Proc....