An efficient approach for scene categorization based on discriminative codebook learning in bag-of-words framework

https://doi.org/10.1016/j.imavis.2013.07.001

Abstract

This paper proposes an efficient technique for learning a discriminative codebook for scene categorization. A state-of-the-art approach to scene categorization is the Bag-of-Words (BoW) framework, in which codebook generation plays an important role in determining system performance. Traditionally, the codebook generation methods adopted in BoW techniques are designed to minimize quantization error rather than to optimize classification accuracy. This paper addresses this issue through careful design of the codewords, such that the resulting image histograms for each category retain strong discriminative power while online categorization of a test image remains as efficient as in the baseline BoW. The codewords are refined iteratively offline to improve their discriminative power. The proposed method is validated on the UIUC Scene-15 and NTU Scene-25 datasets and is shown to outperform other state-of-the-art codebook generation methods for scene categorization.

Introduction

Scene categorization is an important problem in pattern recognition and computer vision. It has many applications, such as keyword suggestion (offering semantic labels associated with the image content), retrieval (filtering images on the Internet based on keywords), and browsing (grouping images by keywords instead of clustering content features). In recent years, many automatic techniques for assigning semantic labels to images have been designed to improve the performance of these applications. Early work on scene categorization used low-level global features extracted from the whole image. However, these representations lacked local information and could only classify images into a small number of categories, such as indoor/outdoor and man-made/natural. More recent approaches exploit local statistics in images, often modeling scenes as collections of local descriptors obtained with interest point detectors, densely sampled patches, or segmentation. For example, bag-of-words models for image categorization (Csurka et al. [1], Sivic and Zisserman [2], Zhang et al. [3]) use the following steps: (i) quantizing high-dimensional descriptors of local image patches into discrete codewords, (ii) forming BoW histograms based on the codeword distribution, and (iii) training classifiers on these histograms. The BoW framework is widely used in multimedia categorization due to its effectiveness and high efficiency.
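As a concrete illustration of steps (i)–(iii), the following minimal sketch (with random data standing in for real SIFT descriptors; all names are illustrative) builds a codebook with k-means, forms L1-normalized BoW histograms, and trains a linear SVM:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

# Illustrative stand-in data: each image yields a set of local
# descriptors (e.g. 128-D SIFT); one (n_patches, 128) array per image.
rng = np.random.default_rng(0)
train_desc = [rng.normal(size=(200, 128)) for _ in range(40)]
train_labels = np.repeat(np.arange(4), 10)   # 4 toy categories

K = 50  # codebook size

# (i) Quantize descriptors: learn K codewords with k-means.
kmeans = KMeans(n_clusters=K, n_init=4, random_state=0)
kmeans.fit(np.vstack(train_desc))

def bow_histogram(desc, kmeans, K):
    """(ii) Hard-assign each patch to its nearest codeword and count."""
    words = kmeans.predict(desc)
    hist = np.bincount(words, minlength=K).astype(float)
    return hist / hist.sum()  # L1-normalize so image size does not matter

X = np.array([bow_histogram(d, kmeans, K) for d in train_desc])

# (iii) Train a classifier on the BoW histograms.
clf = LinearSVC().fit(X, train_labels)
```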

Clustering methods are often used to train codebooks, and among these, K-means is a popular choice for codebook learning in scene categorization. In conventional K-means clustering, the codebook is designed to minimize the expected distortion, an objective suited to compressing high-dimensional data. However, a codebook designed in this manner does not ensure that the codewords are effective for categorization. To train more effective classifiers on the histograms obtained from the codebook, the codewords should be designed to be discriminative among different categories.
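For concreteness, a k-means codebook $CB$ minimizes the total quantization distortion over all training descriptors $v_i$:

$$\min_{CB} \; \sum_{i} \min_{w \in CB} \lVert v_i - w \rVert^2,$$

an objective in which the category labels never appear, so nothing encourages the resulting codewords to separate classes.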

There are several existing supervised learning approaches to codebook generation. Kohonen [4] proposed Learning Vector Quantization for supervised quantizer design using Voronoi partitions based on self-organizing maps. Vogel and Schiele [5] presented a semantic vocabulary, constructed by labeling image patches with semantic concepts such as sky, water, or vegetation. Jurie and Triggs [6] combined online clustering with the mean-shift algorithm to generate a codebook for image classification. Zhang et al. [7] proposed a codebook generation method called Codebook+, which minimizes the ratio of within-class to between-class variation.

Winn et al. [8] describe each visual word by a mixture of Gaussians in feature space and construct a compact, lightweight vocabulary by greedily merging an initially large one. Specifically, by assuming a Gaussian distribution of image histograms, the probability that a histogram belongs to a certain object class can be estimated. The goal of the learning algorithm is then to find the mapping ϕ between histograms that maximizes the conditional probability of the ground-truth labels on the training data. The mapping ϕ produced by the algorithm is defined by the merging operations on visual words in the original vocabulary. By evaluating the conditional probability of the ground-truth labels under every possible grouping of cluster bins, the optimal mapping is chosen, yielding a more discriminative vocabulary.
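The following toy sketch conveys the flavor of this greedy merging. The scoring function here is a simplified per-class Gaussian log-likelihood stand-in, not Winn et al.'s exact conditional-probability objective, and the exhaustive pair search is only practical for small vocabularies:

```python
import numpy as np

def merge_score(H, y, n_classes):
    """Score a merged vocabulary: fit an axis-aligned Gaussian to each
    class's histograms and sum the log-likelihoods (up to constants)."""
    ll = 0.0
    for c in range(n_classes):
        Hc = H[y == c]
        mu, var = Hc.mean(0), Hc.var(0) + 1e-6   # guard zero variance
        ll += -0.5 * (np.log(var).sum() * len(Hc)
                      + (((Hc - mu) ** 2) / var).sum())
    return ll

def greedy_merge(H, y, n_classes, target_size):
    """Greedily merge the pair of histogram bins whose union scores
    best, until the vocabulary shrinks to target_size bins."""
    H = H.copy()
    while H.shape[1] > target_size:
        best, best_pair = -np.inf, None
        for i in range(H.shape[1]):
            for j in range(i + 1, H.shape[1]):
                merged = np.delete(H, [i, j], axis=1)
                merged = np.hstack([merged, (H[:, i] + H[:, j])[:, None]])
                s = merge_score(merged, y, n_classes)
                if s > best:
                    best, best_pair = s, (i, j)
        i, j = best_pair
        col = (H[:, i] + H[:, j])[:, None]
        H = np.hstack([np.delete(H, [i, j], axis=1), col])
    return H

# Toy demo: 60 images, 12-bin histograms, 3 classes, merge down to 6 bins.
rng = np.random.default_rng(1)
H = rng.random((60, 12)); H /= H.sum(1, keepdims=True)
y = np.repeat(np.arange(3), 20)
H6 = greedy_merge(H, y, n_classes=3, target_size=6)
```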

Perronnin [9] proposed characterizing an image by a set of histograms, one per class, where each histogram describes whether the image content is better modeled by a universal vocabulary or by the corresponding class vocabulary. Gaussian Mixture Models (GMMs) are constructed from the data samples: the universal vocabulary is trained using maximum likelihood estimation (MLE), and the class vocabularies are adapted from it using the maximum a posteriori (MAP) criterion. In the MAP adaptation, relevance factors for the Gaussian weight, mean, and variance parameters are added to the standard expectation-maximization procedure; the relevance factors are constrained to be equal and then tuned experimentally to maximize classification accuracy.
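A compact sketch of the adaptation step follows, adapting the means only (a common simplification; Perronnin also adapts the weights and variances). The relevance factor r and all data shapes here are illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
all_desc = rng.normal(size=(5000, 16))        # descriptors, all classes
class_desc = rng.normal(0.5, size=(800, 16))  # descriptors of one class

# Universal vocabulary: GMM trained by maximum likelihood (EM).
ugmm = GaussianMixture(n_components=8, covariance_type='diag',
                       random_state=0).fit(all_desc)

def map_adapt_means(ugmm, X, r=16.0):
    """MAP-adapt the universal GMM means toward class data X.
    Larger relevance factor r keeps the adapted means closer to
    the universal ones; small soft counts fall back to the prior."""
    resp = ugmm.predict_proba(X)             # (n, K) responsibilities
    n_k = resp.sum(axis=0)                   # soft counts per component
    xbar = resp.T @ X / np.maximum(n_k, 1e-9)[:, None]
    alpha = (n_k / (n_k + r))[:, None]       # data-vs-prior trade-off
    return alpha * xbar + (1 - alpha) * ugmm.means_

class_means = map_adapt_means(ugmm, class_desc)
```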

All the abovementioned schemes can be regarded as clustering image patch descriptors according to certain objective functions. In these methods the codebook is trained to be discriminative; however, the densities of the descriptors over the codewords are estimated by a histogram, which is essentially a nearest-neighbor quantization. Recently, van Gemert et al. [10] presented a codebook generation method based on kernel density estimation, with three soft-assignment criteria: kernel codebook, codeword plausibility, and codeword uncertainty. In their scheme, codeword uncertainty (UNC), based on kernel density estimation, shows consistently superior performance over the histogram-based BoW methods. However, with UNC the codewords are not trained to be discriminative, and generating an image histogram typically involves hundreds of thousands of evaluations of the Gaussian function. More computationally efficient methods therefore need to be investigated.
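The UNC soft assignment of [10] can be sketched as follows, assuming a Gaussian kernel of bandwidth sigma. Note that each image costs N × K kernel evaluations (patches times codewords), which is exactly the online overhead the histogram-based baseline avoids:

```python
import numpy as np

def unc_histogram(desc, codebook, sigma):
    """Codeword uncertainty (UNC): each patch distributes unit mass
    over all codewords in proportion to a Gaussian kernel of its
    distance to each codeword, and the masses are averaged per image."""
    # Squared distances between every patch and every codeword: (N, K)
    d2 = ((desc[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2.0 * sigma ** 2))     # Gaussian kernel responses
    K /= K.sum(axis=1, keepdims=True)        # normalize per patch
    return K.mean(axis=0)                    # average mass per codeword
```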

In this paper, we present a new, efficient method for iteratively refining discriminative codewords based on kernel density estimation. The proposed iterative clustering method is shown to be competitive with state-of-the-art algorithms. In addition, the proposed method imposes no extra online feature extraction cost compared to the baseline BoW. The main contribution of this paper is a new codebook generation method for scene categorization that evaluates the discriminative power of a codeword via kernel density estimation without increasing the computation needed for online categorization of test images. The codewords are tuned iteratively according to the soft relevance of image patch descriptors from different categories. Experimental results demonstrate its superiority in classification accuracy and efficiency compared to the histogram-based method and codeword uncertainty.
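Since the full refinement procedure is given in Section 3 (truncated in this preview), the following is only a speculative sketch of what such an iteration could look like, not the paper's actual algorithm: patches are softly assigned to codewords with a Gaussian kernel, each codeword is associated with its dominant category, and the codeword is then pulled toward that category's patches:

```python
import numpy as np

def refine_codebook(desc, labels, codebook, sigma, n_iter=10):
    """Illustrative guess at a discriminative refinement loop (NOT the
    paper's exact update): re-estimate each codeword as the weighted
    mean of the patches of its dominant category, with soft-relevance
    weights from a Gaussian kernel."""
    codebook = codebook.copy()
    n_classes = labels.max() + 1
    for _ in range(n_iter):
        d2 = ((desc[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        resp = np.exp(-d2 / (2 * sigma ** 2))
        resp /= resp.sum(axis=1, keepdims=True)    # (N, K) soft relevance
        for k in range(codebook.shape[0]):
            # Soft mass each category places on codeword k.
            mass = np.bincount(labels, weights=resp[:, k],
                               minlength=n_classes)
            dom = mass.argmax()                    # dominant category
            w = resp[:, k] * (labels == dom)       # keep only its patches
            if w.sum() > 0:
                codebook[k] = (w @ desc) / w.sum() # weighted mean update
    return codebook
```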

The rest of this paper is organized as follows. Section 2 gives an overview of our proposed method. Section 3 details the proposed codebook generation method together with the associated iterative optimization algorithm. Section 4 validates our method of generating effective codewords for bag-of-words image categorization against other state-of-the-art work. Finally, Section 5 concludes the paper.

Section snippets

Overview of the proposed scheme

The main idea of our paper is to find discriminative codewords, that is, to find codewords that are representative for a specific category and yet sufficiently discriminative from other categories. In order to achieve this goal, first we estimate the density distribution of image patch descriptors using the theory of kernel density estimation (KDE) instead of the conventional BoW histogram. Using these density distributions, we can evaluate the discriminative power of each codeword according to

The proposed codebook generation method

After clustering the SIFT descriptors of the training image patches, we obtain the codewords, which are the cluster centers. For each codeword $c$ in the codebook $CB$, the traditional codebook model estimates the distribution of codewords in an image by a histogram:

$$H(c) = \frac{1}{N} \sum_{i=1}^{N} I\!\left( c = \operatorname*{arg\,min}_{w \in CB} D(w, v_i) \right),$$

where $N$ is the number of patches in the image, $v_i$ is the descriptor of an image patch, $D(\cdot,\cdot)$ is the Euclidean distance, and $I(\cdot)$ is the indicator function.
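In code, this hard-assignment histogram is a direct transcription of the equation, with desc holding the N patch descriptors of one image:

```python
import numpy as np

def hard_histogram(desc, codebook):
    """H(c): fraction of patches whose nearest codeword
    (under Euclidean distance) is c."""
    d2 = ((desc[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)              # argmin_w D(w, v_i)
    H = np.bincount(nearest, minlength=len(codebook)).astype(float)
    return H / len(desc)                     # the 1/N normalization
```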

A robust alternative to histograms for estimating

Datasets

This section presents experimental evaluations on two datasets: the UIUC Scene-15 dataset [12] and the NTU Scene-25 dataset. The Scene-15 dataset is a common benchmark used by many other works for comparison. It consists of 4485 images from 15 scene categories and is quite challenging; for example, indoor categories such as bedroom and living room are difficult to distinguish. Sample images from the UIUC Scene-15 dataset are shown in Fig. 3.

In order to further validate the

Conclusion

This paper has considered the problem of discriminative codebook design for improving classification accuracy in bag-of-words scene categorization. A kernel density estimator is used to refine codewords iteratively, yielding more discriminative codewords that enhance classification. Experimental results show that the proposed method achieves better classification performance and lower online computational cost than other state-of-the-art methods.

Acknowledgment

We would like to thank Dr. Jan C. van Gemert for kindly providing the source code of his algorithm. This work is supported by the Agency for Science, Technology and Research (A*STAR), Singapore, under SERC grant 062 130 0055.

References (13)

  • G. Csurka et al., Visual categorization with bags of keypoints
  • J. Sivic et al., Video Google: a text retrieval approach to object matching in videos
  • J. Zhang et al., Local features and kernels for classification of texture and object categories: a comprehensive study, Int. J. Comput. Vis. (2007)
  • T. Kohonen, Learning Vector Quantization for Pattern Recognition (1986)
  • J. Vogel et al., Semantic modeling of natural scenes for content-based image retrieval, Int. J. Comput. Vis. (2007)
  • F. Jurie et al., Creating efficient codebooks for visual recognition
There are more references available in the full text version of this article.

This paper has been recommended for acceptance by Bastian Leibe.
