An efficient approach for scene categorization based on discriminative codebook learning in bag-of-words framework☆
Introduction
Scene categorization is an important problem in pattern recognition and computer vision. It has many applications, such as keyword suggestion (offering semantic labels associated with the image content), retrieval (filtering images on the Internet based on keywords), and browsing (grouping images by keywords instead of clustering content features). In recent years, many automatic techniques for assigning semantic labels to images have been designed to improve the performance of these applications. Early work on scene categorization used low-level global features extracted from the whole image. However, these representations lacked local information and were only used to classify images into a small number of categories such as indoor/outdoor and man-made/natural. More recent approaches exploit local statistics in images, often modeling scenes as collections of local descriptors obtained from interest point detectors, dense sampling of patches, or segmentation. For example, the bag-of-words (BoW) models for image categorization of Csurka et al. [1], Sivic and Zisserman [2], and Zhang et al. [3] use the following steps: (i) quantizing high-dimensional descriptors of local image patches into discrete codewords, (ii) forming BoW histograms based on the codeword distribution, and (iii) training classifiers on these histograms. The BoW framework is widely used in multimedia categorization due to its effectiveness and high efficiency.
Clustering methods are often used to train codebooks, and among these, K-means is a popular choice for codebook learning in scene categorization. In conventional K-means clustering, the objective of the codebook design is to minimize the expected distortion, which is suitable for compressing high-dimensional data. However, a codebook designed in this manner does not ensure that the codewords are effective for categorization. In order to train more effective classifiers on the histograms obtained from the codebook, the codewords should be designed to be discriminative among different categories.
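As a reference point, distortion-minimizing codebook learning with K-means can be sketched as follows (a minimal NumPy illustration on synthetic two-dimensional "descriptors"; the toy data, dimensionality, and codebook size are assumptions for illustration, not the SIFT setup used later in the paper):

```python
import numpy as np

def kmeans_codebook(descriptors, k, iters=20, seed=0):
    """Learn k codewords (cluster centers) with plain K-means, i.e. by
    minimizing expected distortion rather than discriminative power."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign every descriptor to its nearest center (Euclidean distance)
        dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of its assigned descriptors
        for j in range(k):
            members = descriptors[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers

# synthetic "descriptors" drawn around two well-separated modes
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0.0, 0.1, size=(50, 2)),
                  rng.normal(5.0, 0.1, size=(50, 2))])
codebook = kmeans_codebook(data, k=2)
```

Note that nothing in this objective uses class labels, which is exactly the limitation the discriminative methods below address.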
There are several existing supervised learning approaches for codebook generation. Kohonen [4] proposed Learning Vector Quantization for supervised quantizer design using Voronoi partitions based on self-organizing maps. A semantic vocabulary is presented by Vogel and Schiele [5]: the authors construct a vocabulary by labeling image patches with semantic labels, e.g. sky, water, or vegetation. Jurie and Triggs [6] combined on-line clustering with the mean-shift algorithm to generate a codebook for image classification. Zhang et al. [7] proposed a codebook generation method called Codebook+ which minimizes the ratio of within-class variation to between-class variation.
Winn et al. [8] proposed describing each visual word by a mixture of Gaussians in feature space and constructing a compact, lightweight vocabulary by greedily merging an initially large vocabulary. Specifically, by assuming a Gaussian distribution of image histograms, the probability that a histogram belongs to a certain object class can be estimated. The goal of the learning algorithm is then to find the mapping ϕ between histograms which maximizes the conditional probability of the ground-truth labels on the training data. The mapping ϕ given by the algorithm is defined by the merging operations on visual words in the original vocabulary. By evaluating the conditional probability of the ground-truth labels for every possible grouping of cluster bins, the optimal mapping is chosen, producing a more discriminative vocabulary.
Perronnin [9] proposed characterizing an image by a set of histograms, one per class, where each histogram describes whether the image content is better modeled by a universal vocabulary or by the corresponding class vocabulary. Gaussian Mixture Models (GMMs) are constructed from the data samples: the universal vocabulary is trained using maximum likelihood estimation (MLE), and the class vocabularies are adapted from the universal vocabulary using the maximum a posteriori (MAP) criterion. When applying MAP adaptation, relevance factors for the Gaussian weight, mean, and variance parameters are added to the standard expectation-maximization procedure. The relevance factors are constrained to be equal and then determined experimentally to maximize classification accuracy.
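The mean-adaptation step of such a MAP scheme can be illustrated as follows (a simplified sketch assuming isotropic, equal-weight Gaussians; the relevance factor r and bandwidth σ are illustrative values, and the weight and variance adaptations are omitted):

```python
import numpy as np

def map_adapt_means(universal_means, class_data, r=16.0, sigma=1.0):
    """MAP-adapt GMM means toward class data: each adapted mean is a
    convex blend of the class's soft data mean and the universal mean,
    weighted by the soft count n_k against the relevance factor r."""
    d2 = ((class_data[:, None, :] - universal_means[None, :, :]) ** 2).sum(axis=2)
    # responsibilities under isotropic Gaussians (row-wise shift for stability)
    resp = np.exp(-0.5 * (d2 - d2.min(axis=1, keepdims=True)) / sigma ** 2)
    resp /= resp.sum(axis=1, keepdims=True)
    n_k = resp.sum(axis=0)                                       # soft counts
    ex = resp.T @ class_data / np.maximum(n_k, 1e-12)[:, None]   # soft data means
    alpha = (n_k / (n_k + r))[:, None]                           # data-vs-prior weight
    return alpha * ex + (1.0 - alpha) * universal_means

# toy example: universal means at (0,0) and (5,5); class data only near (1,1)
universal = np.array([[0.0, 0.0], [5.0, 5.0]])
rng = np.random.default_rng(0)
class_data = rng.normal(1.0, 0.05, size=(40, 2))
adapted = map_adapt_means(universal, class_data)
```

Components that attract many class samples move toward the class data, while components with negligible soft counts stay at the universal prior, which is the intended behavior of the relevance factor.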
All the abovementioned schemes can be regarded as clustering image patch descriptors according to certain objective functions. In these methods the codebook is trained to be discriminative; however, the densities of the descriptors over the codewords are estimated by a histogram, which is essentially a nearest-neighbor quantization. Recently, van Gemert et al. [10] presented a codebook generation method based on the theory of kernel density estimation, using three criteria: kernel codebook, codeword plausibility, and codeword uncertainty. In their scheme, 'codeword uncertainty' (UNC) based on kernel density estimation consistently outperforms the histogram-based BoW methods. However, with UNC the codewords are not trained to be discriminative, and generating the image histograms typically involves hundreds of thousands of evaluations of the Gaussian function, so computationally efficient methods need to be investigated.
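The UNC criterion replaces the hard nearest-neighbor vote with a kernel-weighted soft assignment; a minimal sketch with a Gaussian kernel might look like this (σ is an illustrative parameter, not the bandwidth tuned in [10]):

```python
import numpy as np

def unc_histogram(patches, codewords, sigma=1.0):
    """Codeword uncertainty: each patch spreads one unit of mass over ALL
    codewords in proportion to a Gaussian kernel of its distance, instead
    of voting only for its single nearest codeword."""
    d2 = ((patches[:, None, :] - codewords[None, :, :]) ** 2).sum(axis=2)
    # the row-wise shift is for numerical stability; it cancels after normalization
    k = np.exp(-0.5 * (d2 - d2.min(axis=1, keepdims=True)) / sigma ** 2)
    k /= k.sum(axis=1, keepdims=True)   # normalize per patch
    return k.mean(axis=0)               # average patch-level soft assignments

# three 1-D patches against two codewords; the middle patch is ambiguous
codewords = np.array([[0.0], [4.0]])
patches = np.array([[0.0], [2.0], [4.0]])
hist = unc_histogram(patches, codewords)
```

Each patch contributes fractional mass to every codeword, which is also where the large number of Gaussian evaluations mentioned above comes from.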
In this paper, we present a new, efficient method for iteratively refining discriminative codewords based on the theory of kernel density estimation. The proposed iterative clustering method is shown to be competitive with state-of-the-art algorithms. In addition, the proposed method imposes no extra online feature-extraction cost compared to the baseline BoW. The main contribution of this paper is a new method for codebook generation in scene categorization that determines the discriminative power of a codeword based on kernel density estimation without increasing the online computation for categorizing test images. The codewords are tuned iteratively according to the soft relevance of image patch descriptors from different categories. Experimental results demonstrate its superiority in classification accuracy and efficiency over both the histogram-based method and codeword uncertainty.
The rest of this paper is organized as follows. Section 2 outlines the overview of our proposed method. Section 3 illustrates the proposed codebook generation method together with the associated iterative optimization algorithm. Section 4 validates our method of generating effective codewords for bag-of-words image categorization with comparison to other state-of-the-art work. Finally, Section 5 concludes the paper.
Overview of the proposed scheme
The main idea of our paper is to find discriminative codewords, that is, to find codewords that are representative for a specific category and yet sufficiently discriminative from other categories. In order to achieve this goal, first we estimate the density distribution of image patch descriptors using the theory of kernel density estimation (KDE) instead of the conventional BoW histogram. Using these density distributions, we can evaluate the discriminative power of each codeword according to
The proposed codebook generation method
After clustering the SIFT descriptors of training image patches, we obtain the codewords, which are the cluster centers. For each codeword c in the codebook, the traditional codebook model estimates the distribution of codewords in an image by a histogram as follows:

$$H(c) = \frac{1}{N} \sum_{i=1}^{N} I\!\left(c = \arg\min_{c'} D(c', v_i)\right)$$

where N is the number of patches in an image, vi is the descriptor of an image patch, D(⋅,⋅) is the Euclidean distance, and I(⋅) is the indicator function.
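This nearest-neighbor histogram translates directly into code (a minimal NumPy sketch; the indicator is realized by an argmin over codewords):

```python
import numpy as np

def bow_histogram(patch_descriptors, codewords):
    """H(c) = (1/N) * sum_i I(c is the nearest codeword to v_i):
    count nearest-codeword votes, then divide by the patch count N."""
    dists = np.linalg.norm(
        patch_descriptors[:, None, :] - codewords[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)                    # index of nearest codeword
    counts = np.bincount(nearest, minlength=len(codewords))
    return counts / len(patch_descriptors)            # histogram sums to 1

# two codewords, four patches split evenly between them
codewords = np.array([[0.0, 0.0], [10.0, 10.0]])
patches = np.array([[0.1, 0.0], [0.0, 0.2], [9.9, 10.0], [10.0, 9.8]])
hist = bow_histogram(patches, codewords)
```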
A robust alternative to histograms for estimating
Datasets
This section presents experimental evaluations on two datasets: the UIUC Scene-15 dataset [12] and the NTU Scene-25 dataset. The Scene-15 dataset is commonly used by many other works as a benchmark for comparison. It consists of 4485 images taken from 15 different scene categories and is quite challenging — for example, it is difficult to distinguish indoor categories such as bedroom and living room. Sample images from the UIUC Scene-15 dataset are shown in Fig. 3.
In order to further validate the
Conclusion
This paper has considered the problem of discriminative codebook design for improving classification accuracy in bag-of-words scene categorization. In this work, a kernel density estimator is used to refine codewords iteratively to obtain more discriminative codewords that enhance classification. Experimental results show that the proposed method achieves better classification performance and lower online computational cost than other state-of-the-art methods.
Acknowledgment
We would like to thank Dr. Jan C. van Gemert for kindly providing the source code of his algorithm. This work is supported by Agency for Science, Technology and Research (A*STAR), Singapore under SERC grant 062 130 0055.
References (13)
- Csurka et al., Visual categorization with bags of keypoints
- Sivic and Zisserman, Video Google: a text retrieval approach to object matching in videos
- Zhang et al., Local features and kernels for classification of texture and object categories: a comprehensive study, Int. J. Comput. Vis. (2007)
- Kohonen, Learning Vector Quantization for Pattern Recognition (1986)
- Vogel and Schiele, Semantic modeling of natural scenes for content-based image retrieval, Int. J. Comput. Vis. (2007)
- Jurie and Triggs, Creating efficient codebooks for visual recognition
☆ This paper has been recommended for acceptance by Bastian Leibe.