Birds of a feather flock together: Visual representation with scale and class consistency
Introduction
Local features have recently demonstrated their effectiveness for image classification. They are often used in a bag-of-visual-words (BoW) manner [41]. However, the quantization loss is severe when each local feature is hard-assigned to its nearest visual word. To reduce this quantization loss, researchers have proposed various soft-assignment based methods [7], [9], [20], [25], [29], [44], of which sparse coding based techniques are widely used. Sparse coding attempts to minimize the summed reconstruction error under a sparsity constraint, and max pooling over the resulting codes then yields the image representation.
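As a concrete illustration of this pipeline (a minimal sketch, not the authors' exact formulation), the following code encodes each local descriptor x over a dictionary D by minimizing 0.5·‖x − Da‖² + λ‖a‖₁ with ISTA, a standard sparse coding solver, and then max-pools the codes into one image representation. The dictionary here is random for illustration; in practice it would be learned from training descriptors.

```python
import numpy as np

def sparse_code(x, D, lam=0.1, n_iter=100):
    """Encode descriptor x over dictionary D (d x k) with ISTA,
    minimizing 0.5 * ||x - D a||^2 + lam * ||a||_1."""
    L = np.linalg.norm(D, 2) ** 2           # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = a - D.T @ (D @ a - x) / L       # gradient step on the reconstruction term
        a = np.sign(a) * np.maximum(np.abs(a) - lam / L, 0.0)  # soft thresholding
    return a

def max_pool(codes):
    """Image-level representation: element-wise max of absolute code values."""
    return np.max(np.abs(codes), axis=0)

# Toy example: 50 local descriptors of dimension 64, dictionary of 256 atoms.
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 256))
D /= np.linalg.norm(D, axis=0)              # unit-norm atoms
feats = rng.standard_normal((50, 64))
codes = np.stack([sparse_code(f, D) for f in feats])
image_rep = max_pool(codes)                 # one 256-d vector for the image
```

The soft-thresholding step is what produces exact zeros in the codes, which is why max pooling over them is both fast and discriminative.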
To cope with image variations, researchers often densely extract local features at multiple scales [12], [14]; however, the scale information itself is simply discarded. Object detection based methods [23], [26] help alleviate this problem, but they rely heavily on the accuracy of the detection results and require more labeled samples and computational power. To avoid explicit object detection, researchers have also tried saliency-based methods; however, the objective of saliency measurement is not aligned with that of image classification. In fact, the local feature extraction process itself already provides useful information: the scales at which features are densely extracted can measure the relative sizes of local regions, yet this information is typically ignored. If two images of the same class contain objects of different sizes, the scale information helps represent them discriminatively. Exploiting scale information together with visual similarity in a unified framework therefore allows images to be represented more effectively.
Thousands of local features are often extracted from a single image. To encode them, researchers either encode each local feature independently and iteratively [25] or use an online feature encoding technique [12]. However, local features are inherently correlated. To encode them jointly, nearest-neighbor information has been used [7], [44], and the spatial relationships among local features have also been widely explored [14], [21], [34]. However, visually similar features may belong to different classes, so considering only visual similarity cannot fully exploit the information in the local features. The class information should therefore also be used to boost classification performance [18].
To make full use of local feature information for image representation and classification, in this paper we propose a novel discriminative scale and class consistent local feature encoding technique. Instead of using only the visual information of the local features, we also consider their scale and class information: the original image is first mapped into multi-scale spaces through a convolution, and local features are then densely extracted. To exploit the correlations among local features, we jointly optimize the encoding parameters with scale consistency, and the class information of the local features is further incorporated into the encoding. In this manner, we obtain more effective image representations than other local feature encoding based methods. Image classification experiments on several public image datasets demonstrate the effectiveness of the proposed method. Fig. 1 provides a flowchart of the proposed method.
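The multi-scale mapping described above can be sketched as a separable Gaussian convolution applied at several standard deviations; the specific sigmas and truncation radius below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1-D Gaussian kernel truncated at 3 sigma, normalized to sum to 1."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def scale_space(img, sigmas=(1.0, 2.0, 4.0)):
    """Map a grayscale image to multiple scales via separable Gaussian convolution."""
    layers = []
    for sigma in sigmas:
        k = gaussian_kernel(sigma)
        rows = np.apply_along_axis(np.convolve, 1, img, k, mode="same")              # blur rows
        layers.append(np.apply_along_axis(np.convolve, 0, rows, k, mode="same"))     # blur columns
    return layers

# Toy usage on a random 32x32 "image"; each layer keeps the original size.
layers = scale_space(np.random.default_rng(0).standard_normal((32, 32)))
```

Dense local features (e.g., patch descriptors) would then be extracted from every layer, with the layer index carried along as the feature's scale attribute.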
There are three main contributions of this study:
- First, we use the scale information of the local features along with visual similarities to encode the local features.
- Second, the class information of the local features is also used to boost the discriminative power of the image representations.
- Third, we jointly encode the local features with a sparsity constraint to achieve a superior image classification performance over other baseline methods.
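The exact joint objective is not reproduced in this snippet; a plausible form of such a jointly regularized encoding, with hypothetical weights $w_{ij}$ derived from the scale and class affinity between features $x_i$ and $x_j$, would combine reconstruction, sparsity, and pairwise code consistency:

```latex
\min_{\{a_i\}} \; \sum_i \left\| x_i - D a_i \right\|_2^2
  \; + \; \lambda \sum_i \left\| a_i \right\|_1
  \; + \; \gamma \sum_{i,j} w_{ij} \left\| a_i - a_j \right\|_2^2
```

Here $D$ is the dictionary, $a_i$ is the sparse code of feature $x_i$, and $\lambda$, $\gamma$ trade off sparsity against consistency; all symbols beyond the standard sparse coding ones are assumptions for illustration.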
The rest of this paper is organized as follows. Related works are described in Section 2. The details of the proposed scale and class consistent visual representation method are provided in Section 3. Image classification experiments conducted on several public image datasets are described in Section 4. Finally, some concluding remarks are given in Section 5.
Related work
The BoW model has been widely used for image representation. It first extracts local features and then encodes them based on nearest-neighbor assignment. To reduce the quantization loss, soft-encoding strategies have become popular [7], [9], [20], [22], [25], [26], [44], [50]. Gemert et al. [9] used a kernel trick by softly assigning each feature to a number of visual words instead of only one. Zhang et al. [35] made use of the sparse coding technique for image classification. Wang et al. [25]
Discriminative visual representation with scale and class consistency
In this section, we provide details on the proposed discriminative scale and class consistent visual representation method for image classification.
Experiments
To evaluate the effectiveness of the proposed scale and class consistent sparse coding method (SC2), we conducted image classification experiments on several public datasets that are widely used by researchers: the Scene-15 dataset [14], UIUC-Sports dataset [16], Caltech-256 dataset [10] and Flower-17 dataset [19]. To show the influence of the scale and class consistency more clearly, we also provide the performances when using only the scale consistency (SSC) and only the class consistency (CSC).
Conclusion
In this paper, we proposed a novel discriminative sparse coding method with scale and class consistency for image classification. The images are first mapped into scale spaces through a convolution, and the local features are then densely extracted. We encode the local features by minimizing the reconstruction error along with scale consistency for the image representation. In this way, we are able to cope with the variations in the objects to a certain extent.
Acknowledgments
This work is supported by the National Natural Science Foundation of China, Nos. 61303154 and 61332016, and the Scientific Research Key Program of Beijing Municipal Commission of Education (KZ201610005012). Dr. Qi Tian has been supported in part through ARO grant W911NF-15-1-0290, Faculty Research Gift Awards by NEC Laboratories America, and Blippar. This work is also supported in part by the National Science Foundation of China (NSFC), No. 61429201.
References (50)
- et al., Image-based facial sketch-to-photo synthesis via online coupled dictionary learning, Inf. Sci. (NY), 2012.
- et al., Image-specific classification with local and global discriminations, IEEE Trans. Neural Netw. Learn. Syst., 2017.
- et al., Contextual exemplar classifier based image representation for classification, IEEE Trans. Circuits Syst. Video Technol., 2017.
- et al., Image classification by non-negative sparse coding, low-rank and sparse decomposition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
- et al., Image classification by search with explicitly and implicitly semantic representations, Inf. Sci. (NY), 2017.
- et al., Image class prediction by joint object, context and background modeling, IEEE Trans. Circuits Syst. Video Technol., 2016.
- et al., Multiple kernel learning, conic duality, and the SMO algorithm, Proceedings of the International Conference on Machine Learning (ICML), 2004.
- et al., In defense of nearest-neighbor based image classification, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
- et al., The devil is in the details: an evaluation of recent feature encoding methods, Proceedings of the British Machine Vision Conference (BMVC), 2011.
- et al., Histograms of oriented gradients for human detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005.