
Information Sciences

Volumes 460–461, September 2018, Pages 115–127

Birds of a feather flock together: Visual representation with scale and class consistency

https://doi.org/10.1016/j.ins.2018.05.048

Highlights

  • We make use of the scale information of local features along with visual similarities.
  • The class information of local features is also used to boost the discriminative power.
  • We encode local features with sparsity constraints jointly instead of independently.

Abstract

There are three problems with local-feature based representation schemes. First, local regions are often densely extracted, or determined through detection, without considering their scales. Second, local features are encoded separately, leaving the relationships among them unconsidered. Third, local features are encoded without considering class information. To solve these problems, we propose a scale and class consistent local-feature encoding method for image representation, which densely extracts local features in different scale spaces and then learns the encoding parameters. In addition, instead of encoding each local feature independently, we jointly optimize the encoding parameters of all local features. Moreover, we impose class consistency during the local-feature encoding process. We test the discriminative power of the resulting image representations on image classification tasks. Experiments on several public image datasets demonstrate that the proposed method achieves superior performance compared with many other local-feature based methods.

Introduction

Local features have recently demonstrated their effectiveness for image classification. They are often used in a bag-of-visual-words (BoW) manner [41]. However, the quantization loss is heavy when each local feature is assigned to its nearest visual word. To reduce this quantization loss, researchers have proposed various soft-assignment based methods [7], [9], [20], [25], [29], [44], among which sparse coding is widely used. Sparse coding minimizes the summed reconstruction error under a sparsity constraint, and max pooling is then used to extract the image representation.
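
As background, the following minimal sketch (using scikit-learn, with an illustrative dictionary size, sparsity weight, and descriptor dimensionality rather than values from the paper) shows this standard baseline pipeline: sparse-code local descriptors against a learned dictionary, then max-pool the codes into a single image-level vector.

```python
# Baseline sketch: sparse coding of local descriptors + max pooling.
# All sizes and weights below are illustrative assumptions.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning, sparse_encode

rng = np.random.default_rng(0)
descriptors = rng.standard_normal((2000, 128))   # stand-in for dense SIFT-like features

# Learn a visual dictionary (codebook) from the local descriptors.
dico = MiniBatchDictionaryLearning(n_components=256, alpha=1.0, random_state=0)
dico.fit(descriptors)

# Sparse-code each descriptor: min ||x - D c||^2 + alpha * ||c||_1.
codes = sparse_encode(descriptors, dico.components_,
                      algorithm="lasso_lars", alpha=0.15)

# Max pooling over all local codes yields the image representation.
image_repr = np.abs(codes).max(axis=0)           # shape: (256,)
```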

To cope with image variance, researchers often densely extract local features at multiple scales [12], [14]; however, the scale information itself is simply discarded. Object detection based methods [23], [26] help alleviate this problem, but they rely heavily on the accuracy of the detection results and require more labeled samples and computational power. To avoid explicit object detection, researchers have also tried salience-based methods, yet the objective of salience measurement is not aligned with that of image classification. In fact, the local feature extraction process itself yields plenty of usable information. For example, local features are often densely extracted at multiple scales, yet the scale information, which measures the relative sizes of local regions, is usually ignored. If two images of the same class differ in size, the scale information helps represent them discriminatively. If we can exploit the scale information along with visual similarity in a unified framework, we can represent images more effectively.
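
A minimal sketch of what retaining the scale tag could look like: build a Gaussian scale space by convolution, densely sample patches at each level, and keep each patch's scale alongside its descriptor. The sigmas, grid step, and patch size are illustrative, not the paper's settings.

```python
# Dense multi-scale extraction that keeps the scale tag instead of discarding it.
import numpy as np
from scipy.ndimage import gaussian_filter

def dense_patches_with_scale(image, sigmas=(1.0, 2.0, 4.0), patch=16, step=8):
    """Return (patch_vector, sigma) pairs from each level of a Gaussian scale space."""
    features = []
    for sigma in sigmas:
        smoothed = gaussian_filter(image, sigma)      # map the image into scale space
        h, w = smoothed.shape
        for y in range(0, h - patch + 1, step):
            for x in range(0, w - patch + 1, step):
                vec = smoothed[y:y + patch, x:x + patch].ravel()
                features.append((vec, sigma))         # keep the scale tag with the descriptor
    return features

feats = dense_patches_with_scale(np.random.rand(128, 128))
```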

Thousands of local features are often extracted from a single image. To encode them, researchers typically encode each local feature iteratively [25], or use an online feature-encoding technique [12]. However, local features are inherently correlated. To encode them jointly, nearest neighbor information has been used [7], [44], and the spatial relationships of local features have also been widely explored [14], [21], [34]. Still, visually similar features may belong to different classes; considering only the visual similarities of local features cannot fully exploit their useful information. Class information should therefore also be used to boost classification performance [18].
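
To make the joint-encoding idea concrete, here is a hedged sketch, in the spirit of Laplacian sparse coding [7] rather than the paper's own algorithm, of an ISTA-style loop that adds a graph penalty pulling the codes of similar features together. The function name, step size, and weights are illustrative assumptions.

```python
# Joint sparse encoding sketch: reconstruction + l1 + graph smoothness term.
import numpy as np

def joint_sparse_encode(X, D, W, alpha=0.1, beta=0.1, lr=1e-3, iters=200):
    """X: (n, d) descriptors, D: (k, d) dictionary, W: (n, n) feature similarity."""
    L = np.diag(W.sum(axis=1)) - W                 # graph Laplacian over local features
    C = np.zeros((X.shape[0], D.shape[0]))         # codes, one row per local feature
    for _ in range(iters):
        # Gradient of ||X - C D||_F^2 + beta * tr(C^T L C).
        grad = 2 * (C @ D - X) @ D.T + 2 * beta * (L @ C)
        C = C - lr * grad                          # lr must stay small for stability
        # Soft thresholding enforces the l1 sparsity constraint.
        C = np.sign(C) * np.maximum(np.abs(C) - lr * alpha, 0.0)
    return C
```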

To make full use of local feature information for image representation and classification, we propose in this paper a novel discriminative scale and class consistent local-feature encoding technique. Instead of using only the visual information of the local features, we also consider their scale and class information: the original image is first mapped to multi-scale spaces through a convolution, and the local features are then densely extracted. To exploit the correlations among local features, we jointly optimize the encoding parameters with scale consistency. In addition, the class information of the local features is incorporated into the encoding. In this manner, we obtain more effective image representations than other local feature-encoding based methods. Image classification experiments on several public image datasets demonstrate the effectiveness of the proposed method. Fig. 1 provides a flowchart of the proposed method.

There are three main contributions of this study:

  • First, we use the scale information of the local features along with visual similarities to encode the local features.
  • Second, the class information of the local features is also used to boost the discriminative power of the image representations.
  • Third, we jointly encode the local features with a sparsity constraint to achieve a superior image classification performance over other baseline methods.

The rest of this paper is organized as follows. Related work is described in Section 2. The details of the proposed scale and class consistent visual representation method are provided in Section 3. Image classification experiments conducted on several public image datasets are described in Section 4. Finally, some concluding remarks are given in Section 5.

Section snippets

Related work

The BoW model has been widely used for image representation. It first extracts local features and then encodes them with a nearest neighbor assignment. To reduce the quantization loss, soft-encoding strategies have become popular [7], [9], [20], [22], [25], [26], [44], [50]. Gemert et al. [9] used a kernel trick, softly assigning each feature to a number of visual words instead of only one. Zhang et al. [35] made use of the sparse coding technique for image classification. Wang et al. [25]

Discriminative visual representation with scale and class consistency

In this section, we provide details on the proposed discriminative scale and class consistent visual representation method for image classification.
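
The details are truncated in this snippet view. As a rough reconstruction from the abstract and introduction (our notation and weighting, not necessarily the paper's exact formulation), the objective presumably combines a reconstruction error, an l1 sparsity penalty, and consistency terms that keep the codes of scale-related and same-class local features close:

```latex
\min_{C}\; \|X - CD\|_F^2
\;+\; \lambda \|C\|_1
\;+\; \beta \sum_{i,j} W^{\mathrm{scale}}_{ij}\,\|c_i - c_j\|_2^2
\;+\; \gamma \sum_{i,j} W^{\mathrm{class}}_{ij}\,\|c_i - c_j\|_2^2
```

Here X stacks the local descriptors, D is the dictionary, C collects the codes c_i, and W^scale and W^class are affinity matrices built from the scale tags and class labels; all symbols and the exact form of the consistency terms are our assumptions.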

Experiments

To evaluate the effectiveness of the proposed scale and class consistent sparse coding method (SC2), we conducted image classification experiments on several public datasets that are widely used by researchers: the Scene-15 dataset [14], the UIUC-Sports dataset [16], the Caltech-256 dataset [10], and the Flower-17 dataset [19]. To show the influence of the scale and class consistency more clearly, we also report the performance when using only scale consistency (SSC) or only class consistency (CSC). There
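
A hedged sketch of the evaluation protocol implied above: train a linear classifier on the pooled image representations and report accuracy. The feature dimensionality, class count, and split sizes below are placeholders, not the protocol from the paper.

```python
# Evaluation sketch: linear SVM on image-level representations.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X_train, y_train = rng.random((100, 256)), rng.integers(0, 15, 100)  # stand-in features/labels
X_test, y_test = rng.random((50, 256)), rng.integers(0, 15, 50)

clf = LinearSVC(C=1.0).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```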

Conclusion

In this paper, we proposed a novel discriminative sparse coding method with scale and class consistency for image classification. The images are first mapped into scale spaces through a convolution, and local features are then densely extracted. We encode the local features by minimizing the reconstruction error along with scale consistency for the image representation. In this way, we can cope with variations in the objects to a certain extent. In addition, instead of

Acknowledgments

This work is supported by the National Natural Science Foundation of China, Nos. 61303154 and 61332016, and the Scientific Research Key Program of Beijing Municipal Commission of Education (KZ201610005012). Dr. Qi Tian has been supported in part through ARO grant W911NF-15-1-0290, Faculty Research Gift Awards by NEC Laboratories America, and Blippar. This work is also supported in part by the National Science Foundation of China (NSFC), No. 61429201.

References (50)

  • J. Donahue et al., DeCAF: a deep convolutional activation feature for generic visual recognition, Proceedings of the International Conference on Machine Learning, 2014.
  • M. Everingham et al., The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results, 2007.
  • S. Gao et al., Laplacian sparse coding, hypergraph Laplacian sparse coding, and applications, IEEE Trans. Pattern Anal. Mach. Intell., 2013.
  • P. Gehler et al., On feature combination for multiclass object classification, Proceedings of the International Conference on Computer Vision (ICCV), 2009.
  • J. Gemert et al., Visual word ambiguity, 2010.
  • G. Griffin, A. Holub, P. Perona, The Caltech 256 Dataset, 2006, Caltech Technical...
  • Y. Han et al., Compact and discriminative descriptor inference using multi-cues, IEEE Trans. Image Process., 2015.
  • N. Kulkarni et al., Discriminative affine sparse codes for image classification, Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
  • I. Kuzborskij et al., When Naive Bayes nearest neighbors meet convolutional neural networks, Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • S. Lazebnik et al., Beyond bags of features: spatial pyramid matching for recognizing natural scene categories, Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2006.
  • H. Lee et al., Efficient sparse coding algorithms, Proceedings of Neural Information Processing Systems (NIPS), 2006.
  • L. Li et al., What, where and who? Classifying events by scene and object recognition, Proceedings of the International Conference on Computer Vision (ICCV), 2007.
  • D. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis., 2004.
  • F. Moosmann et al., Randomized clustering forests for image classification, 2008.
  • M. Nilsback et al., A visual vocabulary for flower classification, Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2006.