Pattern Recognition Letters

Volume 129, January 2020, Pages 158-165

A bag of constrained informative deep visual words for image retrieval

https://doi.org/10.1016/j.patrec.2019.11.011

Highlights

  • A new bag of constrained informative deep visual words (BoCIDVW) model for image retrieval

  • Combination of deep features, information theory and constrained clustering

  • Unsupervised clustering constraints built from mutual information

Abstract

In this paper, we propose a bag of constrained informative deep visual words (BoCIDVW) model for image retrieval. Informative patches are first selected from each image using patch entropy values. Each such patch is represented by deep features extracted through VGG16-Net. Two sets of constraints, namely must-link (ML) and cannot-link (CL), are obtained for each deep informative patch in an unsupervised manner from its mutual information values with other patches. The patches are then quantized using the Linear-time Constrained Vector Quantization Error (LCVQE) algorithm, a fast yet accurate constrained K-means method. The resulting clusters, which we term constrained informative deep visual words, are employed to label each patch. Finally, a bag (histogram) of constrained informative deep visual words is built for image retrieval. Experiments on three publicly available datasets demonstrate the merit of the proposed formulation.

Introduction

Image retrieval [7] is a well-studied problem in the pattern recognition community. The Bag of Visual Words (BoVW) model and its variants have been used effectively for image retrieval for quite some time [28]. In the basic BoVW model, the constituent patches of an image are first represented by hand-crafted features such as SURF [1] or SIFT [15]. These patches are then quantized in the feature space by the K-means algorithm [13]. Finally, each image patch is assigned the label of its nearest cluster (visual word), and an image is represented by a bag (histogram) of visual words. Several works have reported improved BoVW models for image retrieval; for example, Dimitrovski et al. applied the BoVW model with predictive clustering trees to improve retrieval results [10]. Deep learning based approaches, on the other hand, have become increasingly popular for solving the retrieval problem [12], [26].
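
To fix notation for what follows, this basic BoVW pipeline can be sketched in a few lines. The sketch assumes local descriptors (e.g., SIFT or SURF vectors) are already extracted for each image; the vocabulary size k = 500 and the use of scikit-learn's KMeans are illustrative choices only, not those of any cited work.

    # Minimal BoVW sketch: descriptors (e.g., SIFT/SURF) are assumed to be
    # pre-extracted, one (n_patches, dim) array per image.
    import numpy as np
    from sklearn.cluster import KMeans

    def build_vocabulary(pooled_descriptors, k=500):
        """Quantize descriptors pooled over all training images into k visual words."""
        return KMeans(n_clusters=k, n_init=10).fit(pooled_descriptors)

    def bovw_histogram(descriptors, vocab):
        """Represent one image as an L1-normalized histogram of visual-word labels."""
        words = vocab.predict(descriptors)
        hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
        return hist / max(hist.sum(), 1.0)

An image is thus reduced to a fixed-length histogram, and retrieval amounts to comparing such histograms.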

In this paper, we propose a new patch-based model for image retrieval that combines deep features, information theoretic measures and constrained clustering. Not all patches extracted from an image carry significant information; in particular, patches from more homogeneous regions are less informative. Entropy is therefore used to select informative patches, i.e., patches from object regions with higher entropy values. These patches are then represented by deep features through VGG16-Net. Developing clustering constraints for the patches in a supervised manner would require storing all the patch labels, which is cumbersome. Instead, mutual information is employed to derive these constraints in an unsupervised manner. The Linear-time Constrained Vector Quantization Error (LCVQE) algorithm [23], a fast yet accurate constrained K-means method, is used to quantize the informative image patches. We term the resulting image representation model the Bag of Constrained Informative Deep Visual Words (BoCIDVW). For a preliminary version of this work, please see [19], where we developed a Bag of Constrained Visual Words (BoCVW) model. There are several key differences between the earlier conference version and the current journal version. First, the SURF features of the conference version are replaced by deep features. Second, information theory (entropy, mutual information) is used at different stages of the solution. Furthermore, we now include an algorithm, a time-complexity analysis and more theoretical exposition. Finally, exhaustive comparisons with several existing and more recent approaches are included. We summarize the contributions of this work below:

  • We use the mean and standard deviation of the entropy values of the patches in an image to select the more informative patches. For a given image, we first compute the entropy of every constituent patch, along with the mean and standard deviation of these values. We then retain the informative patches, i.e., those whose entropy is greater than or equal to the mean entropy plus one standard deviation. The selected patches are represented using deep features through the VGG16 network, yielding deep informative patches (a sketch of this selection step follows this list).

  • Mutual information is employed to develop two sets of clustering constraints (must-link and cannot-link) in an unsupervised manner. Linear-time Constrained Vector Quantization Error (LCVQE), a constrained K-means clustering algorithm, is then applied to quantize the deep informative patches (see the sketch after this list). Finally, a Bag of Constrained Informative Deep Visual Words (BoCIDVW) is built for the purpose of image retrieval. The proposed model, a combination of deep features, information theoretic measures and constrained clustering, yields very competitive results on the publicly available Coil-100, Oxford-5K and Paris-6K datasets in terms of mean average precision (mAP) values.
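
To make these steps concrete, the following minimal sketch implements entropy-based patch selection, mutual-information-based constraint generation, and one constrained assignment pass. It is an illustrative sketch under stated assumptions, not the paper's implementation: it computes histogram entropy and MI on raw 8-bit grayscale patches (whereas the paper derives MI values for deep informative patches), the thresholds tau_ml and tau_cl are hand-picked, and a COP-KMeans-style hard-constraint assignment stands in for LCVQE, which penalizes rather than forbids constraint violations.

    # Illustrative sketch only; see the assumptions stated above.
    import numpy as np

    def patch_entropy(patch, bins=256):
        """Shannon entropy (bits) of an 8-bit grayscale patch histogram."""
        hist, _ = np.histogram(patch, bins=bins, range=(0, 256))
        p = hist[hist > 0] / hist.sum()
        return float(-(p * np.log2(p)).sum())

    def select_informative(patches):
        """Keep patches whose entropy >= mean + one standard deviation."""
        e = np.array([patch_entropy(p) for p in patches])
        keep = e >= e.mean() + e.std()
        return [p for p, k in zip(patches, keep) if k]

    def mutual_information(a, b, bins=64):
        """Histogram-based MI (bits) between two equally sized patches."""
        pxy, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
        pxy /= pxy.sum()
        px, py = pxy.sum(1, keepdims=True), pxy.sum(0, keepdims=True)
        nz = pxy > 0
        return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())

    def build_constraints(patches, tau_ml=2.0, tau_cl=0.1):
        """High-MI pairs become must-link; low-MI pairs cannot-link.
        The thresholds tau_ml/tau_cl are illustrative, not from the paper."""
        ml, cl = [], []
        for i in range(len(patches)):
            for j in range(i + 1, len(patches)):
                mi = mutual_information(patches[i], patches[j])
                if mi >= tau_ml:
                    ml.append((i, j))
                elif mi <= tau_cl:
                    cl.append((i, j))
        return ml, cl

    def _partners(pairs, i):
        """Indices paired with point i in a constraint list."""
        return [b if a == i else a for a, b in pairs if i in (a, b)]

    def constrained_assign(X, centers, ml, cl):
        """One hard-constraint assignment pass (COP-KMeans style; LCVQE
        instead trades off violations against quantization error).
        Points with no feasible cluster keep the label -1."""
        labels = -np.ones(len(X), dtype=int)
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        for i in range(len(X)):
            for c in np.argsort(dists[i]):
                if all(labels[j] in (-1, c) for j in _partners(ml, i)) and \
                   all(labels[j] != c for j in _partners(cl, i)):
                    labels[i] = int(c)
                    break
        return labels

In the full model, each retained patch would be described by VGG16 features before the constrained quantization, and the resulting visual-word labels would be accumulated into the BoCIDVW histogram in the same way as in the basic BoVW pipeline.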

The rest of the paper is organized as follows: in Section 2, we discuss related work; in Section 3, we describe our method; in Section 4, we present the experimental results; and in Section 5, we conclude the paper with directions for future research.

Section snippets

Related work

Different classes of solutions exist for the image retrieval problem [7]. We first discuss existing BoVW models for image retrieval. Then, we present some important deep learning based approaches in image retrieval.

For initial research on BoVW based solutions, please see [28]. The authors in [3] improved the retrieval results by introducing a fuzzy visual word assignment model. Mukherjee et al. [17] designed an assignment model based on an affinity function between a patch and a cluster…

Proposed method

In this section, we describe in detail the different components of the proposed BoCIDVW model.

Experimental results

We have evaluated the proposed image retrieval framework on three benchmark datasets, namely Coil-100 [20], Oxford-5K [24] and Paris-6K [25]. All experiments are carried out in the MATLAB R2018a environment on a desktop PC with an Intel Xeon(R) E5-2690 v4 CPU @ 2.60 GHz (16 cores), 128 GB of DDR2 memory and an NVIDIA Quadro K2200 GPU.

Conclusion

In this work, we have proposed a new model for image retrieval based on deep features, information theory and constrained clustering. Entropy is used to select the more informative patches, which are then represented using deep features through a pre-trained VGG16-Net. Two types of clustering constraints, namely must-link and cannot-link, are captured in an unsupervised manner using mutual information. The LCVQE algorithm is used to cluster the deep informative patches…

Declaration of Competing Interest

We hereby declare that we do not have any conflict of interest for this manuscript.

References (33)

  • I. Davidson et al.

    Clustering with constraints: feasibility issues and the k-means algorithm

    Proceedings of the 2005 SIAM International Conference on Data Mining

    (2005)
  • D.L. Davies et al.

    A cluster separation measure

    IEEE Trans. Pattern Anal. Mach. Intell.

    (1979)
  • H. Fu et al.

    Fast semantic image retrieval based on random forest

    Proceedings of the 20th ACM International Conference on Multimedia

    (2012)
  • A. Gordo et al.

    Deep image retrieval: learning global representations for image search

    ECCV

    (2016)
  • J.A. Hartigan et al.

    Algorithm AS 136: A K-means clustering algorithm

    J. R. Stat. Soc. Ser. C

    (1979)
  • X. Li et al.

    Pairwise geometric matching for large-scale object retrieval

    CVPR

    (2015)