SURF binarization and fast codebook construction for image retrieval

https://doi.org/10.1016/j.jvcir.2017.08.006

Highlights

  • SURF binarization and dimensionality reduction are proposed to reduce 64-dimensional SURF descriptors to 8-dimensional descriptors.

  • A two-step clustering algorithm is proposed for clustering large scale samples.

  • A scalable overlapping partition method is proposed to reduce the computational cost for object search.

Abstract

A new framework for image retrieval/object search is proposed based on the VLAD model and SURF descriptors to improve the codebook construction speed, the image matching accuracy, and the online retrieval speed, and to reduce data storage. First, SURF binarization and dimensionality reduction methods are proposed to convert a 64-dimensional SURF descriptor into an 8-dimensional descriptor. Second, a two-step clustering algorithm is proposed for codebook construction to significantly reduce the computational cost of clustering while maintaining the accuracy of the clustering results. Moreover, for object search, a scalable overlapping partition method is proposed to segment an image into 65 patches of different sizes so that the object can be matched quickly and efficiently. Finally, a feature fusion strategy is employed to compensate for the performance degradation caused by the information loss of our proposed dimensionality reduction method. Experiments on the Holidays and Oxford datasets demonstrate the effectiveness and efficiency of the proposed algorithms.

Introduction

Despite tremendous improvements in whole-image retrieval techniques [1], [2], [3], [4], [5], [6], visual object search in large-scale image sets remains a challenging problem. Content-based image retrieval denotes searching for similar images in a large set of images, where similar images are defined as images of the same object or similar scenes viewed under different imaging conditions [3], [4], [5]. Object search denotes using a given query object [7] to search a large corpus of images, determine the subset of images that contain the query object, and locate the object within them [7], [8], [9]. In contrast to similar-image retrieval, in object search the query object usually occupies only a small portion of an image and is subject to object deformations, 3-D viewpoint and lighting changes, occlusions and cluttered backgrounds, so the matched region may differ significantly from the query with respect to scale, viewpoint, color and orientation [8], [9]. These characteristics make object search difficult.

The fundamental problem of similar image search and object search based on local descriptors is to determine the true matching features between images. Most state-of-the-art image retrieval/object search systems, such as bag of visual words (BoVW) [1], vector of locally aggregated descriptors (VLAD) [2] and Fisher vectors (FV) [10], are based on local SIFT [11] or local SURF [12] descriptors. Both SIFT and SURF descriptors are scale, rotation and illumination invariant [11], [12]. However, using the VLAD representation and mean average precision (mAP) [7] as the evaluation metric, Spyromitros-Xioufis et al. [5], [13] showed that the performance of SURF is better than that of SIFT. Furthermore, SURF can be computed several times faster than SIFT because SURF uses integral images for image convolution and a fast Hessian matrix-based measure for the detector. Additionally, SURF is regarded as a better descriptor than SIFT with respect to distinctiveness, robustness and repeatability, and the dimension of the SURF descriptor is only half that of the SIFT descriptor, a significant advantage for both memory usage and time cost.
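The integral-image trick mentioned above is what makes SURF's box filters cheap: after a single pass over the image, the sum of any rectangular region costs only four array lookups. A minimal illustrative sketch (not the authors' implementation):

```python
import numpy as np

def integral_image(img):
    """Build the integral image: ii[y, x] = sum of img[:y, :x].
    A leading row/column of zeros avoids bounds checks in box_sum."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def box_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] in O(1) via four lookups."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
assert box_sum(ii, 1, 1, 3, 3) == img[1:3, 1:3].sum()  # 30
```

Because every box filter evaluation is constant time regardless of filter size, the Hessian-based detector can be run across scales without rescaling the image, which is the source of SURF's speed advantage.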

Image representation by a codebook is a critical and time-consuming task in image retrieval based on local descriptors. In both the BoVW and VLAD models, an image is represented directly through the approximate nearest neighbors of its descriptors in a codebook. However, the number of local descriptors extracted from an image may be in the thousands or even tens of thousands, which makes image representation based on local descriptors very slow. This problem has two aspects: the slow speed of codebook construction and the slow speed of local descriptor quantization. If the dimensionality of the local descriptors can be reduced, image representation will be accelerated. In addition, the VLAD model has demonstrated advantages in vector dimensionality reduction, storage and online retrieval speed. Thus, in this paper, we focus on accelerating image representation and improving retrieval accuracy based on the VLAD model.
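For concreteness, the VLAD aggregation the model relies on can be sketched as follows. This is an illustrative reimplementation of the standard encoder of [2]; the normalization choices (signed square root followed by L2) follow common practice and are assumptions here, not this paper's specific pipeline:

```python
import numpy as np

def vlad_encode(descriptors, codebook):
    """Aggregate the residuals (descriptor - codeword) over each codeword's
    hard assignments, then flatten and normalize."""
    K, d = codebook.shape
    # Hard-assign every descriptor to its nearest codeword (brute-force L2).
    dists = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    assign = dists.argmin(axis=1)
    v = np.zeros((K, d))
    for k in range(K):
        members = descriptors[assign == k]
        if len(members):
            v[k] = (members - codebook[k]).sum(axis=0)
    v = v.ravel()
    v = np.sign(v) * np.sqrt(np.abs(v))  # signed square-root normalization
    n = np.linalg.norm(v)
    return v / n if n > 0 else v         # L2 normalization
```

Note that the encoding cost is dominated by the nearest-codeword search over all descriptors, which is exactly the quantization bottleneck that lower-dimensional descriptors alleviate.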

The construction of the codebook is one of the key problems in image representation. Due to the high dimensionality of local descriptors (128-dimensional SIFT [11] descriptors, 64-dimensional SURF [12] descriptors, etc.) and the large size of the descriptor set, the computation is usually intensive and time consuming when commonly used methods, such as K-means, hierarchical K-means (HKM) [14] and approximate K-means (AKM) [7], are applied to cluster a large codebook (usually 20 k or larger for BoVW). The time and space complexity of the K-means, HKM and AKM clustering methods grow with the size of the dataset. To increase the clustering speed and decrease memory usage, we propose two solutions: (1) feature binarization and dimensionality reduction; (2) a two-step clustering algorithm whose cost depends mainly on the number of clusters and only weakly on the number of samples in the dataset. These two improvements significantly improve the clustering speed and reduce the memory usage of codebook construction.
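The paper's two-step clustering algorithm is detailed in Section 3. To illustrate the general idea only (first cluster manageable chunks of the data, then cluster the pooled chunk centroids so the expensive final step sees far fewer points), here is a hypothetical sketch; the chunking scheme and inner K-means are assumptions, not the authors' exact procedure:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's K-means, used here as the inner clustering routine."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        assign = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (assign == j).any():
                C[j] = X[assign == j].mean(axis=0)
    return C

def two_step_codebook(X, k, n_chunks=10):
    """Hypothetical two-step scheme: coarsely cluster each chunk, then
    cluster the pooled chunk centroids into the final k codewords, so the
    second step operates on n_chunks * k points instead of len(X)."""
    coarse = np.vstack([kmeans(chunk, k) for chunk in np.array_split(X, n_chunks)])
    return kmeans(coarse, k)
```

Under this scheme the memory footprint of the final clustering step is bounded by the number of intermediate centroids, which matches the motivation stated above: the cost is tied to the number of clusters rather than the number of samples.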

Most previous image search methods group features [15] of the whole image as a basic unit. However, for object search, only a portion of the image features needs to be matched, and some objects need to be localized. A common idea for object matching is bundling the co-occurring visual words within a specified spatial distance into a visual phrase [16], [17], [18]. Thus, many methods based on image patches, which divide an image into hundreds or even thousands of patches, have been proposed [8], [9], [19], [20], [21]. Although these methods enhance the discriminative power of visual words, they are costly in terms of both memory and time. To overcome this drawback, a scalable overlapping partition method is proposed in our object retrieval framework, which segments an image into only 65 patches of different sizes so that the object can be matched quickly and efficiently.
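The exact 65-patch layout is defined in Section 3; the flavor of such a coarse-to-fine overlapping grid can be conveyed with a hypothetical variant, where the scales and half-window strides below are illustrative assumptions, not the paper's parameters:

```python
def overlapping_patches(h, w, scales=(1, 2, 4)):
    """Hypothetical coarse-to-fine partition: at scale s, windows of size
    (h//s, w//s) slide with a half-window stride, giving (2s - 1)**2
    overlapping patches per scale."""
    patches = []
    for s in scales:
        ph, pw = h // s, w // s
        for i in range(2 * s - 1):
            for j in range(2 * s - 1):
                y, x = i * ph // 2, j * pw // 2
                patches.append((y, x, y + ph, x + pw))
    return patches
```

With scales (1, 2, 4) this yields 1 + 9 + 49 = 59 patches; the point, as in the paper, is that a few dozen multi-scale overlapping windows can cover objects of varying size and position without the cost of hundreds or thousands of patches.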

The main contributions of this paper include four aspects. (1) To accelerate the image representation process based on SURF descriptors, SURF binarization and dimensionality reduction are proposed to reduce 64-dimensional SURF descriptors to 8-dimensional descriptors. (2) A two-step clustering algorithm is proposed to accelerate computation and decrease memory usage when clustering large-scale, high-dimensional samples (e.g., the construction of a large vocabulary or codebook). (3) For the task of object search, scalable overlapping partition is proposed to segment an image into only 65 patches of different sizes to greatly reduce the computational cost of matching. (4) Since feature information is lost during SURF binarization and dimensionality reduction, a feature fusion strategy is utilized to improve the retrieval accuracy.
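As an illustration of contribution (1), one simple way to binarize a 64-dimensional descriptor and pack it into 8 bytes is sketched below. The thresholding rule (a fixed cutoff of zero) is an assumption for illustration only; the paper's BDR method is specified in Section 3:

```python
import numpy as np

def binarize_reduce(desc64):
    """Threshold each of the 64 dimensions at an illustrative cutoff (zero
    here) and pack every 8 bits into one byte, yielding an 8-dimensional
    uint8 descriptor -- a 32x storage saving over 64 float32 values."""
    bits = (np.asarray(desc64) > 0).astype(np.uint8)
    return np.packbits(bits)  # 64 bits -> 8 bytes
```

Besides the storage saving, binary descriptors let nearest-neighbor search use Hamming distance (XOR plus popcount) instead of floating-point Euclidean distance, which is where much of the quantization speedup comes from.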

The remainder of this paper is organized as follows. In Section 2, related work on content-based image search systems is reviewed. In Section 3, the binary dimensionality reduction and two-step clustering algorithms are presented, the scalable overlapping segmentation algorithm is proposed, the similarity measurement and object localization are introduced, and our image retrieval framework is described. The experimental results are presented in Section 4, and Section 5 concludes the paper.

Related works

Most of the state-of-the-art feature representation techniques are based on BoVW [1], VLAD [2] or FV [10]. For BoVW and VLAD, a codebook or visual vocabulary consisting of visual words is constructed off-line by unsupervised learning algorithms (e.g., K-means, HKM [14] or AKM [7]). Then, the local descriptors are extracted from each image, and each descriptor is quantized into the nearest visual word in the codebook. Finally, the histogram of the local descriptor number assigned to each visual
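The quantize-and-count step described above amounts to a nearest-word histogram; a minimal sketch (illustrative, using brute-force assignment rather than the approximate methods the paper discusses):

```python
import numpy as np

def bovw_histogram(descriptors, codebook):
    """Hard-assign each local descriptor to its nearest visual word and
    return the L1-normalized word-count histogram (the BoVW vector)."""
    dists = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()
```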

Proposed methods

Given a database D = {I_d}, d = 1, …, N, of N images, our goal has two aspects: (1) similar image retrieval, which retrieves all the images that are similar to the query image I_q; (2) object search, which retrieves all the images I_g that contain the query object Q and identifies the object location L_g, where L_g ⊆ I_g is a segmentation or sub-region of I_g.

Experimental results

Based on the VLAD and BoVW representation techniques, respectively, we verify our proposed algorithms for similar image search and object search. Our algorithms are implemented in MATLAB R2015a on a laptop with a 2.5 GHz Intel i5 CPU, 8 GB of memory and a 64-bit Windows operating system. The search results are evaluated according to mean recall, mean precision and mean average precision (mAP) [7]. Two benchmark databases are used:

INRIA Holidays database [4] contains 1491 high-resolution personal
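The mAP measure used throughout the experiments averages, over all queries, the precision at each rank where a relevant image appears; a minimal sketch:

```python
def average_precision(ranked_relevant):
    """AP for one query: ranked_relevant[i] is True iff the result at rank
    i + 1 is relevant. AP is the mean of the precision values at each hit."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevant, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(per_query_rankings):
    """mAP: the mean of AP over all queries."""
    return sum(average_precision(r) for r in per_query_rankings) / len(per_query_rankings)

print(average_precision([True, False, True]))  # (1/1 + 2/3) / 2 = 0.8333...
```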

Conclusion

In this paper, we propose SURF binarization, fast codebook construction and efficient object matching methods. First, BDR is proposed to convert a 64-dimensional SURF descriptor into an 8-dimensional descriptor. The experimental results show that although the BDR algorithm leads to information loss, feature fusion can compensate for the loss, and our binarization method outperforms the mainstream hashing-based methods when the feature fusion strategy is used. Second, when the two-step

Acknowledgements

This work was supported by National Natural Science Foundation of China (61572067, 61602538); International (Regional) Project Cooperation and Exchanges of National Nature Science Foundation of China (61611530710); Beijing Municipal Natural Science Foundation (4162050); The Natural Science Foundation of Guangdong Province (2016A030313708) and the Fundamental Research Funds for the Central Universities (2017JBZ108).

References (54)

  • H. Bay et al., Speeded-up robust features (SURF), Comput. Vis. Image Underst. (2008)
  • Y. Wang et al., Separable vocabulary and features fusion for image retrieval based on sparse representation, Neurocomputing (2017)
  • Z. Liu et al., Fine-residual VLAD for image retrieval, Neurocomputing (2016)
  • J. Sivic, A. Zisserman, Video Google: a text retrieval approach to object matching in videos, in: IEEE Int. Conf....
  • H. Jegou, M. Douze, C. Schmid, P. Perez, Aggregating local descriptors into a compact image representation, in: IEEE...
  • F. Perronnin, Y. Liu, J. Sanchez, H. Poirier, Large-scale image retrieval with compressed fisher vectors, in: 2013 IEEE...
  • H. Jegou, M. Douze, C. Schmid, Hamming embedding and weak geometric consistency for large scale image search, in: Proc....
  • E. Spyromitros-Xioufis et al., A comprehensive study over VLAD and product quantization in large-scale image retrieval, IEEE Trans. Multimedia (2014)
  • H. Jegou et al., Aggregating local image descriptors into compact codes, IEEE Trans. Pattern Anal. Mach. Intell. (2012)
  • J. Philbin, O. Chum, M. Isard, J. Sivic, A. Zisserman, Object retrieval with large vocabularies and fast spatial...
  • Y. Jiang et al., Randomized spatial context for object search, IEEE Trans. Image Process. (2015)
  • Y. Jiang, J. Meng, J. Yuan, Randomized visual phrases for object search, in: IEEE Conf. Comput Vis. Pattern Recognit....
  • F. Perronnin, D. Dance, Fisher kernels on visual vocabularies for image categorization, in: IEEE Conference on Computer...
  • D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vision (2004)
  • E. Spyromitros-Xioufis, S. Papadopoulos, I. Kompatsiaris, G. Tsoumakas, I. Vlahavas, An empirical study on the...
  • D. Nister, H. Stewenius, Scalable recognition with a vocabulary tree, in: IEEE Conf. Comput Vis. Pattern Recognit....
  • S. Zhang, Q. Huang, G. Hua, S. Jiang, W. Gao, Q. Tian, Building contextual visual vocabulary for large-scale image...
  • J. Yuan, Y. Wu, M. Yang, Discovery of collocation patterns: from visual words to visual phrases, in: IEEE Conf. Comput...
  • S. Zhang, Q. Tian, G. Hua, Q. Huang, S. Li, Descriptive visual words and visual phrases for image applications, in:...
  • J. Yuan, Y. Wu, M. Yang, Image retrieval with geometry-preserving visual phrases, in: IEEE Conf. Comput Vis. Pattern...
  • Y. Jiang, J. Meng, J. Yuan, Grid-based local feature bundling for efficient object search and localization, in: IEEE...
  • C.H. Lampert, Detecting objects in large image collections and videos by efficient subimage retrieval, in: IEEE 12th...
  • C.H. Lampert, M.B. Blaschko, T. Hofmann, Beyond sliding windows: object localization by efficient subwindow search, in:...
  • J. Farquhar et al., Improving Bag-of-keypoints Image Categorisation, Technical Report (2005)
  • T. Jaakkola et al., Exploiting generative models in discriminative classifiers, NIPS (1998)
  • X. Peng, Boosting VLAD with supervised dictionary learning and high-order statistics, in: Springer International...
  • Z. Liang et al., Coupled binary embedding for large-scale image retrieval, IEEE Trans. Image Process. (2014)

    This paper has been recommended for acceptance by Zicheng Liu.
