SURF binarization and fast codebook construction for image retrieval☆
Introduction
Despite tremendous improvement in whole-image retrieval techniques [1], [2], [3], [4], [5], [6], visual object search in large-scale image sets remains a challenging problem. Content-based image retrieval denotes searching a large set of images for similar images, defined as images of the same object or similar scenes viewed under different imaging conditions [3], [4], [5]. Object search denotes using a given query object [7] to search a large image corpus, determining the subset of images that include the query object, and locating the object within them [7], [8], [9]. In contrast to similar image retrieval, in object search the query object usually occupies only a small portion of an image and is subject to deformation, 3-D viewpoint and lighting changes, occlusion and background clutter, so its appearance can differ significantly from the query with respect to scale, viewpoint, color and orientation [8], [9]. These characteristics make object search difficult.
The fundamental problem of similar image search and object search based on local descriptors is to determine the true matching features between images. Most state-of-the-art image retrieval and object search systems, such as bag of visual words (BoVW) [1], vector of locally aggregated descriptors (VLAD) [2] and Fisher vectors (FV) [10], are based on local SIFT [11] or SURF [12] descriptors. Both SIFT and SURF descriptors are invariant to scale, rotation and illumination [11], [12]. However, using the VLAD representation and the mean average precision (mAP) [7] metric, Spyromitros-Xioufis et al. [5], [13] showed that SURF outperforms SIFT. Furthermore, SURF can be computed several times faster than SIFT because it uses integral images for image convolution and a fast Hessian matrix-based measure for the detector. Additionally, SURF is regarded as more distinctive, robust and repeatable than SIFT, and the SURF descriptor has only half the dimensionality of the SIFT descriptor, a significant advantage in both memory usage and time cost.
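The speed advantage of integral images mentioned above comes from the fact that the sum over any rectangular region, and hence any box-filter response, reduces to four array lookups. A minimal sketch in Python (function names are ours, for illustration only):

```python
import numpy as np

def integral_image(img):
    """Cumulative sum over rows then columns: ii[y, x] = sum of img[:y+1, :x+1]."""
    return img.cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom+1, left:right+1] in O(1) via four lookups."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total
```

Because each box sum costs a constant four lookups regardless of box size, the Hessian approximation by box filters runs in time independent of the filter scale.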
Image representation by a codebook is a critical and time-consuming task in image retrieval based on local descriptors. In both the BoVW and VLAD models, an image is represented directly through the approximate nearest neighbors of its descriptors in a codebook. However, the number of local descriptors extracted from an image may reach thousands or even tens of thousands, which makes image representation based on local descriptors very slow. The problem has two aspects: slow codebook construction and slow local descriptor quantization. If the dimensionality of the local descriptors can be reduced, image representation is accelerated. In addition, the VLAD model has demonstrated advantages in vector dimensionality reduction, storage and online retrieval speed. Thus, in this paper, we focus on accelerating image representation and improving retrieval accuracy based on the VLAD model.
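The standard VLAD encoding referred to above can be sketched as follows: each descriptor is assigned to its nearest codeword, the residuals are accumulated per codeword, and the concatenation is L2-normalized (a minimal brute-force sketch; practical systems use approximate nearest-neighbor assignment):

```python
import numpy as np

def vlad_encode(descriptors, codebook):
    """Aggregate residuals of each local descriptor w.r.t. its nearest
    codeword, then L2-normalize the concatenated K*D-dimensional vector."""
    K, D = codebook.shape
    # Nearest codeword for every descriptor (brute-force Euclidean search).
    dists = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    assign = dists.argmin(axis=1)
    v = np.zeros((K, D))
    for d, k in zip(descriptors, assign):
        v[k] += d - codebook[k]          # residual accumulation
    v = v.ravel()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```

With K codewords and D-dimensional descriptors the image is thus represented by a single K*D-dimensional vector, independent of how many descriptors the image contains.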
The construction of the codebook is one of the key problems in image representation. Due to the high dimensionality of local descriptors (128-dimensional SIFT [11] descriptors, 64-dimensional SURF [12] descriptors, etc.) and the large size of the descriptor set, the computation is usually intensive and time consuming when commonly used methods, such as K-means, hierarchical K-means (HKM) [14] and approximate K-means (AKM) [7], are applied to cluster a large codebook (usually 20 k or larger for BoVW). The time and space complexity of the K-means, HKM and AKM clustering methods grow with the size of the dataset. To increase the clustering speed and decrease memory usage, we propose two solutions: (1) feature binarization and dimensionality reduction; (2) a two-step clustering algorithm whose cost depends mainly on the number of clusters and only weakly on the number of samples in the dataset. These two improvements significantly increase the clustering speed and reduce the memory usage of codebook construction.
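To make the cost argument concrete, the baseline K-means codebook construction against which the paragraph contrasts can be sketched as plain Lloyd iterations (an illustration, not the paper's two-step algorithm): each iteration costs O(N·K·D), which is why clustering millions of descriptors into a 20 k-word codebook is expensive.

```python
import numpy as np

def kmeans_codebook(X, K, iters=20, seed=0):
    """Plain Lloyd's K-means over N descriptors of dimension D.
    Each iteration costs O(N*K*D) time and O(N*K) memory for the
    distance table, both proportional to the dataset size N."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)].astype(float)
    for _ in range(iters):
        # Assign every sample to its nearest center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        # Recompute each center as the mean of its members.
        for k in range(K):
            members = X[assign == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    return centers
```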
Most previous image search methods treat the features [15] of a whole image as the basic matching unit. However, for object search, only a portion of the image features needs to be matched, and some objects need to be localized. A common idea for object matching is bundling co-occurring visual words within a specified spatial distance into a visual phrase [16], [17], [18]. Thus, many methods based on image patches, which divide an image into hundreds or even thousands of patches, have been proposed [8], [9], [19], [20], [21]. Although these methods enhance the discriminative power of visual words, they are costly in terms of both memory and time. To overcome this defect, a scalable overlapping partition method is proposed in our object retrieval framework, which segments an image into only 65 patches of different sizes so that the object can be matched quickly and efficiently.
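The general idea of a multi-scale overlapping partition can be sketched as sliding windows at several scales whose stride is a fraction of the window size; the scales and overlap below are illustrative assumptions, not the paper's exact 65-patch layout:

```python
def overlapping_patches(width, height, scales=(1.0, 0.5, 0.25), overlap=0.5):
    """Generate multi-scale patches (x, y, w, h). The stride is a fraction
    of the patch size, so adjacent patches overlap; a small query object is
    likely to fall mostly inside at least one patch at some scale."""
    patches = []
    for s in scales:
        w, h = max(1, int(width * s)), max(1, int(height * s))
        step_x = max(1, int(w * (1 - overlap)))
        step_y = max(1, int(h * (1 - overlap)))
        for y in range(0, height - h + 1, step_y):
            for x in range(0, width - w + 1, step_x):
                patches.append((x, y, w, h))
    return patches
```

With only a few scales the patch count stays in the tens rather than the hundreds or thousands of the patch-based methods cited above, which is the source of the memory and time savings.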
The main contributions of this paper include four aspects. (1) To accelerate the image representation process based on the SURF descriptors, SURF binarization and dimensionality reduction are proposed to reduce the 64-dimensional SURF descriptors into 8-dimensional descriptors. (2) A two-step clustering algorithm is proposed to accelerate the computational speed and decrease the memory usage for clustering large-scale and high-dimensional samples (e.g., the construction of a large vocabulary or codebook). (3) For the task of object search, scalable overlapping partition is proposed to segment an image into only 65 patches of different sizes to greatly reduce the computational cost of matching. (4) Since the information feature will be lost after SURF binarization and dimensionality reduction, a feature fusion strategy is utilized to improve the retrieval accuracy.
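As a rough illustration of how 64 dimensions can be compressed to 8 (the paper's exact BDR scheme is given in Section 3), one can threshold each SURF component to a single bit and pack 8 bits per byte, after which descriptors compare via fast Hamming distance; the sign threshold here is our assumption, not necessarily the paper's rule:

```python
import numpy as np

def binarize_pack(desc64, threshold=0.0):
    """Threshold each of the 64 SURF components to one bit, then pack the
    64 bits into 8 uint8 values. Thresholding at zero is an illustrative
    choice; the paper's BDR scheme may differ."""
    bits = (np.asarray(desc64) > threshold).astype(np.uint8)
    return np.packbits(bits)          # shape (8,): 64 bits -> 8 bytes

def hamming(a, b):
    """Hamming distance between two packed binary descriptors."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())
```

This reduces storage from 64 floats to 8 bytes per descriptor; the accompanying information loss is what the feature fusion strategy in contribution (4) is meant to compensate for.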
The remainder of this paper is organized as follows. In Section 2, related works of content-based image search systems are reviewed. In Section 3, the binary dimensionality reduction and two-step clustering algorithm are presented. Then, the scalable overlapping segmentation algorithm is proposed. The similarity measurement and object localization are also introduced. Finally, our image retrieval framework is described. The experimental results are presented in Section 4, and Section 5 concludes the paper.
Related works
Most of the state-of-the-art feature representation techniques are based on BoVW [1], VLAD [2] or FV [10]. For BoVW and VLAD, a codebook or visual vocabulary consisting of visual words is constructed off-line by unsupervised learning algorithms (e.g., K-means, HKM [14] or AKM [7]). Then, the local descriptors are extracted from each image, and each descriptor is quantized into the nearest visual word in the codebook. Finally, the histogram of the number of local descriptors assigned to each visual word is used to represent the image.
Proposed methods
Given a database of N images, our goal includes two aspects: (1) similar image retrieval, which retrieves all the images that are similar to the query image Q; (2) object search, which retrieves all the images that contain the query object Q and identifies the object location L, where L is a segmentation or sub-region of the retrieved image.
Experimental results
Based on the VLAD and BoVW representation techniques, respectively, we verify our proposed algorithms for similar image search and object search. Our algorithms are implemented in MATLAB R2015a on a laptop with a 2.5 GHz Intel i5 CPU, 8 GB of memory and a 64-bit Windows operating system. The search results are evaluated by mean recall, mean precision and mean average precision (mAP) [7]. Two benchmark databases are used:
INRIA Holidays database [4] contains 1491 high-resolution personal holiday photos.
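The mAP metric used for evaluation can be sketched in Python as follows (a minimal illustration; AP is averaged here over the relevant hits in the ranked list, which matches the usual definition when the list covers all relevant images):

```python
def average_precision(relevance):
    """AP for one ranked result list: mean of precision@k over the ranks k
    where a relevant image appears. `relevance` is a 0/1 list in rank order."""
    hits, precisions = 0, []
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(runs):
    """mAP: the mean of AP over all queries."""
    return sum(average_precision(r) for r in runs) / len(runs)
```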
Conclusion
In this paper, we propose SURF binarization, fast codebook construction and efficient object matching methods. First, BDR is proposed to convert a 64-dimensional SURF descriptor into an 8-dimensional descriptor. The experimental results show that although the BDR algorithm leads to information loss, feature fusion can compensate for the loss, and our binarization method outperforms the mainstream hashing-based methods when the feature fusion strategy is used. Second, the two-step clustering algorithm accelerates codebook construction and reduces memory usage when clustering large-scale, high-dimensional descriptor sets.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (61572067, 61602538); the International (Regional) Project Cooperation and Exchanges program of the National Natural Science Foundation of China (61611530710); the Beijing Municipal Natural Science Foundation (4162050); the Natural Science Foundation of Guangdong Province (2016A030313708); and the Fundamental Research Funds for the Central Universities (2017JBZ108).
References (54)
- et al., Speeded-up robust features (SURF), Comput. Vis. Image Underst., 2008.
- et al., Separable vocabulary and features fusion for image retrieval based on sparse representation, Neurocomputing, 2017.
- et al., Fine-residual VLAD for image retrieval, Neurocomputing, 2016.
- J. Sivic, A. Zisserman, Video Google: a text retrieval approach to object matching in videos, in: IEEE Int. Conf. ...
- H. Jegou, M. Douze, C. Schmid, P. Perez, Aggregating local descriptors into a compact image representation, in: IEEE ...
- F. Perronnin, Y. Liu, J. Sanchez, H. Poirier, Large-scale image retrieval with compressed Fisher vectors, in: 2013 IEEE ...
- H. Jegou, M. Douze, C. Schmid, Hamming embedding and weak geometric consistency for large scale image search, in: Proc. ...
- et al., A comprehensive study over VLAD and product quantization in large-scale image retrieval, IEEE Trans. Multimedia, 2014.
- et al., Aggregating local image descriptors into compact codes, IEEE Trans. Pattern Anal. Mach. Intell., 2012.
- J. Philbin, O. Chum, M. Isard, J. Sivic, A. Zisserman, Object retrieval with large vocabularies and fast spatial ...
- Randomized spatial context for object search, IEEE Trans. Image Process.
- Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vision.
- Improving bag-of-keypoints image categorisation, Technical Report.
- Exploiting generative models in discriminative classifiers, NIPS.
- Coupled binary embedding for large-scale image retrieval, IEEE Trans. Image Process.
☆ This paper has been recommended for acceptance by Zicheng Liu.