Large-scale instance-level image retrieval
Introduction
Full-text search engines on the Web have achieved great efficiency thanks to inverted index technology (Arroyuelo, Oyarzún, González, & Sepulveda, 2018; Lashkari, Bagheri, & Ghorbani, 2019; Zobel & Moffat, 2006). In recent years, the research community has shown increasing interest in retrieving other forms of expression, such as images (Novak, Batko, & Zezula, 2012; Pandey, Khanna, & Yokota, 2016); nevertheless, development in those cases has not been as rapid as for text-based paradigms. In the field of image retrieval, this was initially due in part to the ineffectiveness of the hand-crafted features used by instance-level and content-based image retrieval. Since 2014, however, there has been great progress in learned features obtained by training neural networks, in particular convolutional neural networks (CNNs). Unlike text, where inverted indexes perfectly marry the sparse document representations of standard vector models, learned image descriptors tend to be dense and compact, which makes the direct use of mature text-tailored index technologies unfeasible. While efficient index structures for this type of data exist (Johnson, Douze, & Jégou, 2017; Liu, Wei, Zhao, & Yang, 2018; Mohedano, McGuinness, O'Connor, Salvador, Marques, & Giro-i Nieto, 2016), they usually come with caveats that prevent their use in very large-scale scenarios, such as main-memory-only implementations and computationally expensive indexing or codebook-learning phases.
The aim of this article is to explore new approaches to make image retrieval as similar as possible to text retrieval so as to reuse the technologies and platforms exploited today for text retrieval without the need for dedicated access methods. In a nutshell, the idea is to use image representations extracted from a CNN, often referred to as Deep Features, and to transform them into text so that they can be indexed with a standard text search engine.
The application focus of this work is image retrieval in a large-scale context, with an eye to scalability. This aspect is often overlooked in the literature: most image retrieval systems are designed to work in main memory, and many of them cannot be distributed across a cluster of nodes (Navarro & Reyes, 2016). Many techniques in the literature try to tackle this problem by heavily compressing the representation of visual features in order to fit ever more data into memory. However, these indexing approaches cannot truly scale: sooner or later, response times become unacceptable as the size of the data to be managed increases.
In particular, our general approach is based on transforming deep features, which are dense vectors of real numbers, into sparse vectors of integer numbers. The transformation into integers is needed to obtain textual representations of the vectors, as explained in more detail below: the vector components are translated into the "term frequencies" of synthetic codewords in these textual representations. Sparseness is necessary to achieve sufficient efficiency, exactly as it is for text search engines. To obtain this two-fold result, we analyze two approaches: one based on permutations and one based on scalar quantization.
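To make the dense-to-sparse idea concrete, the following is a minimal sketch (our own illustration, not the exact implementation) of the scalar-quantization route: each component of a real-valued feature vector is floor-quantized to an integer, and component i with quantized value q contributes a synthetic codeword "f<i>" repeated q times. Components that quantize to zero are simply omitted, which produces the sparsity a text search engine relies on. The codeword naming and the `scale` parameter are illustrative assumptions.

```python
import math

def to_surrogate_text(features, scale=30.0):
    """Map a feature vector to a surrogate text document.

    Negative components are clipped to zero; each component i with
    integer quantized value tf emits the synthetic term 'f<i>' tf times,
    so the text-engine term frequency encodes the component magnitude.
    """
    terms = []
    for i, x in enumerate(features):
        tf = math.floor(max(x, 0.0) * scale)  # integer "term frequency"
        terms.extend([f"f{i}"] * tf)          # zero components are omitted
    return " ".join(terms)

# A vector with a zero component yields a sparse document:
print(to_surrogate_text([0.0, 0.1, 0.5], scale=10))
```

The resulting string can be fed verbatim to any off-the-shelf full-text indexer, which will store it as a sparse posting list with the quantized values as term frequencies.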
The present paper is the evolution of previous works (Amato, Bolettieri, Carrara, Falchi, & Gennaro, 2018; Amato, Carrara, Falchi, & Gennaro, 2017; Amato, Falchi, Gennaro, & Vadicamo, 2016; Amato, Gennaro, & Savino, 2014; Gennaro, Amato, Bolettieri, & Savino, 2010). Amato, Gennaro et al. (2014) presented the idea of representing metric objects as permutations of reference objects in order to build an inverted index that supports approximate nearest neighbor queries. Gennaro et al. (2010) extended this method by transforming permutations into surrogate text representations, which makes it possible to exploit a standard text search engine without implementing the inverted index. Amato, Falchi et al. (2016) introduced the idea of Deep Permutations, which applies directly to deep feature vectors by permuting the components of the vectors themselves. Amato et al. (2017) and Amato et al. (2018) presented extensions of the Deep Permutations technique, the former using the surrogate text representation with R-MAC features, and the latter also taking into account the negative components of R-MAC. In Amato et al. (2018), we also proved that this general approach can be implemented on top of Elasticsearch, showing that such a retrieval system is able to scale to multiple nodes. In an earlier attempt (Amato, Debole, Falchi, Gennaro, & Rabitti, 2016), we presented a preliminary draft of the quantization approach on deep features extracted from the Hybrid CNN,1 which is less effective but has the advantage of being partly sparse.
The original contribution of the present work is a new surrogate representation for deep features based on Scalar Quantization. We present this approach in a unified framework for representing deep features as surrogate text, together with the technique based on Deep Permutations, and we compare the two. We have also extended the experimental evaluation by adding two more benchmarks and, regarding efficiency, by considering the size of the indexes as well as their percentage of use.
The rest of the paper is organized as follows: Section 2 surveys the relevant related work. Section 3 provides a brief background on Deep Features. Section 4 presents the main contribution of this paper, namely the Surrogate Text Representation. Section 5 shows experimental results, and finally Section 6 gives concluding remarks. Table 1 summarizes the notation used throughout this manuscript.
Related work
To frame our work in the context of the scientific literature, we refer to the survey of Zheng, Yang, and Tian (2018), which organizes the literature according to codebook size, i.e. large, medium-sized, or small codebooks. Although the authors relate this organization to local features (which they call "SIFT-based"), we think it can be extended to deep features and representation-focused neural models in general (Nakamura, Calais, de Castro Reis, & Lemos,
Deep features
Recently, a new class of image descriptors built upon Convolutional Neural Networks has been used as an effective alternative to descriptors built using local features such as SIFT, ORB, and BRIEF. CNNs have attracted enormous interest within the Computer Vision community because of the state-of-the-art results achieved in challenging image classification tasks such as the ImageNet Large Scale Visual Recognition Challenge (http://www.image-net.org). In computer vision, CNNs have been used
Surrogate text representation
As we explained in the introduction, we aim to index and search a data set of feature vectors by exploiting off-the-shelf text search engines. So our main goal is to define a family of transformations that map a feature vector into a textual representation without the need for tedious training procedures. Of course, we also require that such transformations preserve as much as possible the proximity relations between the data, i.e. similar feature vectors are mapped to similar textual documents.
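As a hedged sketch of the second, permutation-based transformation family (our own illustration of the "Deep Permutations" idea, not the paper's exact code): the indices of the k largest components of the feature vector act as codewords, and each codeword is repeated more times the higher its component ranks, so that the term frequencies in the surrogate text encode the truncated permutation. The codeword naming and truncation depth `k` are illustrative assumptions.

```python
def deep_permutation_text(features, k=3):
    """Encode the top-k component ranking of a vector as surrogate text.

    The index of the largest component is emitted k times, the second
    largest k-1 times, and so on: nearby vectors tend to share top
    components and hence produce overlapping, similarly-weighted terms.
    """
    order = sorted(range(len(features)), key=lambda i: -features[i])[:k]
    terms = []
    for rank, idx in enumerate(order):
        terms.extend([f"c{idx}"] * (k - rank))  # higher rank -> higher TF
    return " ".join(terms)

print(deep_permutation_text([0.1, 0.9, 0.3, 0.7], k=3))  # c1 c1 c1 c3 c3 c2
```

Because only k out of all components survive, the resulting documents are short and sparse regardless of the original vector dimensionality, which is what keeps the inverted index small.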
Experimental evaluation
The aim of this section is to assess the performance of the proposed solution in a content-based retrieval task, in terms of both effectiveness and efficiency. To this end, we evaluate the approximation introduced with respect to the exact similarity search algorithm, and the impact of this approximation on the user's perception of the retrieval results. We extracted the R-MAC features from the images of two different benchmarks: INRIA Holidays and Oxford Buildings. INRIA
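The approximation being measured can be illustrated with a toy sketch (our own illustration, not the paper's evaluation code): the term-frequency dot product that a text engine computes over quantized codewords tracks the inner product of the original feature vectors, so rankings produced by the surrogate index approximate exact similarity rankings. The `scale` parameter and vector values are assumptions for illustration.

```python
import math

def quantize(v, scale=30.0):
    """Floor-quantize a vector's (non-negative) components to integers."""
    return [math.floor(max(x, 0.0) * scale) for x in v]

def tf_score(query, doc, scale=30.0):
    """Text-engine-style score: dot product of quantized term frequencies."""
    q, d = quantize(query, scale), quantize(doc, scale)
    return sum(a * b for a, b in zip(q, d))

v1 = [1.0, 0.0, 0.5]   # query vector
v2 = [0.9, 0.1, 0.4]   # a similar vector
v3 = [0.0, 1.0, 0.0]   # a dissimilar vector

# The similar vector obtains a higher surrogate-text score,
# mirroring the ordering given by the exact inner product:
print(tf_score(v1, v2) > tf_score(v1, v3))
```

The quantization error is what separates the approximate ranking from the exact one; the experiments in this section measure exactly how much that error costs in retrieval quality.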
Conclusions and future works
This paper has proposed a simple and effective methodology to index and retrieve convolutional features without the need for a time-consuming codebook learning step. To get an idea, consider that FAISS takes about three hours to learn the codebook from about a million R-MAC features with the configuration used in our experiments. However, our approach clearly has lower performance, as FAISS is able to generate very compact codes specialized for the set of data to be indexed. This is also
Acknowledgements
The work was partially supported by Smart News, “Social sensing for breaking news”, CUP CIPE D58C15000270008, by VISECH, ARCO-CNR, CUP B56J17001330004, and by Automatic Data and documents Analysis to enhance human-based processes (ADA), CUP CIPE D55F17000290009, and by the AI4EU project, funded by the EC (H2020 - Contract n. 825619). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.
References (46)
- Hybrid compression of inverted lists for reordered document collections. Information Processing & Management (2018).
- Dynamic two-stage image retrieval from large multimedia databases. Information Processing & Management (2013).
- Neural embedding-based indices for semantic search. Information Processing & Management (2019).
- An anatomy for neural search engines. Information Sciences (2019).
- New dynamic metric indices for secondary memory. Information Systems (2016).
- Large-scale similarity data management with distributed metric index. Information Processing & Management (2012).
- A semantics and image retrieval system for hierarchical image databases. Information Processing & Management (2016).
- A signature-based bag of visual words method for image indexing and search. Pattern Recognition Letters (2015).
- VISIONE at VBS2019. International Conference on Multimedia Modeling (2019).
- Large-scale image retrieval with Elasticsearch. Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (2018).
- Using Apache Lucene to search vector of locally aggregated descriptors. Proceedings of the 11th Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, Volume 4: VISAPP.
- Efficient indexing of regional maximum activations of convolutions using full-text search engines. Proceedings of the ACM International Conference on Multimedia Retrieval.
- Large scale indexing and searching deep convolutional neural network features. Proceedings of the International Conference on Big Data Analytics and Knowledge Discovery.
- Deep permutations: Deep convolutional neural networks and permutation-based indexing. Proceedings of the 9th International Conference on Similarity Search and Applications.
- Some theoretical and experimental observations on permutation spaces and similarity search.
- MI-File: Using inverted files for scalable approximate similarity search. Multimedia Tools and Applications.
- NetVLAD: CNN architecture for weakly supervised place recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Neural codes for image retrieval. Proceedings of the 13th European Conference on Computer Vision.
- Effective proximity retrieval by ordering permutations. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Group representations in probability and statistics. Lecture Notes-Monograph Series.
- DeCAF: A deep convolutional activation feature for generic visual recognition. CoRR.
- MiPai: Using the PP-Index to build an efficient and scalable similarity search system. Proceedings of the 2009 Second International Workshop on Similarity Search and Applications.
- An approach to content-based image retrieval based on the Lucene search engine library. Proceedings of the International Conference on Theory and Practice of Digital Libraries.