Large-scale instance-level image retrieval

https://doi.org/10.1016/j.ipm.2019.102100

Highlights

  • We propose a novel approach to tackle large-scale image retrieval on deep-learned image descriptors by transforming the vectorial descriptors into surrogate text encodings.

  • This transformation is based on a scalar quantization approach that is specifically designed to generate text suitable for scalable indexing in secondary memory.

  • Our approach allows us to conveniently reuse mature and scalable full-text search engine technology (e.g. Elasticsearch, Apache Lucene) for retrieving images on a large scale without the need for dedicated structures.

  • We compare our proposal to other works within a unified framework for surrogate text representation transformations.

  • We performed an extensive experimental evaluation to assess the effectiveness and efficiency of our proposal, and compared it to other state-of-the-art vector-tailored indexing approaches.

Abstract

The great success of visual features learned from deep neural networks has led to a significant effort to develop efficient and scalable technologies for image retrieval. Nevertheless, the use of these features in large-scale Web applications of content-based retrieval is still hindered by their high dimensionality. To overcome this issue, some image retrieval systems employ the product quantization method to learn a large-scale visual dictionary from a training set of global neural network features. These approaches are implemented in main memory, preventing their usage in big-data applications. The contribution of this work is mainly devoted to investigating approaches to transform neural network features into text forms suitable for being indexed by a standard full-text retrieval engine such as Elasticsearch. The basic idea of our approaches relies on a transformation of neural network features with the twofold aim of promoting sparsity and avoiding unsupervised pre-training. We validate our approach on a recent convolutional neural network feature, namely Regional Maximum Activations of Convolutions (R-MAC), a state-of-the-art descriptor for image retrieval whose effectiveness has been proved on several instance-level retrieval benchmarks. An extensive experimental evaluation conducted on standard benchmarks shows the effectiveness and efficiency of the proposed approach and how it compares to state-of-the-art main-memory indexes.

Introduction

Full-text search engines on the Web have achieved great results in terms of efficiency thanks to the use of inverted index technology (Arroyuelo, Oyarzún, González, Sepulveda, 2018, Lashkari, Bagheri, Ghorbani, 2019, Zobel, Moffat, 2006). In recent years, the research community has shown increasing interest in the retrieval of other forms of expression, such as images (Novak, Batko, Zezula, 2012, Pandey, Khanna, Yokota, 2016); nevertheless, development in those cases has not been as rapid as for text-based paradigms. In the field of image retrieval, this was initially due in part to the ineffectiveness of the hand-crafted features used for instance-level and content-based image retrieval. However, since 2014 there has been great development of new learned features obtained by training neural networks, in particular convolutional neural networks (CNNs). Unlike text, where inverted indexes marry perfectly with the sparse document representations of standard vector-space models, learned image descriptors tend to be dense and compact, making the direct use of mature text-tailored index technologies unfeasible. While efficient index structures for this type of data exist (Johnson, Douze, Jégou, 2017, Liu, Wei, Zhao, Yang, 2018, Mohedano, McGuinness, O’Connor, Salvador, Marques, Giro-i Nieto, 2016), they usually come with caveats that prevent their usage in very large-scale scenarios, such as main-memory-only implementations and computationally expensive indexing or codebook-learning phases.

The aim of this article is to explore new approaches to make image retrieval as similar as possible to text retrieval so as to reuse the technologies and platforms exploited today for text retrieval without the need for dedicated access methods. In a nutshell, the idea is to use image representations extracted from a CNN, often referred to as Deep Features, and to transform them into text so that they can be indexed with a standard text search engine.

The application focus of this work is on image retrieval in a large-scale context, with an eye to scalability. This aspect is often overlooked in the literature: most image retrieval systems are designed to work in main memory, and many of them cannot be distributed across a cluster of nodes (Navarro & Reyes, 2016). Many techniques in the literature try to tackle this problem by heavily compressing the representation of visual features so that more and more data fits in main memory. However, these indexing approaches are not able to scale because, sooner or later, response times become unacceptable as the size of the data to be managed increases.

In particular, our general approach is based on the transformation of deep features, which are dense vectors of real numbers, into sparse vectors of integer numbers. The transformation into integers is necessary to obtain textual representations of the vectors, as will be explained in more detail below: each integer component is rendered as the “term frequency” of a corresponding term in the textual representation. Sparsity is necessary to achieve sufficient levels of efficiency, exactly as it is for text search engines. To obtain this twofold result, we analyze two approaches: one based on permutations and one based on scalar quantization.

The present paper is the evolution of previous works (Amato, Bolettieri, Carrara, Falchi, Gennaro, 2018, Amato, Carrara, Falchi, Gennaro, 2017, Amato, Falchi, Gennaro, Vadicamo, 2016, Amato, Gennaro, Savino, 2014, Gennaro, Amato, Bolettieri, Savino, 2010). In Amato, Gennaro et al. (2014), the idea of representing metric objects as permutations of reference objects to construct an inverted index supporting approximate nearest neighbor queries was presented. In Gennaro et al. (2010), this method was extended by transforming permutations into surrogate text representations, allowing a standard text search engine to be used without having to implement the inverted index. In Amato, Falchi et al. (2016), the authors introduced the idea of Deep Permutations, which applies to deep feature vectors and in which the components of the vectors themselves are permuted. In Amato et al. (2017) and Amato et al. (2018), extensions of the Deep Permutations technique were presented, the former using the surrogate text representation with R-MAC and the latter also taking into account the negative components of R-MAC. In Amato et al. (2018), we also showed that this general approach can be implemented on top of Elasticsearch and that such a retrieval system is able to scale to multiple nodes. In an earlier attempt (Amato, Debole, Falchi, Gennaro, & Rabitti, 2016), we presented a preliminary version of the quantization approach on deep features extracted from the Hybrid CNN, which is less effective but has the advantage of being partly sparse.
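To make the permutation-based idea concrete, the following is a minimal sketch of a Deep Permutations-style transformation in the spirit of Amato, Falchi et al. (2016). This is our own illustrative code, not the authors' implementation: the codeword scheme `f{i}` and the truncation parameter `k` are assumptions. The indices of the k largest components form a permutation prefix, and each index is emitted with a term frequency that decreases with its rank, so that a text engine's term-frequency scoring approximates a permutation-based similarity.

```python
import numpy as np

def deep_permutation_surrogate(v, k=4):
    """Sketch of a Deep Permutations surrogate text (illustrative only).

    Component indices are sorted by decreasing activation value; the
    i-th ranked index (i = 0..k-1) is repeated (k - i) times, so that
    higher-ranked components receive a larger term frequency.
    """
    top = np.argsort(-v)[:k]          # indices of the k largest components
    terms = []
    for rank, idx in enumerate(top):
        terms.extend([f"f{idx}"] * (k - rank))
    return " ".join(terms)

# Toy 8-dimensional "deep feature"
v = np.array([0.31, 0.0, 0.02, 0.75, 0.0, 0.12, 0.0, 0.44])
print(deep_permutation_surrogate(v, k=3))
# -> f3 f3 f3 f7 f7 f0
```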

The original contribution of the present work consists in introducing a new surrogate text representation approach for deep features based on scalar quantization. We present this approach in a unified framework for representing deep features as surrogate text, together with the technique based on Deep Permutations, and we compare the two techniques. We have also extended the experimental evaluation by adding two more benchmarks and, regarding efficiency, we also considered the size of the indexes as well as the percentage of the index that is actually used.
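As a concrete illustration of the scalar quantization approach, here is a minimal sketch (our own code: the scaling factor and the `f{i}` codeword naming are illustrative assumptions, and the treatment of negative components shown in the comment is one possible choice, not necessarily the paper's). Each component is quantized to a small non-negative integer, which becomes the term frequency of a synthetic codeword identifying that component, yielding a sparse textual document.

```python
import numpy as np

def scalar_quantization_surrogate(v, scale=10):
    """Sketch: map a dense float vector to a surrogate text document.

    Each component v[i] is quantized to q = floor(scale * v[i]); if q >= 1,
    the synthetic codeword "f{i}" is repeated q times, so a full-text engine
    records q as the term frequency of that codeword. Components quantized
    to zero produce no terms, which yields the desired sparsity. Negative
    components are dropped here; one could instead emit a distinct codeword
    for them (e.g. "f{i}n"), an illustrative choice of ours.
    """
    terms = []
    for i, x in enumerate(v):
        q = int(np.floor(scale * x))
        if q >= 1:
            terms.extend([f"f{i}"] * q)
    return " ".join(terms)

# Toy 8-dimensional "deep feature"
v = np.array([0.31, 0.0, 0.02, 0.75, 0.0, 0.12, 0.0, 0.44])
print(scalar_quantization_surrogate(v))
# -> f0 f0 f0 f3 f3 f3 f3 f3 f3 f3 f5 f7 f7 f7 f7
```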

The rest of the paper is organized as follows: Section 2 surveys the relevant related work. Section 3 provides a brief background on deep features. In Section 4, the main contribution of this paper, namely the surrogate text representation, is presented. Section 5 presents the experimental results, and finally Section 6 gives concluding remarks. Table 1 summarizes the notation used throughout this manuscript.

Section snippets

Related work

To frame our work in the context of the scientific literature, we refer to the survey of Zheng, Yang, and Tian (2018), which organizes the literature according to codebook size, i.e. large/medium-sized/small codebooks. Although this organization, according to the authors, applies to local features (which they define as “SIFT-based”), we think it can be extended to deep features and representation-focused neural models in general (Nakamura, Calais, de Castro Reis, & Lemos,

Deep features

Recently, a new class of image descriptors built upon convolutional neural networks (CNNs) has been used as an effective alternative to descriptors built from local features such as SIFT, ORB, and BRIEF. CNNs have attracted enormous interest within the computer vision community because of the state-of-the-art results achieved in challenging image classification tasks such as the ImageNet Large Scale Visual Recognition Challenge (http://www.image-net.org). In computer vision, CNNs have been used
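Since R-MAC (the descriptor used throughout the paper, as stated in the abstract) may be unfamiliar, the following is a rough, simplified sketch of how an R-MAC-style descriptor is aggregated from a convolutional feature map. This is our own code under simplifying assumptions: the real R-MAC samples overlapping square regions at multiple scales and applies PCA-whitening to the regional vectors, both of which are omitted here.

```python
import numpy as np

def rmac(feature_map, levels=(1, 2)):
    """Simplified R-MAC sketch (non-overlapping grid, no PCA-whitening).

    For each scale l, the feature map is split into an l x l grid of
    regions; each region is max-pooled per channel (the "maximum
    activations of convolutions"), l2-normalized, summed over regions,
    and the aggregate is l2-normalized again.

    feature_map: array of shape (C, H, W) from the last conv layer.
    """
    C, H, W = feature_map.shape
    regions = []
    for l in levels:
        hs, ws = H // l, W // l
        for i in range(l):
            for j in range(l):
                r = feature_map[:, i*hs:(i+1)*hs, j*ws:(j+1)*ws]
                v = r.max(axis=(1, 2))          # per-channel max activation
                v /= np.linalg.norm(v) + 1e-8   # l2-normalize the region
                regions.append(v)
    agg = np.sum(regions, axis=0)               # aggregate regional vectors
    return agg / (np.linalg.norm(agg) + 1e-8)

fmap = np.random.rand(512, 14, 14).astype(np.float32)  # e.g. a VGG-like conv5 map
print(rmac(fmap).shape)  # (512,)
```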

Surrogate text representation

As we explained in the introduction, we aim to index and search a data set of feature vectors by exploiting off-the-shelf text search engines. So our main goal is to define a family of transformations that map a feature vector into a textual representation without the need for tedious training procedures. Of course, we also require that such transformations preserve as much as possible the proximity relations between the data, i.e. similar feature vectors are mapped to similar textual documents.
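To illustrate how surrogate texts plug into an off-the-shelf engine, here is a minimal sketch using the official Python client for Elasticsearch. The index name, field name, and documents are invented for the example, and the engine's default BM25 scoring is used as-is; the paper's actual configuration may tune the similarity differently.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local instance

# Index the surrogate text of each image; the engine's inverted index
# stores, for every codeword, the images containing it together with
# the corresponding term frequency.
docs = {
    "img001": "f0 f0 f0 f3 f3 f3 f3 f3 f3 f3 f5 f7 f7 f7 f7",
    "img002": "f1 f1 f3 f3 f3 f7",
}
for doc_id, text in docs.items():
    es.index(index="images", id=doc_id, document={"surrogate": text})
es.indices.refresh(index="images")

# A query image is transformed in the same way and searched as free text;
# the term-frequency-based scoring approximates the similarity of the
# original feature vectors.
query_text = "f3 f3 f3 f7 f7 f0"
res = es.search(index="images", query={"match": {"surrogate": query_text}})
for hit in res["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```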

Experimental evaluation

The aim of this section is to assess the performance of the proposed solution in a content-based retrieval task, both in terms of effectiveness and efficiency. To this end, we evaluate the approximation introduced with respect to an exact similarity search algorithm, along with the impact of this approximation on the user’s perception of the retrieval task. We extracted the R-MAC features from the images of two different benchmarks: INRIA Holidays and Oxford Buildings. INRIA

Conclusions and future works

This paper has proposed a simple and effective methodology to index and retrieve convolutional features without the need for a time-consuming codebook learning step. To get an idea, consider that FAISS takes about three hours to learn the codebook from about a million R-MAC features with the configuration used in the experiments. However, our approach clearly has lower performance, as FAISS is able to generate very compact codes specialized for the set of data to be indexed. This is also
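For context on the codebook-learning cost mentioned above, this is a minimal sketch of training an IVF-PQ index with FAISS on synthetic data; the dataset sizes and index parameters are illustrative, not the configuration used in the paper's experiments. The `train` call is the codebook-learning step whose cost the comparison refers to.

```python
import numpy as np
import faiss

d = 512                                               # e.g. R-MAC dimensionality
xt = np.random.rand(20_000, d).astype("float32")      # training vectors
xb = np.random.rand(100_000, d).astype("float32")     # database vectors
xq = np.random.rand(10, d).astype("float32")          # query vectors

quantizer = faiss.IndexFlatL2(d)                      # coarse quantizer
# nlist=256 inverted lists, 16 subquantizers of 8 bits each
index = faiss.IndexIVFPQ(quantizer, d, 256, 16, 8)

index.train(xt)    # codebook learning: the expensive step discussed above
index.add(xb)
D, I = index.search(xq, 10)
print(I[0])        # ids of the 10 nearest neighbors of the first query
```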

Acknowledgements

The work was partially supported by Smart News, “Social sensing for breaking news”, CUP CIPE D58C15000270008, by VISECH, ARCO-CNR, CUP B56J17001330004, and by Automatic Data and documents Analysis to enhance human-based processes (ADA), CUP CIPE D55F17000290009, and by the AI4EU project, funded by the EC (H2020 - Contract n. 825619). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.

References (46)

  • G. Amato et al.

    Using Apache Lucene to search vector of locally aggregated descriptors

    Proceedings of the 11th joint conference on computer vision, imaging and computer graphics theory and applications - Volume 4: VISAPP

    (2016)
  • G. Amato et al.

    Efficient indexing of regional maximum activations of convolutions using full-text search engines

    Proceedings of the ACM international conference on multimedia retrieval

    (2017)
  • G. Amato et al.

    Large scale indexing and searching deep convolutional neural network features

    Proceedings of the international conference on big data analytics and knowledge discovery

    (2016)
  • G. Amato et al.

    Deep permutations: Deep convolutional neural networks and permutation-based indexing

    Proceedings of the 9th international conference on similarity search and applications

    (2016)
  • G. Amato et al.

    Some theoretical and experimental observations on permutation spaces and similarity search

    (2014)
  • G. Amato et al.

    MI-File: Using inverted files for scalable approximate similarity search

    Multimedia Tools and Applications

    (2014)
  • R. Arandjelović et al.

    NetVLAD: CNN architecture for weakly supervised place recognition

    Proceedings of the IEEE conference on computer vision and pattern recognition

    (2016)
  • A. Babenko et al.

    Neural codes for image retrieval

    Proceedings of the 13th European conference on computer vision

    (2014)
  • E. Chavez et al.

    Effective proximity retrieval by ordering permutations

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2008)
  • P. Diaconis

    Group representations in probability and statistics

    Lecture Notes-Monograph Series

    (1988)
  • J. Donahue et al.

    DeCAF: A deep convolutional activation feature for generic visual recognition

    CoRR

    (2013)
  • A. Esuli

    MiPai: Using the PP-Index to build an efficient and scalable similarity search system

    Proceedings of the 2009 second international workshop on similarity search and applications

    (2009)
  • C. Gennaro et al.

    An approach to content-based image retrieval based on the Lucene search engine library

    Proceedings of the international conference on theory and practice of digital libraries

    (2010)