MI-File: using inverted files for scalable approximate similarity search

Amato, Giuseppe; Gennaro, Claudio; Savino, Pasquale

doi:10.1007/s11042-012-1271-1

Published: 06 November 2012

Multimedia Tools and Applications, volume 71, pages 1333–1362 (2014)

Abstract

We propose a new efficient and accurate technique for generic approximate similarity searching, based on the use of inverted files. We represent each object of a dataset by the ordering of a number of reference objects according to their distance from the object itself. To compare two objects in the dataset, we compare the two corresponding orderings of the reference objects. We show that this representation enables us to use inverted files to obtain, very efficiently, a very small set of good candidates for the query result. The candidate set is then reordered using the original similarity function to obtain the approximate similarity search result. The proposed technique performs several orders of magnitude better than exact similarity searches while still guaranteeing high accuracy. To demonstrate the scalability of the proposed approach, tests were executed with various dataset sizes, ranging from 200,000 to 100 million objects.
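
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy data, and in particular the way two orderings are compared (a sum of absolute position differences, often called the Spearman footrule) are assumptions made for illustration, since the abstract does not specify them.

```python
import math
from collections import defaultdict

def ordering(obj, refs, dist):
    """Reference-object ids sorted by distance from obj, closest first."""
    return sorted(range(len(refs)), key=lambda i: dist(obj, refs[i]))

def build_index(data, refs, dist):
    """Inverted file: reference id -> list of (object id, position of that
    reference in the object's ordering)."""
    index = defaultdict(list)
    for oid, obj in enumerate(data):
        for pos, ref in enumerate(ordering(obj, refs, dist)):
            index[ref].append((oid, pos))
    return index

def search(query, data, refs, index, dist, k, n_candidates):
    """Approximate k-nearest-neighbour search over the inverted file."""
    # Assumed ordering comparison: sum of absolute position differences.
    score = defaultdict(int)
    for q_pos, ref in enumerate(ordering(query, refs, dist)):
        for oid, pos in index[ref]:
            score[oid] += abs(q_pos - pos)
    # Keep a small candidate set whose orderings are closest to the query's.
    candidates = sorted(score, key=score.get)[:n_candidates]
    # Reorder the candidates with the original distance function.
    candidates.sort(key=lambda oid: dist(query, data[oid]))
    return candidates[:k]

# Toy example: 2-D points under Euclidean distance (hypothetical data).
data = [(0, 0), (1, 0), (5, 5), (6, 5), (10, 0)]
refs = [(0, 0), (5, 5), (10, 0)]
index = build_index(data, refs, math.dist)
print(search((0.9, 0.1), data, refs, index, math.dist, k=2, n_candidates=3))
# prints [1, 0]
```

Only the `n_candidates` objects that survive the ordering comparison are passed to the original distance function, which is where the saving over an exact scan would come from in this sketch.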

Notes

This can be obtained by maintaining the entries of the posting list ordered according to the position and by using a small index for each posting list that gives the offset corresponding to the possible positions (see Section 5.3).
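
One possible reading of this note, sketched in Python: each posting list stores its entries sorted by position, together with a small offset table that gives the index of the first entry for each possible position. The layout, the function names, and the assumption that the goal is to read only the entries within a range of positions are all hypothetical; Section 5.3 of the article is not reproduced here.

```python
def build_posting_list(entries, n_positions):
    """entries: iterable of (object_id, position) with 0 <= position < n_positions.
    Returns (postings, offsets): postings is sorted by position, and
    offsets[p] is the index of the first entry whose position is >= p."""
    postings = sorted(entries, key=lambda e: e[1])
    offsets = [0] * (n_positions + 1)
    for _, pos in postings:
        offsets[pos + 1] += 1          # count entries per position
    for p in range(n_positions):
        offsets[p + 1] += offsets[p]   # prefix sums turn counts into offsets
    return postings, offsets

def entries_in_range(postings, offsets, lo, hi):
    """Entries whose position lies in [lo, hi], found without scanning the list."""
    lo = max(lo, 0)
    hi = min(hi, len(offsets) - 2)
    if lo > hi:
        return []
    return postings[offsets[lo]:offsets[hi + 1]]

# Hypothetical posting list with three possible positions.
postings, offsets = build_posting_list(
    [(10, 2), (11, 0), (12, 1), (13, 2), (14, 0)], n_positions=3)
print(offsets)                                   # prints [0, 2, 3, 5]
print(entries_in_range(postings, offsets, 1, 2))
```

The offset table has one slot per possible position plus one, so it stays small relative to the posting list itself, and a range lookup reduces to a single slice.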

Author information

Authors and Affiliations

ISTI-CNR, Via G. Moruzzi, 1, 56124, Pisa, Italy
Giuseppe Amato, Claudio Gennaro & Pasquale Savino

Corresponding author

Correspondence to Giuseppe Amato.

About this article

Cite this article

Amato, G., Gennaro, C. & Savino, P. MI-File: using inverted files for scalable approximate similarity search. Multimed Tools Appl 71, 1333–1362 (2014). https://doi.org/10.1007/s11042-012-1271-1

Issue Date: August 2014
DOI: https://doi.org/10.1007/s11042-012-1271-1
