
MI-File: using inverted files for scalable approximate similarity search

Multimedia Tools and Applications

Abstract

We propose a new, efficient, and accurate technique for generic approximate similarity searching, based on the use of inverted files. We represent each object of a dataset by the ordering of a number of reference objects according to their distance from the object itself. To compare two objects in the dataset, we compare the two corresponding orderings of the reference objects. We show that this representation enables us to use inverted files to obtain, very efficiently, a very small set of good candidates for the query result. The candidate set is then reordered using the original similarity function to obtain the approximate similarity search result. The proposed technique performs several orders of magnitude better than exact similarity searches, while still guaranteeing high accuracy. To demonstrate the scalability of the proposed approach as well, tests were executed with various dataset sizes, ranging from 200,000 to 100 million objects.
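
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the Euclidean distance, the toy data, and the use of summed position differences to compare orderings are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
from collections import defaultdict


def ordering(obj, refs, dist):
    """Return reference-object indices sorted by increasing distance from obj."""
    return sorted(range(len(refs)), key=lambda r: dist(obj, refs[r]))


def build_inverted_file(data, refs, dist):
    """Map each reference index to a posting list of (object_id, position)."""
    postings = defaultdict(list)
    for obj_id, obj in enumerate(data):
        for pos, r in enumerate(ordering(obj, refs, dist)):
            postings[r].append((obj_id, pos))
    return postings


def search(query, data, refs, postings, dist, k=1, n_candidates=3):
    """Approximate k-nearest-neighbour search (illustrative only)."""
    # Step 1: compare orderings by summing, per reference object, the
    # difference between its position for the query and for each object.
    # This particular comparison measure is an assumption of the sketch.
    scores = defaultdict(int)
    for q_pos, r in enumerate(ordering(query, refs, dist)):
        for obj_id, pos in postings[r]:
            scores[obj_id] += abs(q_pos - pos)
    # Step 2: keep a small candidate set with the most similar orderings.
    candidates = sorted(scores, key=scores.get)[:n_candidates]
    # Step 3: reorder the candidates with the original distance function.
    candidates.sort(key=lambda i: dist(query, data[i]))
    return candidates[:k]


def euclidean(a, b):
    return math.dist(a, b)


# Toy dataset and reference objects (hypothetical values).
data = [(0.0, 0.0), (1.0, 1.0), (4.0, 6.0), (9.0, 9.0), (10.0, 0.0)]
refs = [(0.0, 1.0), (6.0, 6.0), (10.0, 1.0)]

postings = build_inverted_file(data, refs, euclidean)
result = search((0.9, 1.2), data, refs, postings, euclidean, k=1)
print(result)
```

In this sketch the original distance function is evaluated only on the `n_candidates` objects that survive the ordering comparison, which is where the savings over an exact scan would come from.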



Notes

  1. This can be obtained by maintaining the entries of the posting list ordered according to the position and by using a small index for each posting list that gives the offset corresponding to the possible positions (see Section 5.3).
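
The layout mentioned in this note can be sketched as follows, assuming the goal is to read only the entries of a posting list whose position falls within a given range. The function names and the in-memory list representation are hypothetical; the note itself only states that entries are kept ordered by position and that a small per-list index gives the offset for each possible position.

```python
def build_posting_list(entries, n_positions):
    """Sort (position, object_id) entries by position and build an offset index.

    offsets[p] is the index of the first entry whose position is >= p,
    and offsets[n_positions] equals the length of the posting list.
    Positions are assumed to lie in 0 .. n_positions - 1.
    """
    plist = sorted(entries)
    offsets = [0] * (n_positions + 1)
    for pos, _ in plist:
        offsets[pos + 1] += 1          # count the entries at each position
    for p in range(n_positions):
        offsets[p + 1] += offsets[p]   # prefix sums turn counts into offsets
    return plist, offsets


def entries_in_range(plist, offsets, lo, hi):
    """Return the entries with lo <= position <= hi without scanning the list."""
    lo = max(lo, 0)
    hi = min(hi, len(offsets) - 2)
    if lo > hi:
        return []
    return plist[offsets[lo]:offsets[hi + 1]]


# Hypothetical posting list for one reference object.
plist, offsets = build_posting_list(
    [(2, "c"), (0, "a"), (1, "b"), (2, "d"), (4, "e")], n_positions=5
)
print(entries_in_range(plist, offsets, 1, 2))
```

The offset index has one slot per possible position plus one, so it stays small relative to the posting list itself.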


Author information

Corresponding author

Correspondence to Giuseppe Amato.


About this article

Cite this article

Amato, G., Gennaro, C. & Savino, P. MI-File: using inverted files for scalable approximate similarity search. Multimed Tools Appl 71, 1333–1362 (2014). https://doi.org/10.1007/s11042-012-1271-1


  • DOI: https://doi.org/10.1007/s11042-012-1271-1
