Abstract
We propose a new, efficient, and accurate technique for generic approximate similarity searching, based on the use of inverted files. We represent each object of a dataset by the ordering of a number of reference objects according to their distance from the object itself. To compare two objects in the dataset, we compare the two corresponding orderings of the reference objects. We show that this representation enables us to use inverted files to efficiently obtain a small set of good candidates for the query result. The candidate set is then reordered using the original similarity function to obtain the approximate similarity search result. The proposed technique performs several orders of magnitude better than exact similarity search while still guaranteeing high accuracy. To demonstrate the scalability of the proposed approach, tests were executed with various dataset sizes, ranging from 200,000 to 100 million objects.
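The pipeline described above (order the reference objects by distance, index those orderings in an inverted file, select candidates by comparing orderings, then reorder the candidates with the original distance) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the 2D toy data, and the choice of the Spearman footrule to compare orderings are assumptions made here for illustration, since the abstract does not name the measure. The sketch also indexes full orderings for simplicity.

```python
import math
import random
from collections import defaultdict


def permutation(obj, refs, dist):
    """Reference-object ids sorted by increasing distance from obj."""
    return sorted(range(len(refs)), key=lambda r: dist(obj, refs[r]))


def build_index(data, refs, dist):
    """Inverted file: reference id -> list of (object id, position of that reference)."""
    index = defaultdict(list)
    for oid, obj in enumerate(data):
        for pos, r in enumerate(permutation(obj, refs, dist)):
            index[r].append((oid, pos))
    return index


def search(query, data, refs, index, dist, k, n_candidates):
    """Approximate k-NN: pick candidates by ordering similarity, then rerank."""
    q_perm = permutation(query, refs, dist)
    # Assumed measure: Spearman footrule, i.e. the sum of position differences.
    footrule = defaultdict(int)
    for q_pos, r in enumerate(q_perm):
        for oid, pos in index[r]:
            footrule[oid] += abs(pos - q_pos)
    # Keep the objects whose orderings are closest to the query's ordering.
    candidates = sorted(footrule, key=footrule.get)[:n_candidates]
    # Reorder the candidate set using the original distance function.
    return sorted(candidates, key=lambda oid: dist(query, data[oid]))[:k]


# Hypothetical toy setup: 200 random 2D points, 8 of them used as reference objects.
random.seed(0)
data = [(random.random(), random.random()) for _ in range(200)]
refs = random.sample(data, 8)
index = build_index(data, refs, math.dist)
result = search((0.5, 0.5), data, refs, index, math.dist, k=5, n_candidates=30)
```

Only the `n_candidates` objects selected through the inverted file are compared with the query using the original distance, which is where the saving over an exact scan would come from in this sketch.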
Notes
This can be obtained by maintaining the entries of the posting list ordered according to the position and by using a small index for each posting list that gives the offset corresponding to the possible positions (see Section 5.3).
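A posting list kept in position order with a small offset index could look like the following Python sketch. It assumes that "this" in the note refers to reading only the entries whose position falls within a given range, which the excerpt does not state. The names `build_posting_list` and `entries_in_range` are hypothetical.

```python
from itertools import accumulate


def build_posting_list(entries, n_positions):
    """entries: list of (object_id, position) pairs for one reference object.

    Returns the entries sorted by position, plus an offset table where
    offsets[p] is the index of the first entry with position p.
    """
    counts = [0] * n_positions
    for _, pos in entries:
        counts[pos] += 1
    offsets = [0] + list(accumulate(counts))
    ordered = sorted(entries, key=lambda e: e[1])  # stable sort by position
    return ordered, offsets


def entries_in_range(ordered, offsets, lo, hi):
    """Entries whose position lies in [lo, hi], found without scanning the list."""
    lo = max(lo, 0)
    hi = min(hi, len(offsets) - 2)  # the last valid position is n_positions - 1
    if lo > hi:
        return []
    return ordered[offsets[lo]:offsets[hi + 1]]


# Hypothetical posting list with four possible positions (0..3).
ordered, offsets = build_posting_list(
    [(10, 2), (11, 0), (12, 2), (13, 1), (14, 3)], n_positions=4
)
middle = entries_in_range(ordered, offsets, 1, 2)
```

The offset table has one slot per possible position plus one, so it stays small compared with the posting list itself.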
Cite this article
Amato, G., Gennaro, C. & Savino, P. MI-File: using inverted files for scalable approximate similarity search. Multimed Tools Appl 71, 1333–1362 (2014). https://doi.org/10.1007/s11042-012-1271-1
DOI: https://doi.org/10.1007/s11042-012-1271-1