Abstract
This manuscript presents the extreme pivots (EP) metric index, a data structure that speeds up exact proximity searching in the metric space model. For the EP, we designed an automatic rule that selects the best pivots for a dataset under limited memory resources. The net effect is that our approach solves queries efficiently with a small memory footprint and without a prohibitive construction time. In contrast with related structures, this performance is achieved automatically, without tuning the index's parameters directly, by applying optimization techniques to a model of the index. The EP's model is studied in depth in this contribution. In practical terms, a user only needs to provide the available memory and a sample of the query distribution as parameters. The resulting index is built quickly and offers a good trade-off among memory usage, preprocessing, and search time. We provide an extensive experimental comparison with state-of-the-art searching methods. We also carefully compare the performance of metric indexes in several scenarios: first with synthetic data, to characterize performance as a function of the intrinsic dimension and the size of the database, and then on different real-world datasets, with excellent results.
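For readers unfamiliar with pivot-based indexes, the following is a minimal sketch of a generic pivot table answering an exact range query. It is not the EP index: pivots are drawn at random rather than by the automatic selection rule described above, and the names `euclidean`, `build_pivot_table`, and `range_search`, as well as the toy dataset, are assumptions made only for illustration. It shows how the triangle inequality lets precomputed object-to-pivot distances discard objects without evaluating their distance to the query.

```python
import math
import random


def euclidean(a, b):
    """Metric used in this sketch; any function obeying the triangle inequality works."""
    return math.dist(a, b)


def build_pivot_table(db, pivots, dist):
    """Precompute table[i][j] = dist(db[i], pivots[j])."""
    return [[dist(u, p) for p in pivots] for u in db]


def range_search(q, r, db, pivots, table, dist):
    """Return every u in db with dist(q, u) <= r, plus the number of distance evaluations."""
    dq = [dist(q, p) for p in pivots]
    evaluated = len(pivots)
    results = []
    for u, row in zip(db, table):
        # Triangle inequality: |d(q, p) - d(u, p)| <= d(q, u) for every pivot p.
        # If the lower bound already exceeds r, u cannot be an answer.
        if any(abs(a - b) > r for a, b in zip(dq, row)):
            continue
        evaluated += 1
        if dist(q, u) <= r:
            results.append(u)
    return results, evaluated


# Hypothetical toy data: 1000 random points in the unit square, 4 random pivots.
random.seed(0)
db = [(random.random(), random.random()) for _ in range(1000)]
pivots = random.sample(db, 4)
table = build_pivot_table(db, pivots, euclidean)

q = (0.5, 0.5)
hits, evaluated = range_search(q, 0.05, db, pivots, table, euclidean)
brute_force = [u for u in db if euclidean(q, u) <= 0.05]
```

The search is exact because an object is discarded only when the lower bound proves it lies outside the query ball. The table stores one distance per object and pivot, so memory grows with the number of pivots, which is the resource a pivot selection strategy has to budget.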
Notes
By Chebyshev's inequality, for a random variable Z with mean \(\mu_Z\) and variance \(\sigma^2_Z > 0\), and any \(\epsilon > 0\), \({\mathrm{Pr}}(|Z-\mu_Z|>\epsilon)<\sigma^2_Z/\epsilon^2\).
The precise values 0.16 and 0.34 can be numerically computed or retrieved in traditional pre-computed tables for the normal distribution.
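The two notes above can be checked numerically with a short sketch. It assumes, since the notes do not say so explicitly, that 0.16 is the probability mass of a normal variable more than one standard deviation above its mean and that 0.34 is the mass between the mean and one standard deviation above it. The helper names `normal_cdf` and `chebyshev_bound` are illustrative only.

```python
import math


def normal_cdf(z):
    """CDF of the standard normal distribution, computed with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def chebyshev_bound(sigma, eps):
    """Upper bound sigma^2 / eps^2 on Pr(|Z - mu| > eps)."""
    return sigma ** 2 / eps ** 2


# Assumed reading of the two constants (not stated in the notes):
upper_tail = 1.0 - normal_cdf(1.0)         # Pr(Z > mu + sigma)
mean_to_one_sigma = normal_cdf(1.0) - 0.5  # Pr(mu < Z < mu + sigma)
print(round(upper_tail, 2), round(mean_to_one_sigma, 2))  # prints: 0.16 0.34

# Chebyshev's bound compared with the exact two-sided tail of a
# standard normal variable at eps = 2 * sigma.
two_sided_tail = 2.0 * (1.0 - normal_cdf(2.0))
bound = chebyshev_bound(1.0, 2.0)
```

For the normal distribution the exact tail sits well below the Chebyshev bound, which is expected because the inequality holds for any distribution with finite variance.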
Cite this article
Ruiz, G., Chavez, E., Ruiz, U. et al. Extreme pivots: a pivot selection strategy for faster metric search. Knowl Inf Syst 62, 2349–2382 (2020). https://doi.org/10.1007/s10115-019-01423-5
DOI: https://doi.org/10.1007/s10115-019-01423-5