Extreme pivots: a pivot selection strategy for faster metric search

  • Regular Paper
  • Knowledge and Information Systems

Abstract

This manuscript presents the extreme pivots (EP) metric index, a data structure that speeds up exact proximity searching in the metric space model. For the EP, we designed an automatic rule that selects the best pivots for a dataset under limited memory resources. The net effect is that our approach solves queries efficiently with a small memory footprint and without a prohibitive construction time. In contrast with related structures, this performance is achieved automatically, without tuning the index's parameters directly, by applying optimization techniques to a model of the index. This contribution studies the EP model in depth. In practical terms, a user only needs to provide the available memory and a sample of the query distribution as parameters. The resulting index is built quickly and offers a good trade-off among memory usage, preprocessing, and search time. We provide an extensive experimental comparison with state-of-the-art search methods. We also carefully compare the performance of metric indexes in several scenarios: first on synthetic data, to characterize performance as a function of the intrinsic dimension and the size of the database, and then on several real-world datasets, with excellent results.
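
For readers unfamiliar with the filtering principle that pivot-based metric indexes rely on, the following is a minimal sketch of a generic pivot table. It is not the EP index: the pivots are chosen at random, and neither the EP selection rule nor its model is implemented. The function names (`build_pivot_table`, `range_search`, `demo`), the Euclidean metric, and all parameter values are assumptions made for illustration only.

```python
import random
from math import dist  # Euclidean distance between two points (Python 3.8+)


def build_pivot_table(db, pivots):
    """Precompute d(u, p) for every database object u and every pivot p."""
    return [[dist(u, p) for p in pivots] for u in db]


def range_search(db, pivots, table, q, r):
    """Exact range query: indices of objects within distance r of q.

    Also returns the number of distance evaluations performed at query time.
    """
    dq = [dist(q, p) for p in pivots]
    evaluations = len(pivots)
    result = []
    for i, row in enumerate(table):
        # Triangle inequality: |d(q, p) - d(u, p)| <= d(q, u).
        # If this lower bound exceeds r for any pivot, u cannot be an answer.
        if any(abs(a - b) > r for a, b in zip(dq, row)):
            continue
        evaluations += 1
        if dist(q, db[i]) <= r:
            result.append(i)
    return result, evaluations


def demo(n=500, dim=4, n_pivots=8, r=0.2, seed=0):
    """Compare the pivot-filtered search against a brute-force scan."""
    rng = random.Random(seed)
    db = [tuple(rng.random() for _ in range(dim)) for _ in range(n)]
    pivots = rng.sample(db, n_pivots)  # random pivots, NOT the EP rule
    table = build_pivot_table(db, pivots)
    q = tuple(rng.random() for _ in range(dim))
    found, evaluations = range_search(db, pivots, table, q, r)
    brute = [i for i, u in enumerate(db) if dist(q, u) <= r]
    return found, brute, evaluations
```

The table costs one stored distance per object and pivot, which is where the memory budget mentioned in the abstract enters: more pivots tend to discard more objects but require more memory and more pivot evaluations per query.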

Notes

  1. By Chebyshev's inequality, for a random variable Z with mean \(\mu _Z\) and finite variance \(\sigma ^2_Z>0\), and for any \(\epsilon >0\), \({\mathrm{Pr}}(|Z-\mu _Z|>\epsilon )<\sigma ^2_Z/\epsilon ^2\).

  2. The precise values 0.16 and 0.34 can be numerically computed or retrieved in traditional pre-computed tables for the normal distribution.

  3. https://github.com/sadit/natix/.
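
As a rough illustration of how the values in note 2 can be computed numerically, the sketch below evaluates the standard normal cumulative distribution function with the error function. It assumes, since this excerpt does not say so, that 0.16 is the probability mass more than one standard deviation below the mean and 0.34 is the mass between one standard deviation below the mean and the mean. The names `normal_cdf`, `lower_tail`, and `inner_band` are illustrative.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


# Assumed meaning of the two constants (not stated in this excerpt):
lower_tail = normal_cdf(-1.0)                    # Pr(Z < mu - sigma)
inner_band = normal_cdf(0.0) - normal_cdf(-1.0)  # Pr(mu - sigma < Z < mu)

print(round(lower_tail, 2), round(inner_band, 2))
```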

Author information

Corresponding author

Correspondence to Eric S. Tellez.

About this article

Cite this article

Ruiz, G., Chavez, E., Ruiz, U. et al. Extreme pivots: a pivot selection strategy for faster metric search. Knowl Inf Syst 62, 2349–2382 (2020). https://doi.org/10.1007/s10115-019-01423-5

  • DOI: https://doi.org/10.1007/s10115-019-01423-5
