DOI: 10.1145/2247596.2247654
research-article

SIMP: accurate and efficient near neighbor search in high dimensional spaces

Published: 27 March 2012

ABSTRACT

Near neighbor search in high dimensional spaces is useful in many applications. Existing techniques solve this problem efficiently only in the approximate case. These solutions are designed to solve r-near neighbor queries for a fixed query range or for a set of query ranges with probabilistic guarantees, and are then extended to nearest neighbor queries. Solutions supporting a set of query ranges suffer from prohibitive space cost. Many applications are quality sensitive and need to support near neighbor queries efficiently and accurately for all query ranges. In this paper, we propose a novel indexing and querying scheme called Spatial Intersection and Metric Pruning (SIMP). It efficiently supports r-near neighbor queries in very high dimensional spaces for all query ranges with a 100% quality guarantee and with practical storage costs. Our empirical studies on three real datasets with dimensions between 32 and 256 and sizes up to 10 million points show superior performance of SIMP over LSH, Multi-Probe LSH, LSB tree, and iDistance. Our scalability tests on real datasets with as many as 100 million points and up to 256 dimensions establish that SIMP scales linearly with query range, dataset dimension, and dataset size.
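
To make the query type concrete, the following minimal sketch shows what an exact r-near neighbor query returns, using a brute-force linear scan. It is not the SIMP index described above, only a baseline that illustrates the "100% quality" answer set any exact method must reproduce. The function name r_near_neighbors, the sample points, and the choice of Euclidean distance are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def r_near_neighbors(
    points: Sequence[Sequence[float]],
    query: Sequence[float],
    r: float,
) -> List[int]:
    """Return indices of all points within Euclidean distance r of query.

    Exact answer by linear scan: every point is compared against the query,
    so no true neighbor can be missed (hypothetical baseline, not SIMP).
    """
    return [i for i, p in enumerate(points) if math.dist(p, query) <= r]


# Illustrative data (assumed, not from the paper).
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)]

# The same data answers queries for any range r, with no per-range index.
print(r_near_neighbors(points, (0.0, 0.0), 1.0))
print(r_near_neighbors(points, (0.0, 0.0), 5.0))
```

The scan costs time linear in the dataset size for every query, which is the cost an index is meant to reduce while still returning the same result set.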

        Published in

        EDBT '12: Proceedings of the 15th International Conference on Extending Database Technology
        March 2012
        643 pages
        ISBN: 9781450307901
        DOI: 10.1145/2247596

        Copyright © 2012 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        Overall Acceptance Rate: 7 of 10 submissions, 70%
