Research article. DOI: 10.1145/2247596.2247654

SIMP: accurate and efficient near neighbor search in high dimensional spaces

Published: 27 March 2012

Abstract

Near neighbor search in high dimensional spaces is useful in many applications. Existing techniques solve this problem efficiently only in the approximate case. They are designed to answer r-near neighbor queries for a fixed query range or for a set of query ranges with probabilistic guarantees, and are then extended to nearest neighbor queries. Solutions that support a set of query ranges suffer from prohibitive space cost. Many applications, however, are quality sensitive and need efficient and accurate support for near neighbor queries over all query ranges. In this paper, we propose a novel indexing and querying scheme called Spatial Intersection and Metric Pruning (SIMP). It efficiently supports r-near neighbor queries in very high dimensional spaces for all query ranges, with a 100% quality guarantee and practical storage costs. Our empirical studies on three real datasets, with dimensions between 32 and 256 and sizes up to 10 million points, show that SIMP outperforms LSH, Multi-Probe LSH, LSB tree, and iDistance. Our scalability tests on real datasets with as many as 100 million points and up to 256 dimensions establish that SIMP scales linearly with query range, dataset dimension, and dataset size.
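To make the query type in the abstract concrete, the following is a minimal sketch of an exact r-near neighbor query answered by a linear scan. It is not the SIMP index, whose construction the abstract does not describe. It only shows the result set that a scheme with a 100% quality guarantee would have to return. Two things here are assumptions for illustration: the distance is Euclidean, and "quality" is read as recall against the exact result set. The names `r_near_neighbors` and `recall` are hypothetical.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def r_near_neighbors(points: List[Point], query: Point, r: float) -> List[int]:
    """Return the indices of all points within distance r of the query.

    Brute-force linear scan (assumed Euclidean distance); this is the exact
    answer that any index-based method is measured against.
    """
    return [i for i, p in enumerate(points) if math.dist(p, query) <= r]


def recall(reported: List[int], exact: List[int]) -> float:
    """Fraction of the exact r-near neighbors that a method reported.

    Assumption: "quality" is interpreted here as recall; 1.0 means no true
    neighbor was missed.
    """
    exact_set = set(exact)
    if not exact_set:
        return 1.0
    return len(exact_set & set(reported)) / len(exact_set)


# Toy 2-dimensional data; real workloads in the abstract use 32 to 256 dimensions.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
query = (0.0, 0.0)
exact = r_near_neighbors(points, query, r=2.0)
```

The scan costs one distance computation per point for every query, which is the cost an index aims to avoid. An approximate method that missed one of the three true neighbors in this toy example would have a recall of 2/3 under the assumed definition.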


      Published In

      EDBT '12: Proceedings of the 15th International Conference on Extending Database Technology
      March 2012
      643 pages
      ISBN: 9781450307901
      DOI: 10.1145/2247596
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • Research-article

      Conference

      EDBT '12

      Acceptance Rates

      Overall Acceptance Rate 7 of 10 submissions, 70%

      Bibliometrics & Citations

      Article Metrics

      • Downloads (Last 12 months): 2
      • Downloads (Last 6 weeks): 0
      Reflects downloads up to 28 Feb 2025

      Cited By

      • (2023) Processing Reverse Nearest Neighbor Queries Based on Unbalanced Multiway Region Tree Index. Web Information Systems Engineering – WISE 2023, pp. 733-747. DOI: 10.1007/978-981-99-7254-8_57. Online publication date: 21-Oct-2023
      • (2021) Processing Approximate KNN Query Based on Data Source Selection. 2021 International Conference on Intelligent Computing, Automation and Applications (ICAA), pp. 672-676. DOI: 10.1109/ICAA53760.2021.00121. Online publication date: Jun-2021
      • (2021) Integrating Real-Time Entity Resolution with Top-N Join Query Processing. Knowledge Science, Engineering and Management, pp. 111-123. DOI: 10.1007/978-3-030-82153-1_10. Online publication date: 7-Aug-2021
      • (2020) Evaluating Top-N Join Queries with Real-time Entity Resolution. Journal of Physics: Conference Series, 1575, article 012084. DOI: 10.1088/1742-6596/1575/1/012084. Online publication date: 14-Jul-2020
      • (2018) Region-tree Based Sorted List Indexing for Real-time Entity Resolution in n-Dimensional Data Space. IOP Conference Series: Materials Science and Engineering, 466, article 012025. DOI: 10.1088/1757-899X/466/1/012025. Online publication date: 28-Dec-2018
      • (2017) High dimensional nearest neighbor search considering outliers based on fuzzy membership. 2017 Computing Conference, pp. 363-371. DOI: 10.1109/SAI.2017.8252127. Online publication date: Jul-2017
      • (2016) Nearest Keyword Set Search in Multi-Dimensional Datasets. IEEE Transactions on Knowledge and Data Engineering, 28(3), pp. 741-755. DOI: 10.1109/TKDE.2015.2492549. Online publication date: 1-Mar-2016
      • (2016) Evaluating Top-N queries in n-dimensional normed spaces. Information Sciences: an International Journal, 374(C), pp. 255-275. DOI: 10.1016/j.ins.2016.09.035. Online publication date: 20-Dec-2016
      • (2013) Improving the Performance of High-Dimensional kNN Retrieval through Localized Dataspace Segmentation and Hybrid Indexing. Proceedings of the 17th East European Conference on Advances in Databases and Information Systems - Volume 8133, pp. 344-357. DOI: 10.5555/2939301.2939313. Online publication date: 1-Sep-2013
      • (2013) Improving the Performance of High-Dimensional kNN Retrieval through Localized Dataspace Segmentation and Hybrid Indexing. Advances in Databases and Information Systems, pp. 344-357. DOI: 10.1007/978-3-642-40683-6_26. Online publication date: 2013
