Abstract
Although conventional index structures provide various nearest-neighbor search algorithms for high-dimensional data, there are additional requirements to improve search performance and to support index scalability for large-scale datasets. To meet these requirements, we propose a distributed high-dimensional index structure for cluster systems, called the Distributed Vector Approximation-tree (DVA-tree), which is a two-level structure consisting of a hybrid spill-tree and Vector Approximation files (VA-files). We also describe the algorithms used for constructing the DVA-tree over multiple machines and for performing distributed k-nearest neighbor (k-NN) searches. To evaluate the performance of the DVA-tree, we conduct an experimental study using both real and synthetic datasets. The results show that our proposed method has significant performance advantages over existing index structures on different kinds of datasets.
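To make the two-level idea more concrete, the following is a minimal sketch of a VA-file style filter-and-refine k-NN search whose per-partition results are merged into a global answer. It is not the authors' DVA-tree algorithm. It assumes vectors normalized to [0, 1] per dimension, a uniform grid with `BITS` bits per dimension, and a plain list of partitions standing in for the hybrid spill-tree level. All names (`approximate`, `lower_bound`, `build_partition`, `local_knn`, `distributed_knn`) are hypothetical and chosen only for illustration.

```python
import heapq
import math

BITS = 2            # assumed number of bits per dimension
CELLS = 1 << BITS   # number of grid cells per dimension


def approximate(vec):
    """Quantize a vector in [0, 1]^d to one cell index per dimension."""
    return tuple(min(int(x * CELLS), CELLS - 1) for x in vec)


def lower_bound(query, approx):
    """Smallest possible Euclidean distance from query to any point in the cell."""
    total = 0.0
    for q, cell in zip(query, approx):
        lo, hi = cell / CELLS, (cell + 1) / CELLS
        if q < lo:
            total += (lo - q) ** 2
        elif q > hi:
            total += (q - hi) ** 2
    return math.sqrt(total)


def build_partition(vectors):
    """One node's data: the exact vectors plus their compact approximations."""
    vectors = [tuple(v) for v in vectors]
    return vectors, [approximate(v) for v in vectors]


def local_knn(partition, query, k):
    """Filter-and-refine k-NN on one partition; returns sorted (distance, vector)."""
    vectors, approximations = partition
    # Filter step: scan only the approximations and rank by lower bound.
    bounds = sorted((lower_bound(query, a), i) for i, a in enumerate(approximations))
    best = []  # max-heap of the k closest so far, stored as (-distance, vector)
    for lb, i in bounds:
        # Every remaining candidate has a lower bound >= lb, so stop here.
        if len(best) == k and lb > -best[0][0]:
            break
        # Refine step: compute the exact distance for this candidate.
        d = math.dist(query, vectors[i])
        if len(best) < k:
            heapq.heappush(best, (-d, vectors[i]))
        elif d < -best[0][0]:
            heapq.heapreplace(best, (-d, vectors[i]))
    return sorted((-neg_d, v) for neg_d, v in best)


def distributed_knn(partitions, query, k):
    """Run local k-NN on every partition and merge into the global top k."""
    hits = (hit for part in partitions for hit in local_knn(part, query, k))
    return heapq.nsmallest(k, hits)
```

In a real cluster each call to `local_knn` would run on a separate machine and only the k local hits would be sent back for merging; here the partitions are searched sequentially in one process for simplicity.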
Acknowledgements
This work was supported by the IT R&D program of MKE/KEIT [10038768, The Development of Supercomputing System for the Genome Analysis].
Cite this article
Choi, HH., Lee, MY. & Lee, KC. Distributed high dimensional indexing for k-NN search. J Supercomput 62, 1362–1384 (2012). https://doi.org/10.1007/s11227-012-0800-z
DOI: https://doi.org/10.1007/s11227-012-0800-z