ABSTRACT
Because of the well-known curse of dimensionality, search in a high-dimensional space is considered a "hard" problem. In this paper, a novel symmetrical encoding-based index structure, called the EHD-Tree (for symmetrical Encoding-based Hybrid Distance Tree), is proposed to support fast k-Nearest-Neighbor (k-NN) search in high-dimensional spaces. In an EHD-Tree, all data points are first grouped into clusters by a k-Means clustering algorithm. The uniform ID number of each data point is then obtained by a dual-distance-driven encoding scheme in which each cluster sphere is partitioned twice, once according to the start distance and once according to the centroid distance. Finally, the uniform ID number and the centroid distance of each data point are combined into a uniform index key, which is indexed through a partition-based B+-tree. Thus, given a query point, its k-NN search in a high-dimensional space can be transformed into a search in a one-dimensional space with the aid of the EHD-Tree index. Extensive performance studies are conducted to evaluate the effectiveness and efficiency of the proposed scheme, and the results demonstrate that this method outperforms state-of-the-art high-dimensional search techniques such as the X-Tree, VA-file, iDistance and NB-Tree, especially when the query radius is not very large.
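The pipeline described above (cluster, encode each point by two distances, then index a one-dimensional key) can be illustrated with a minimal sketch. The abstract does not give the encoding formula, so everything concrete here is an assumption made for illustration: the start distance is taken as the distance to the origin, each cluster is cut into `n_slices` equal-width slices per distance, the key is the tuple `(cluster, part_id, centroid_distance)`, and a sorted list searched with `bisect` stands in for the B+-tree. The function names are hypothetical and are not taken from the paper.

```python
import bisect
import math


def dist(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iters=10):
    """Plain Lloyd's k-means, seeded with the first k points (an assumption)."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: dist(p, centroids[c]))
            groups[j].append(p)
        # Recompute each centroid; an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
    return centroids


def slice_of(value, lo, hi, n_slices):
    """Map a value in [lo, hi] to a slice number in 0..n_slices-1."""
    if hi <= lo:
        return 0
    return min(n_slices - 1, int((value - lo) / (hi - lo) * n_slices))


def build_index(points, k, n_slices=4, start=None):
    """Return (centroids, sorted keys, points in key order)."""
    if start is None:
        # Assumed reference point for the start distance: the origin.
        start = tuple(0.0 for _ in points[0])
    centroids = kmeans(points, k)
    rows = []
    for p in points:
        j = min(range(k), key=lambda c: dist(p, centroids[c]))
        rows.append((j, dist(p, centroids[j]), dist(p, start), tuple(p)))
    entries = []
    for j in range(k):
        members = [r for r in rows if r[0] == j]
        if not members:
            continue
        dc_hi = max(r[1] for r in members)   # cluster radius
        ds_lo = min(r[2] for r in members)
        ds_hi = max(r[2] for r in members)
        for _, dc, ds, p in members:
            # Hypothetical ID: start-distance slice combined with
            # centroid-distance slice.
            part_id = (slice_of(ds, ds_lo, ds_hi, n_slices) * n_slices
                       + slice_of(dc, 0.0, dc_hi, n_slices))
            entries.append(((j, part_id, dc), p))
    entries.sort()  # sorted order stands in for B+-tree leaf order
    keys = [key for key, _ in entries]
    payload = [p for _, p in entries]
    return centroids, keys, payload


def range_scan(keys, payload, cluster, part_id, dc_lo, dc_hi):
    """Points in one partition whose centroid distance lies in [dc_lo, dc_hi]."""
    lo = bisect.bisect_left(keys, (cluster, part_id, dc_lo))
    hi = bisect.bisect_right(keys, (cluster, part_id, dc_hi))
    return payload[lo:hi]
```

A k-NN search built on such an index would presumably work out which partitions a query sphere intersects and call something like `range_scan` on each. That step is not specified in the abstract and is left out of the sketch.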
Index Terms
- Indexing high-dimensional data in dual distance spaces: a symmetrical encoding approach