Abstract
Similarity search is important in information-retrieval applications, where objects are usually represented as high-dimensional vectors. This paper proposes a new dimensionality-reduction technique and an indexing mechanism for high-dimensional datasets. For each data vector, the proposed technique drops the dimensions whose coordinates are less than a critical value. This flexible, datawise dimensionality reduction improves indexing mechanisms for high-dimensional datasets whose distributions are skewed in all coordinates. To apply the proposed technique to information retrieval, a CVA file (compact VA file), which is a revised version of the VA file, is developed. With a CVA file, the size of the index files is reduced further, while the index bounds are kept as tight as possible. Experiments on synthetic and real data confirm the effectiveness of the approach.
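The datawise reduction described in the abstract can be illustrated with a minimal sketch. This is not the CVA file format from the paper: the function names `reduce_vector` and `l1_lower_bound`, the use of the L1 distance, and the assumption that all coordinates are non-negative are illustrative choices made here. The sketch only shows how dropping small coordinates per vector can still yield a valid lower bound on the distance to a query.

```python
from typing import Dict, List


def reduce_vector(vec: List[float], critical: float) -> Dict[int, float]:
    """Keep only the coordinates that are at least `critical`.

    The result maps dimension index -> value. Dropped dimensions are
    known only to lie in [0, critical), assuming non-negative data.
    """
    return {i: x for i, x in enumerate(vec) if x >= critical}


def l1_lower_bound(query: List[float], reduced: Dict[int, float],
                   critical: float) -> float:
    """Lower bound on the L1 distance between `query` and the original vector.

    Kept dimensions contribute their exact difference. A dropped dimension
    holds some value in [0, critical), so it contributes at least
    max(0, q - critical) when the query coordinate q is non-negative.
    """
    bound = 0.0
    for i, q in enumerate(query):
        if i in reduced:
            bound += abs(q - reduced[i])
        else:
            bound += max(0.0, q - critical)
    return bound


if __name__ == "__main__":
    vec = [0.9, 0.02, 0.0, 0.4, 0.01]
    query = [0.5, 0.3, 0.0, 0.4, 0.0]
    critical = 0.05

    reduced = reduce_vector(vec, critical)
    exact = sum(abs(q - v) for q, v in zip(query, vec))
    print(reduced)
    print(l1_lower_bound(query, reduced, critical), "<=", exact)
```

Because each vector keeps its own set of large coordinates, a skewed vector with a few dominant values is stored in far fewer entries, and the bound never exceeds the exact distance, so it could be used to filter candidates before the full vectors are read.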
Cite this article
An, J., Chen, H., Furuse, K. et al. CVA file: an index structure for high-dimensional datasets. Knowl Inf Syst 7, 337–357 (2005). https://doi.org/10.1007/s10115-004-0149-6
DOI: https://doi.org/10.1007/s10115-004-0149-6