CVA file: an index structure for high-dimensional datasets

Knowledge and Information Systems

Abstract

Similarity search is important in information-retrieval applications, where objects are usually represented as high-dimensional vectors. This paper proposes a new dimensionality-reduction technique and an indexing mechanism for high-dimensional datasets. For each data vector, the proposed technique drops the dimensions whose coordinates are less than a critical value. This flexible, datawise dimensionality reduction improves indexing mechanisms for high-dimensional datasets whose coordinates all follow skewed distributions. To apply the proposed technique to information retrieval, a CVA file (compact VA file), which is a revised version of the VA file, is developed. A CVA file further reduces the size of the index files while keeping the index bounds as tight as possible. The effectiveness of the approach is confirmed on synthetic and real data.
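The datawise reduction described in the abstract can be illustrated with a minimal sketch. It assumes each vector is a plain list of floats and that a reduced vector is stored as a mapping from dimension index to the coordinates that were kept. The function names, the dictionary representation, and the critical value 0.05 are illustrative assumptions and are not taken from the paper; the CVA file layout itself is not reproduced here.

```python
from typing import Dict, List


def reduce_vector(vector: List[float], critical_value: float) -> Dict[int, float]:
    """Keep only the coordinates that are not less than critical_value.

    The result maps dimension index -> coordinate, so each vector may
    keep a different set of dimensions (the reduction is per vector).
    """
    return {i: x for i, x in enumerate(vector) if x >= critical_value}


def reduce_dataset(vectors: List[List[float]], critical_value: float) -> List[Dict[int, float]]:
    """Apply the per-vector reduction to every vector in the dataset."""
    return [reduce_vector(v, critical_value) for v in vectors]


# Hypothetical skewed data: most coordinates are close to zero.
data = [
    [0.90, 0.01, 0.00, 0.02],
    [0.00, 0.75, 0.03, 0.40],
]
reduced = reduce_dataset(data, critical_value=0.05)
print(reduced)
```

In this sketch the first vector keeps one dimension and the second keeps two, which is the sense in which the reduction is flexible: the retained dimensions are chosen per vector rather than once for the whole dataset.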

Author information

Corresponding author

Correspondence to Jiyuan An.

About this article

Cite this article

An, J., Chen, H., Furuse, K. et al. CVA file: an index structure for high-dimensional datasets. Knowl Inf Syst 7, 337–357 (2005). https://doi.org/10.1007/s10115-004-0149-6

  • DOI: https://doi.org/10.1007/s10115-004-0149-6
