CVA file: an index structure for high-dimensional datasets

An, Jiyuan; Chen, Hanxiong; Furuse, Kazutaka; Ohbo, Nobuo

doi:10.1007/s10115-004-0149-6

Published: 01 March 2005

Knowledge and Information Systems, volume 7, pages 337–357 (2005)

Abstract

Similarity search is important in information-retrieval applications, where objects are usually represented as high-dimensional vectors. This paper proposes a new dimensionality-reduction technique and an indexing mechanism for high-dimensional datasets. For each data vector, the proposed technique drops the dimensions whose coordinates are less than a critical value. This flexible, datawise dimensionality reduction improves indexing mechanisms for high-dimensional datasets whose distributions are skewed in all coordinates. To apply the technique to information retrieval, a CVA file (compact VA file), which is a revised version of the VA file, is developed. A CVA file further reduces the size of the index files while keeping the index bounds as tight as possible. Its effectiveness is confirmed on synthetic and real data.
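The datawise reduction described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the choice of critical value, the sample data, and the assumption that coordinates are non-negative are all hypothetical. The abstract does not say how the CVA file encodes the surviving coordinates or computes its bounds, so the sketch covers only the per-vector filtering step.

```python
from typing import Dict, List


def reduce_vector(vector: List[float], critical: float) -> Dict[int, float]:
    """Keep only the coordinates that are at least `critical`.

    Hypothetical helper: returns a mapping from dimension index to
    coordinate value, so each vector records its own surviving dimensions.
    Assumes non-negative coordinates.
    """
    return {i: x for i, x in enumerate(vector) if x >= critical}


def reduce_dataset(vectors: List[List[float]], critical: float) -> List[Dict[int, float]]:
    """Apply the per-vector reduction to every vector in the dataset."""
    return [reduce_vector(v, critical) for v in vectors]


# Illustrative skewed data: most coordinates of each vector are near zero.
data = [
    [0.90, 0.01, 0.00, 0.02],
    [0.00, 0.03, 0.80, 0.70],
]
reduced = reduce_dataset(data, critical=0.05)
print(reduced)
```

Each vector keeps a different set of dimensions (dimension 0 for the first, dimensions 2 and 3 for the second), which is what distinguishes a datawise reduction from one that removes the same dimensions from the whole dataset.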

Author information

Authors and Affiliations

Doctoral Program in Engineering, University of Tsukuba, Ibaraki, Japan
Jiyuan An
Institute of Information Sciences and Electronics, University of Tsukuba, Ibaraki, Japan
Hanxiong Chen, Kazutaka Furuse & Nobuo Ohbo
Centre for Information Technology Innovation, Queensland University of Technology, 126 Margaret Street GPO Box 2434, Brisbane, Australia
Jiyuan An

Corresponding author

Correspondence to Jiyuan An.

About this article

Cite this article

An, J., Chen, H., Furuse, K. et al. CVA file: an index structure for high-dimensional datasets. Knowl Inf Syst 7, 337–357 (2005). https://doi.org/10.1007/s10115-004-0149-6

Received: 15 March 2003
Revised: 14 September 2003
Accepted: 01 December 2003
Published: 01 March 2005
Issue Date: March 2005
DOI: https://doi.org/10.1007/s10115-004-0149-6
