Hierarchical clustering for OLAP: the CUBE File approach

  • Regular Paper
The VLDB Journal

Abstract

This paper addresses the problem of physically clustering on disk multidimensional data that are organized in hierarchies, in a way that preserves those hierarchies. We call this hierarchical clustering. A typical case in which hierarchical clustering is necessary to reduce I/Os during query evaluation is the most detailed data of an OLAP cube. The presence of hierarchies in the multidimensional space results in an enormous search space for this problem. We propose a representation of the data space that yields a chunk-tree representation of the cube. The model adapts to the cube’s extensive sparseness and provides efficient access to subsets of the data based on combinations of hierarchy values. Based on this representation of the search space, we formulate the problem as a chunk-to-bucket allocation problem, which is a packing problem, in contrast to the linear ordering approach followed in the literature.
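
As a rough illustration of the chunk-tree idea, the following minimal sketch builds a tree in which each level is keyed by one member per dimension at the same hierarchy depth. It is not the representation defined in the paper: the names `Chunk`, `insert`, `collect` and `lookup`, the two example dimensions, and the assumption that all hierarchies have equal depth are hypothetical. Only non-empty chunks are created, which is one simple way to stay compact on sparse data.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    # Hypothetical node type: children are keyed by a tuple holding one
    # hierarchy member per dimension; cells hold the most detailed values.
    children: dict = field(default_factory=dict)
    cells: list = field(default_factory=list)


def insert(root, paths, value):
    """Insert a data point. `paths` holds one hierarchy path per dimension,
    ordered from the coarsest to the most detailed level (equal depth assumed)."""
    node = root
    for key in zip(*paths):  # one combination of members per hierarchy depth
        node = node.children.setdefault(key, Chunk())  # create only if needed
    node.cells.append(value)


def collect(node):
    """Return every value stored in the subtree rooted at `node`."""
    values = list(node.cells)
    for child in node.children.values():
        values.extend(collect(child))
    return values


def lookup(root, prefix_keys):
    """Return all values under a combination of hierarchy values,
    given as one key per depth, starting from the top level."""
    node = root
    for key in prefix_keys:
        node = node.children.get(key)
        if node is None:  # the chunk was never created: empty region
            return []
    return collect(node)


root = Chunk()
insert(root, [("Europe", "Greece"), ("2005", "Q1")], 10)
insert(root, [("Europe", "Greece"), ("2005", "Q2")], 20)
insert(root, [("Europe", "Italy"), ("2005", "Q1")], 5)
insert(root, [("Asia", "Japan"), ("2005", "Q1")], 7)

print(lookup(root, [("Europe", "2005")]))                     # all of Europe in 2005
print(lookup(root, [("Europe", "2005"), ("Greece", "Q1")]))   # one detailed cell
```

A query on a combination of upper-level values touches a single subtree, which is the access pattern that hierarchical clustering aims to keep physically close on disk.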

We propose a metric for evaluating the quality of the hierarchical clustering achieved (i.e., for evaluating solutions to the problem) and formulate the problem as an optimization problem. We prove that it is NP-hard and provide an effective solution based on a linear-time greedy algorithm. Solving this problem leads to the construction of the CUBE File data structure. We analyze in depth every step of the construction and provide solutions for the sub-problems that arise, such as the formation of bucket-regions, the storage of large data chunks, and the caching of the upper nodes (root directory) in main memory.
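
To make the packing view concrete, here is a simplified greedy sketch of chunk-to-bucket allocation. It is not the algorithm from the paper, whose details the abstract does not give. It assumes a hypothetical `Node` type with unique names and a size in arbitrary units, a fixed bucket `capacity`, and a single rule: if a whole subtree fits in one bucket, store it there; otherwise keep the node in the directory and descend into its children. Leaves that exceed the capacity are only reported, since the handling of large chunks is one of the sub-problems the paper treats separately.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str    # assumed unique within the tree
    size: int    # size of this chunk alone, in hypothetical units
    children: list = field(default_factory=list)


def names(node):
    """Yield the names of all nodes in the subtree, parent first."""
    yield node.name
    for child in node.children:
        yield from names(child)


def allocate(root, capacity):
    """Greedy top-down allocation: returns (buckets, directory, oversized)."""
    sizes = {}

    def measure(node):
        # Post-order pass: each subtree size is computed exactly once.
        sizes[node.name] = node.size + sum(measure(c) for c in node.children)
        return sizes[node.name]

    measure(root)
    buckets, directory, oversized = [], [], []

    def place(node):
        if sizes[node.name] <= capacity:
            buckets.append(list(names(node)))   # whole subtree in one bucket
        elif node.children:
            directory.append(node.name)         # too big: stays in the directory
            for child in node.children:
                place(child)
        else:
            oversized.append(node.name)         # single chunk larger than a bucket

    place(root)
    return buckets, directory, oversized


tree = Node("root", 1, [
    Node("A", 1, [Node("A1", 3), Node("A2", 2)]),
    Node("B", 1, [Node("B1", 6), Node("B2", 9)]),
])
buckets, directory, oversized = allocate(tree, capacity=8)
print(buckets)    # subtrees that each fit in one bucket
print(directory)  # upper nodes kept outside the buckets
print(oversized)  # chunks that need separate treatment
```

Each node is measured once and placed once, so this sketch visits the tree a constant number of times. It makes no attempt to group small sibling subtrees into shared buckets or to optimize a clustering-quality metric.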

Finally, we provide an extensive experimental evaluation of the CUBE File’s adaptability to data space sparseness and to an increasing number of data points. The main result is that the CUBE File is highly adaptive to even the sparsest data spaces and, for realistic data point cardinalities, provides high-quality hierarchical clustering and significant space savings.

Author information

Corresponding author

Correspondence to Nikos Karayannidis.

About this article

Cite this article

Karayannidis, N., Sellis, T. Hierarchical clustering for OLAP: the CUBE File approach. The VLDB Journal 17, 621–655 (2008). https://doi.org/10.1007/s00778-006-0022-1

  • DOI: https://doi.org/10.1007/s00778-006-0022-1
