The cgmCUBE project: Optimizing parallel data cube generation for ROLAP

Distributed and Parallel Databases

Abstract

On-line Analytical Processing (OLAP) has become one of the most powerful and prominent technologies for knowledge discovery in VLDB (Very Large Database) environments. Central to the OLAP paradigm is the data cube, a multi-dimensional hierarchy of aggregate values that provides a rich analytical model for decision support. Various sequential algorithms for the efficient generation of the data cube have appeared in the literature. However, given the size of contemporary data warehousing repositories, multi-processor solutions are crucial for the massive computational demands of current and future OLAP systems.

In this paper we discuss the cgmCUBE Project, a multi-year effort to design and implement a multi-processor platform for data cube generation that targets the relational database model (ROLAP). More specifically, we discuss new algorithmic and system optimizations relating to (1) a thorough optimization of the underlying sequential cube construction method and (2) a detailed and carefully engineered cost model for improved parallel load balancing and faster sequential cube construction. These optimizations were key in allowing us to build a prototype that is able to produce data cube output at a rate of over one TeraByte per hour.
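To make the notion of a data cube concrete, the following is a minimal sketch that computes every group-by (all 2^d subsets of d dimensions) over a tiny fact table with a SUM aggregate. It is a naive, single-process illustration only and is not the cgmCUBE method or its cost model; the function name `data_cube`, the `sales` table, and its column names are hypothetical and were invented for this example.

```python
from collections import defaultdict
from itertools import combinations


def data_cube(rows, dims, measure):
    """Naive full cube: one summed view per subset of the dimensions.

    Returns {group-by tuple of dimension names: {key tuple: summed measure}}.
    Each view is computed by a fresh scan of the input, which is exactly
    the redundancy that optimized cube algorithms try to avoid.
    """
    cube = {}
    for k in range(len(dims) + 1):
        for group_by in combinations(dims, k):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in group_by)
                totals[key] += row[measure]
            cube[group_by] = dict(totals)
    return cube


# Hypothetical fact table, invented for illustration only.
sales = [
    {"store": "A", "product": "pen", "year": 2003, "units": 5},
    {"store": "A", "product": "ink", "year": 2003, "units": 2},
    {"store": "B", "product": "pen", "year": 2004, "units": 7},
]

cube = data_cube(sales, ("store", "product", "year"), "units")
```

With three dimensions the cube holds 2^3 = 8 views, ranging from the empty group-by (the grand total) to the full three-dimensional group-by. Because the number of views doubles with each added dimension, sharing work between views and balancing that work across processors become the central concerns at scale.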

Author information

Corresponding author

Correspondence to Andrew Rau-Chaplin.

Additional information

Research supported by the Natural Sciences and Engineering Research Council of Canada (NSERC).

About this article

Cite this article

Dehne, F., Eavis, T. & Rau-Chaplin, A. The cgmCUBE project: Optimizing parallel data cube generation for ROLAP. Distrib Parallel Databases 19, 29–62 (2006). https://doi.org/10.1007/s10619-006-6575-6

  • DOI: https://doi.org/10.1007/s10619-006-6575-6
