The cgmCUBE project: Optimizing parallel data cube generation for ROLAP

Dehne, Frank; Eavis, Todd; Rau-Chaplin, Andrew

doi:10.1007/s10619-006-6575-6

Published: January 2006

Volume 19, pages 29–62, (2006)

Distributed and Parallel Databases

Abstract

On-line Analytical Processing (OLAP) has become one of the most powerful and prominent technologies for knowledge discovery in VLDB (Very Large Database) environments. Central to the OLAP paradigm is the data cube, a multi-dimensional hierarchy of aggregate values that provides a rich analytical model for decision support. Various sequential algorithms for the efficient generation of the data cube have appeared in the literature. However, given the size of contemporary data warehousing repositories, multi-processor solutions are crucial for the massive computational demands of current and future OLAP systems.
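To make the data cube concrete, the following minimal sketch computes every group-by (view) of a tiny fact table, giving 2^d views for d dimensions. The function name `data_cube`, the `sales` table, and its column names are hypothetical and chosen only for illustration; this naive version rescans the input for each view and is not the cube construction method discussed in the paper.

```python
from collections import defaultdict
from itertools import combinations


def data_cube(rows, dims, measure):
    """Return {view: {key: total}} for all 2^d subsets of the dimensions."""
    cube = {}
    for k in range(len(dims) + 1):
        for view in combinations(dims, k):
            totals = defaultdict(int)
            for row in rows:
                # Group each row by the attributes belonging to this view.
                key = tuple(row[d] for d in view)
                totals[key] += row[measure]
            cube[view] = dict(totals)
    return cube


# Hypothetical fact table: two dimensions and one additive measure.
sales = [
    {"product": "pen", "region": "east", "amount": 3},
    {"product": "pen", "region": "west", "amount": 5},
    {"product": "ink", "region": "east", "amount": 2},
]

cube = data_cube(sales, ("product", "region"), "amount")
for view, cells in cube.items():
    print(view, cells)
```

The empty view `()` holds the grand total, and each larger view refines it along one or more dimensions, which is the hierarchy of aggregates the abstract refers to.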

In this paper, we discuss the cgmCUBE Project, a multi-year effort to design and implement a multi-processor platform for data cube generation that targets the relational database model (ROLAP). More specifically, we discuss new algorithmic and system optimizations relating to (1) a thorough optimization of the underlying sequential cube construction method and (2) a detailed and carefully engineered cost model for improved parallel load balancing and faster sequential cube construction. These optimizations were key in allowing us to build a prototype that is able to produce data cube output at a rate of over one terabyte per hour.
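As a rough illustration of why cost estimates matter for parallel load balancing, the sketch below assigns views to processors with a generic greedy heuristic (largest estimated cost first, always to the least-loaded processor). This is an assumption made for illustration and is not the cgmCUBE partitioning method or its cost model; the function name `assign_views`, the view labels, and the cost figures are all hypothetical.

```python
import heapq


def assign_views(estimated_costs, num_processors):
    """Greedily assign views to processors, largest estimated cost first."""
    # Min-heap of (current load, processor id): the least-loaded pops first.
    heap = [(0, p) for p in range(num_processors)]
    heapq.heapify(heap)
    assignment = {p: [] for p in range(num_processors)}
    by_cost = sorted(estimated_costs.items(), key=lambda kv: -kv[1])
    for view, cost in by_cost:
        load, p = heapq.heappop(heap)
        assignment[p].append(view)
        heapq.heappush(heap, (load + cost, p))
    loads = {
        p: sum(estimated_costs[v] for v in views)
        for p, views in assignment.items()
    }
    return assignment, loads


# Hypothetical cost estimates for the eight views of a 3-dimensional cube.
costs = {"ABC": 8, "AB": 5, "AC": 4, "BC": 3, "A": 2, "B": 2, "C": 1, "ALL": 1}

assignment, loads = assign_views(costs, 2)
print(assignment)
print(loads)
```

The quality of such an assignment depends directly on the accuracy of the estimates: if the predicted costs are far from the real ones, the processors finish at very different times regardless of how the work is divided.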

Author information

Authors and Affiliations

Carleton University, Ottawa, Canada
Frank Dehne
Concordia University, Montreal, Canada
Todd Eavis
Faculty of Computer Science, Dalhousie University, 6050 University Ave., Halifax, NS, B3J 1W5, Canada
Andrew Rau-Chaplin

Corresponding author

Correspondence to Andrew Rau-Chaplin.

Additional information

Research supported by the Natural Sciences and Engineering Research Council of Canada (NSERC).

About this article

Cite this article

Dehne, F., Eavis, T. & Rau-Chaplin, A. The cgmCUBE project: Optimizing parallel data cube generation for ROLAP. Distrib Parallel Databases 19, 29–62 (2006). https://doi.org/10.1007/s10619-006-6575-6

Issue Date: January 2006
DOI: https://doi.org/10.1007/s10619-006-6575-6
