Abstract
The pre-computation of data cubes is critical to improving the response time of On-Line Analytical Processing (OLAP) systems and can be instrumental in accelerating data mining tasks in large data warehouses. To meet the need for improved performance created by growing data sizes, parallel solutions for generating the data cube are becoming increasingly important. This paper presents a parallel method for generating data cubes on a shared-nothing multiprocessor. Since no (expensive) shared disk is required, our method can be used on low-cost Beowulf-style clusters consisting of standard PCs with local disks connected via a data switch. Our approach uses a ROLAP representation of the data cube in which views are stored as relational tables. This allows for tight integration with current relational database technology.
We have implemented our parallel shared-nothing data cube generation method and evaluated it on a PC cluster, exploring relative speedup, local vs. global schedule trees, data skew, cardinality of dimensions, data dimensionality, and balance tradeoffs. For an input data set of 2,000,000 rows (72 Megabytes), our parallel data cube generation method achieves close to optimal speedup, generating a full data cube of ≈227 million rows (5.6 Gigabytes) on a 16-processor cluster in under 6 minutes. For an input data set of 10,000,000 rows (360 Megabytes), our parallel method, running on a 16-processor PC cluster, created a data cube consisting of ≈846 million rows (21.7 Gigabytes) in under 47 minutes.
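To make the ROLAP representation concrete, the following is a minimal sequential sketch of what a full data cube contains: one aggregated table (view) for every subset of the dimensions. It is not the parallel method evaluated in the paper; it ignores schedule trees, partitioning, and load balancing entirely. The function name `build_cube`, the sample rows, and the choice of SUM as the aggregate are assumptions made for illustration only.

```python
from itertools import combinations
from collections import defaultdict

def build_cube(rows, dims, measure):
    """Naive full cube: map each view (a tuple of dimension names)
    to a table {group key tuple: summed measure}.

    Hypothetical helper for illustration; every view is computed
    directly from the raw rows, with no reuse between views.
    """
    cube = {}
    # Enumerate all 2^d subsets of the dimensions, largest first.
    for k in range(len(dims), -1, -1):
        for view in combinations(dims, k):
            table = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in view)
                table[key] += row[measure]
            cube[view] = dict(table)
    return cube

# Made-up sample fact table with three dimensions and one measure.
rows = [
    {"store": "A", "product": "x", "year": 2003, "sales": 10},
    {"store": "A", "product": "y", "year": 2003, "sales": 5},
    {"store": "B", "product": "x", "year": 2004, "sales": 7},
]
cube = build_cube(rows, ("store", "product", "year"), "sales")
```

With d dimensions the cube holds 2^d views, which is why the output sizes reported above are so much larger than the inputs. Practical methods avoid rescanning the raw data for every view by deriving each view from an already computed, more detailed one.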
Cite this article
Chen, Y., Dehne, F., Eavis, T. et al. Parallel ROLAP Data Cube Construction on Shared-Nothing Multiprocessors. Distributed and Parallel Databases 15, 219–236 (2004). https://doi.org/10.1023/B:DAPD.0000018572.20283.e0
DOI: https://doi.org/10.1023/B:DAPD.0000018572.20283.e0