Efficient Level-Based Top-Down Data Cube Computation Using MapReduce

Lee, Suan; Kim, Jinho; Moon, Yang-Sae; Lee, Wookey

doi:10.1007/978-3-662-47804-2_1

Part of the book series: Lecture Notes in Computer Science (TLDKS, volume 9260)

Abstract

The data cube is an essential part of OLAP (On-Line Analytical Processing), supporting efficient multidimensional analysis of large volumes of data. Computing a data cube takes much time because a cube with d dimensions consists of 2^d cuboids, a number exponential in d. To build ROLAP (Relational OLAP) data cubes efficiently, many algorithms (e.g., GBLP, PipeSort, PipeHash, and BUC) have been developed; they share sort costs and input data scans and/or reduce computation time. Several parallel processing algorithms have also been proposed. Meanwhile, MapReduce has recently emerged as a framework for processing huge volumes of data, such as web-scale data, in a distributed and parallel manner on a large number of computers (e.g., several hundred or several thousand). In the MapReduce framework, the degree of parallelism matters more for reducing total execution time than elaborate strategies such as sort sharing and computation reduction, which existing ROLAP algorithms use. In this paper, we propose two distributed parallel processing algorithms. The first, MRLevel, takes advantage of the MapReduce framework. The second, MRPipeLevel, is based on the existing PipeSort algorithm, one of the most efficient algorithms for top-down cube computation. (The top-down approach handles big data more effectively than alternatives such as bottom-up computation and special data structures, which depend on main-memory size.) MRLevel aims to parallelize cube computation and, at the same time, to reduce the number of data scans by processing cuboids level by level. MRPipeLevel builds on the advantages of MRLevel and further reduces the number of data scans by pipelining. We implemented the algorithms and evaluated their performance under the MapReduce framework. Through the experiments, we also identify the factors that improve MapReduce performance when processing very large data.
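
To make the 2^d cuboid count and the idea of level-wise top-down computation concrete, the following is a minimal single-machine sketch in Python. It is not the authors' MRLevel or MRPipeLevel implementation: the function names (cuboids_by_level, aggregate, compute_cube), the row format, and the choice of SUM as the measure are all assumptions made for illustration. Each cuboid is computed from some already computed cuboid one level above it, which is valid here because SUM is a distributive aggregate.

```python
from collections import defaultdict
from itertools import combinations


def cuboids_by_level(dims):
    """Group all 2^d cuboids (subsets of dims) by their number of dimensions."""
    return {k: list(combinations(dims, k)) for k in range(len(dims), -1, -1)}


def aggregate(rows, dims, cuboid):
    """Toy map/reduce step: emit (group key, measure) per row, then sum per key.

    rows is an iterable of (key_tuple, measure) pairs whose keys follow dims.
    """
    groups = defaultdict(int)
    for key, measure in rows:
        record = dict(zip(dims, key))
        group_key = tuple(record[d] for d in cuboid)  # "map": project the key
        groups[group_key] += measure                  # "reduce": sum measures
    return dict(groups)


def compute_cube(rows, dims):
    """Top-down: build each level from a parent cuboid in the level above."""
    dims = tuple(dims)
    levels = cuboids_by_level(dims)
    cube = {dims: aggregate(rows, dims, dims)}  # base cuboid from the raw data
    for k in range(len(dims) - 1, -1, -1):
        for cuboid in levels[k]:
            # Any parent with one extra dimension works (hypothetical choice:
            # simply take the first one found).
            parent = next(p for p in levels[k + 1] if set(cuboid) <= set(p))
            cube[cuboid] = aggregate(cube[parent].items(), parent, cuboid)
    return cube


if __name__ == "__main__":
    # Hypothetical fact table with dimensions A, B, C and one numeric measure.
    rows = [(("a", "x", "1"), 10), (("a", "y", "1"), 5), (("b", "x", "2"), 7)]
    cube = compute_cube(rows, ("A", "B", "C"))
    print(len(cube), "cuboids")
    print(cube[("A",)])
```

In a real MapReduce job, the loop over the cuboids of one level would run in parallel across mappers and reducers; the sketch only shows the dependency structure between levels.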


Acknowledgement

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Science, and Technology (2011-0011824).

Author information

Authors and Affiliations

Department of Computer Science, Kangwon National University, 192-1 Hyoja-dong, Chuncheon, Kangwon, Korea
Suan Lee, Jinho Kim & Yang-Sae Moon
Department of Industrial Engineering, Inha University, 100 Inha-ro, Nam-ku, Incheon, Korea
Wookey Lee

Corresponding author

Correspondence to Jinho Kim.

Editor information

Editors and Affiliations

IRIT, Paul Sabatier University, Toulouse, France
Abdelkader Hameurlain
FAW, University of Linz, Linz, Austria
Josef Küng
FAW, University of Linz, Linz, Austria
Roland Wagner
ICAR-CNR and University of Calabria, Rende, Italy
Alfredo Cuzzocrea
Hewlett-Packard Laboratories, Palo Alto, California, USA
Umeshwar Dayal

About this chapter

Cite this chapter

Lee, S., Kim, J., Moon, Y.-S., Lee, W. (2015). Efficient Level-Based Top-Down Data Cube Computation Using MapReduce. In: Hameurlain, A., Küng, J., Wagner, R., Cuzzocrea, A., Dayal, U. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXI. Lecture Notes in Computer Science, vol 9260. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-47804-2_1

DOI: https://doi.org/10.1007/978-3-662-47804-2_1
Published: 17 July 2015
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-47803-5
Online ISBN: 978-3-662-47804-2
eBook Packages: Computer Science (R0)
