DOI: 10.1145/2628194.2628249
research-article

CloudETL: Scalable Dimensional ETL for Hive

Published: 07 July 2014

ABSTRACT

Extract-Transform-Load (ETL) programs process data into data warehouses (DWs). Rapidly growing data volumes demand systems that scale out. Recently, much attention has been given to MapReduce for parallel handling of massive data sets in cloud environments. Hive is the most widely used RDBMS-like system for DWs on MapReduce and provides scalable analytics. It is, however, challenging to do proper dimensional ETL processing with Hive; e.g., the concept of slowly changing dimensions (SCDs) is not supported (and due to lacking support for UPDATEs, SCDs are complex to handle manually). Also the powerful Pig platform for data processing on MapReduce does not support such dimensional ETL processing. To remedy this, we present the ETL framework CloudETL which uses Hadoop to parallelize ETL execution and to process data into Hive. The user defines the ETL process by means of high-level constructs and transformations and does not have to worry about technical MapReduce details. CloudETL supports different dimensional concepts such as star schemas and SCDs. We present how CloudETL works and uses different performance optimizations including a purpose-specific data placement policy to co-locate data. Further, we present a performance study and compare with other cloud-enabled systems. The results show that CloudETL scales very well and outperforms the dimensional ETL capabilities of Hive both with respect to performance and programmer productivity. For example, Hive uses 3.9 times as long to load an SCD in an experiment and needs 112 statements while CloudETL only needs 4.
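
To make the SCD difficulty concrete, the following is a minimal sketch of a type-2 slowly changing dimension maintained without in-place UPDATEs: each change produces a rewritten copy of the dimension in which the old version is closed and a new version is appended. This is not CloudETL's API and not Hive code; `DimRow`, `apply_scd2`, and all field names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class DimRow:
    # All field names are illustrative assumptions, not taken from the paper.
    key: int                 # surrogate key
    business_id: str         # natural key from the source system
    city: str                # the tracked (slowly changing) attribute
    valid_from: str          # ISO date on which this version became valid
    valid_to: Optional[str]  # None marks the current version
    version: int


def apply_scd2(dim: List[DimRow], business_id: str, city: str,
               load_date: str) -> List[DimRow]:
    """Return a new dimension table with one type-2 change applied.

    The input list is never modified, mimicking a store without UPDATE:
    the whole table is rewritten instead.
    """
    current = next((r for r in dim
                    if r.business_id == business_id and r.valid_to is None),
                   None)
    if current is not None and current.city == city:
        return list(dim)  # attribute unchanged, nothing to version

    # Copy every row, closing the validity interval of the current version.
    out = [replace(r, valid_to=load_date) if r is current else r for r in dim]

    next_key = max((r.key for r in dim), default=0) + 1
    next_version = current.version + 1 if current is not None else 1
    out.append(DimRow(next_key, business_id, city, load_date, None,
                      next_version))
    return out


dim = apply_scd2([], "c1", "Aalborg", "2014-01-01")
dim = apply_scd2(dim, "c1", "Copenhagen", "2014-06-01")
for row in dim:
    print(row)
```

After the two calls, the dimension holds two versions of the same member: the first closed on the second load date, the second open-ended. Expressing the same close-and-append logic as full-table rewrites in a SQL dialect without UPDATE is what makes manual SCD handling verbose.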


Published in

IDEAS '14: Proceedings of the 18th International Database Engineering & Applications Symposium
July 2014
411 pages
ISBN: 9781450326278
DOI: 10.1145/2628194

Copyright © 2014 ACM


Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall acceptance rate: 74 of 210 submissions, 35%
