ABSTRACT
Extract-Transform-Load (ETL) programs process data into data warehouses (DWs). Rapidly growing data volumes demand systems that scale out. Recently, much attention has been given to MapReduce for parallel handling of massive data sets in cloud environments. Hive is the most widely used RDBMS-like system for DWs on MapReduce and provides scalable analytics. It is, however, challenging to do proper dimensional ETL processing with Hive; e.g., the concept of slowly changing dimensions (SCDs) is not supported, and because Hive lacks support for UPDATEs, SCDs are complex to handle manually. The powerful Pig platform for data processing on MapReduce does not support such dimensional ETL processing either. To remedy this, we present the ETL framework CloudETL, which uses Hadoop to parallelize ETL execution and to process data into Hive. The user defines the ETL process by means of high-level constructs and transformations and does not have to worry about technical MapReduce details. CloudETL supports dimensional concepts such as star schemas and SCDs. We present how CloudETL works and which performance optimizations it uses, including a purpose-specific data placement policy to co-locate data. Further, we present a performance study and compare CloudETL with other cloud-enabled systems. The results show that CloudETL scales very well and outperforms the dimensional ETL capabilities of Hive with respect to both performance and programmer productivity. For example, in one experiment Hive takes 3.9 times as long to load an SCD and needs 112 statements, while CloudETL needs only 4.
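To illustrate why SCDs are awkward without UPDATE support, the following is a minimal Python sketch of type-2 SCD versioning done by rewriting the dimension rather than updating rows in place. It is not CloudETL's implementation or API; the names `DimRow` and `apply_scd2`, the column layout (`valid_from`, `valid_to`, `version`), and the single tracked attribute are assumptions made for illustration only.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional


@dataclass(frozen=True)
class DimRow:
    """One version of a dimension member (hypothetical layout)."""
    key: int                  # surrogate key
    business_key: str         # key from the source system
    attribute: str            # the single tracked attribute
    valid_from: str           # ISO date the version became valid
    valid_to: Optional[str]   # None marks the current version
    version: int


def apply_scd2(dimension: List[DimRow],
               changes: Dict[str, str],
               load_date: str) -> List[DimRow]:
    """Return a NEW dimension with type-2 changes applied.

    No row is updated in place: the whole dimension is rewritten,
    which is the only option when UPDATE is unavailable.
    """
    next_key = max((r.key for r in dimension), default=0) + 1
    result: List[DimRow] = []
    seen = set()  # business keys that already have a current version

    for row in dimension:
        if row.valid_to is None:
            seen.add(row.business_key)
            new_value = changes.get(row.business_key)
            if new_value is not None and new_value != row.attribute:
                # Close the old version and append the new current one.
                result.append(replace(row, valid_to=load_date))
                result.append(DimRow(next_key, row.business_key, new_value,
                                     load_date, None, row.version + 1))
                next_key += 1
                continue
        result.append(row)

    # Members not seen before get their first version.
    for business_key, value in changes.items():
        if business_key not in seen:
            result.append(DimRow(next_key, business_key, value,
                                 load_date, None, 1))
            next_key += 1
    return result


if __name__ == "__main__":
    dim = [DimRow(1, "p1", "Aalborg", "2014-01-01", None, 1)]
    for r in apply_scd2(dim, {"p1": "Odense", "p2": "Aarhus"}, "2014-06-11"):
        print(r)
```

Even in this simplified form, each load has to read every existing row, decide whether to expire it, and write the full result back, which hints at why expressing the same logic manually in an UPDATE-less SQL dialect takes many statements.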
Index Terms
- CloudETL: scalable dimensional ETL for hive