research-article

CloudETL: scalable dimensional ETL for hive

Authors:
Xiufeng Liu (University of Waterloo)
Christian Thomsen (Aalborg University)
Torben Bach Pedersen (Aalborg University)

IDEAS '14: Proceedings of the 18th International Database Engineering & Applications Symposium, July 2014, Pages 195–206. https://doi.org/10.1145/2628194.2628249

Published: 07 July 2014

ABSTRACT

Extract-Transform-Load (ETL) programs process data into data warehouses (DWs). Rapidly growing data volumes demand systems that scale out. Recently, much attention has been given to MapReduce for parallel handling of massive data sets in cloud environments. Hive is the most widely used RDBMS-like system for DWs on MapReduce and provides scalable analytics. It is, however, challenging to do proper dimensional ETL processing with Hive; e.g., the concept of slowly changing dimensions (SCDs) is not supported (and because Hive lacks support for UPDATEs, SCDs are complex to handle manually). The powerful Pig platform for data processing on MapReduce does not support such dimensional ETL processing either. To remedy this, we present the ETL framework CloudETL, which uses Hadoop to parallelize ETL execution and to process data into Hive. The user defines the ETL process by means of high-level constructs and transformations and does not have to worry about technical MapReduce details. CloudETL supports different dimensional concepts such as star schemas and SCDs. We present how CloudETL works and which performance optimizations it uses, including a purpose-specific data placement policy to co-locate data. Further, we present a performance study and compare CloudETL with other cloud-enabled systems. The results show that CloudETL scales very well and outperforms the dimensional ETL capabilities of Hive with respect to both performance and programmer productivity. For example, in one experiment Hive takes 3.9 times as long to load an SCD and needs 112 statements, while CloudETL needs only 4.
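
To make the SCD problem mentioned in the abstract concrete, the following is a minimal in-memory sketch of a type-2 slowly changing dimension update. It is not CloudETL's API and not the authors' implementation; the names DimRow and apply_scd2, and the column names, are hypothetical and chosen only for illustration. The sketch shows why missing UPDATE support is awkward: closing the old version of a row means writing a modified copy of it rather than changing it in place.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class DimRow:
    """One version of a dimension member (hypothetical layout)."""
    key: int                   # surrogate key
    business_key: str          # natural key from the source system
    attribute: str             # the tracked, slowly changing attribute
    valid_from: date
    valid_to: Optional[date]   # None marks the current version
    version: int


def apply_scd2(dim: List[DimRow], business_key: str,
               attribute: str, load_date: date) -> List[DimRow]:
    """Return a new dimension list with a type-2 change applied."""
    # Find the current (open-ended) version for this business key, if any.
    current = next(
        (r for r in dim
         if r.business_key == business_key and r.valid_to is None),
        None,
    )
    if current is not None and current.attribute == attribute:
        return list(dim)  # attribute unchanged: nothing to do

    result = []
    for r in dim:
        if r is current:
            # Close the old version. Without UPDATE, this is a rewrite
            # of the row, not an in-place modification.
            result.append(replace(r, valid_to=load_date))
        else:
            result.append(r)

    # Append the new current version with a fresh surrogate key.
    next_key = max((r.key for r in dim), default=0) + 1
    next_version = current.version + 1 if current is not None else 1
    result.append(
        DimRow(next_key, business_key, attribute, load_date, None, next_version)
    )
    return result
```

Calling apply_scd2 twice for the same business key with different attribute values yields two rows: the first closed at the second load date, the second open-ended with version 2. Calling it again with an unchanged value leaves the dimension as it was.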

Published in
IDEAS '14: Proceedings of the 18th International Database Engineering & Applications Symposium
July 2014
411 pages
ISBN: 9781450326278
DOI: 10.1145/2628194
Editors: Ana Maria Almeida (ISEP), Jorge Bernardino (CISUC-Polytechnic Institute of Coimbra), Elsa Ferreira Gomes (ISEP)
General Chairs: Bipin C. Desai (Concordia University), Jorge Bernardino (CISUC-Polytechnic Institute of Coimbra)
Program Chairs: Ana Maria Almeida (ISEP), Bipin C. Desai (Concordia University)
Copyright © 2014 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
ETL
MapReduce
hive
Acceptance Rates
Overall Acceptance Rate: 74 of 210 submissions, 35%
Article Metrics
- Total Citations: 19
- Total Downloads: 426
- Downloads (Last 12 months): 35
- Downloads (Last 6 weeks): 3