Definition
ETL is short for Extract-Transform-Load. The ETL process extracts data from operational source systems, transforms the data, and loads the data into a target. The transformations can involve many different activities, e.g., filtering, normalization or de-normalization to a desired form, joins, conversion, and cleansing to remove bad or dirty data. In the ELT variant, the data is extracted from the source systems, loaded in its raw form into the target, and then transformed there.
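The difference between the two orderings can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration only: the source rows, the table names (sales, raw_sales), the function names, and the use of an in-memory SQLite database as the target. It is not taken from any particular ETL tool. In the ETL function, cleansing and conversion happen in application code before loading; in the ELT function, the raw rows are loaded first and the same transformations are expressed in SQL inside the target.

```python
import sqlite3

# Hypothetical operational source data: (order_id, customer, amount as text).
SOURCE_ROWS = [
    (1, " alice ", "10.50"),
    (2, "BOB", "n/a"),  # dirty row: the amount is not a number
    (3, "carol", "7.25"),
]


def extract():
    """Extract: read the rows from the (here simulated) source system."""
    return list(SOURCE_ROWS)


def transform(rows):
    """Transform: drop rows with bad amounts, normalize names, convert types."""
    clean = []
    for order_id, customer, amount in rows:
        try:
            value = float(amount)  # conversion from text to number
        except ValueError:
            continue  # cleansing: skip rows whose amount cannot be parsed
        clean.append((order_id, customer.strip().lower(), value))
    return clean


def load(conn, rows):
    """Load: write the already transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE sales (order_id INTEGER, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)


def run_etl():
    """ETL: transform outside the target, then load the result."""
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract()))
    return conn.execute(
        "SELECT order_id, customer, amount FROM sales ORDER BY order_id"
    ).fetchall()


def run_elt():
    """ELT: load the raw rows first, then transform inside the target."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE raw_sales (order_id INTEGER, customer TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", extract())
    # The GLOB pattern is a deliberately crude check for a decimal number.
    conn.execute(
        """
        CREATE TABLE sales AS
        SELECT order_id,
               lower(trim(customer)) AS customer,
               CAST(amount AS REAL) AS amount
        FROM raw_sales
        WHERE amount GLOB '[0-9]*.[0-9]*'
        """
    )
    return conn.execute(
        "SELECT order_id, customer, amount FROM sales ORDER BY order_id"
    ).fetchall()


if __name__ == "__main__":
    print(run_etl())
    print(run_elt())
```

Both functions produce the same cleaned table in this toy setting; what differs is where the transformation work is done, which in practice determines whether the target system's own processing capacity is used for it.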
Overview
The term ETL process has traditionally been used for a process that populates a data warehouse (DW) managed by a relational database management system (RDBMS). As pointed out by Simitsis and Vassiliadis (2017), the basic concept of populating a data store with data reshaped from another data store is, however, older than data warehousing. The ETL process can be hand-coded or made with a designated ETL tool where the developer...
References
Akidau T, Bradshaw R, Chambers C, Chernyak S, Fernández-Moctezuma RJ, Lax R, McVeety S, Mills D, Perry F, Schmidt E et al (2015) The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. Proc VLDB Endow 8(12):1792–1803
Armbrust M, Xin RS, Lian C, Huai Y, Liu D, Bradley JK, Meng X, Kaftan T, Franklin MJ, Ghodsi A et al (2015) Spark SQL: relational data processing in spark. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data. ACM, pp 1383–1394
Dageville B, Cruanes T, Zukowski M, Antonov V, Avanes A, Bock J, Claybaugh J, Engovatov D, Hentschel M, Huang J, Lee AW, Motivala A, Munir AQ, Pelley S, Povinec P, Rahn G, Triantafyllis S, Unterbrunner P (2016) The snowflake elastic data warehouse. In: Proceedings of the 2016 international conference on management of data, SIGMOD’16, New York. ACM, pp 215–226. ISBN:978-1-4503-3531-7. http://doi.acm.org/10.1145/2882903.2903741
Dean J, Ghemawat S (2004) Mapreduce: simplified data processing on large clusters. In: 6th symposium on operating system design and implementation (OSDI 2004), San Francisco, 6–8 Dec 2004, pp 137–150. http://www.usenix.org/events/osdi04/tech/dean.html
Gupta A, Agarwal D, Tan D, Kulesza J, Pathak R, Stefani S, Srinivasan V (2015) Amazon redshift and the case for simpler data warehouses. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data, SIGMOD’15, New York. ACM, pp 1917–1923. ISBN:978-1-4503-2758-9. http://doi.acm.org/10.1145/2723372.2742795
Kandel S, Paepcke A, Hellerstein J, Heer J (2011) Wrangler: interactive visual specification of data transformation scripts. In: Proceedings of the SIGCHI conference on human factors in computing systems. ACM, pp 3363–3372
Khayyat Z, Ilyas IF, Jindal A, Madden S, Ouzzani M, Papotti P, Quiané-Ruiz J-A, Tang N, Yin S (2015) Bigdansing: a system for big data cleansing. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data. ACM, pp 1215–1230
Kimball R (2008) The data warehouse lifecycle toolkit. Wiley, Hoboken
Liu X, Thomsen C, Pedersen TB (2013) ETLMR: a highly scalable dimensional ETL framework based on MapReduce. In: Hameurlain A, Küng J, Wagner RR (eds) Transactions on large-scale data- and knowledge-centered systems VIII. Springer, Heidelberg/New York, pp 1–31
Liu X, Thomsen C, Pedersen TB (2014) CloudETL: scalable dimensional ETL for Hive. In: Proceedings of the 18th international database engineering & applications symposium. ACM, pp 195–206
Olston C, Reed B, Srivastava U, Kumar R, Tomkins A (2008) Pig latin: a not-so-foreign language for data processing. In: Proceedings of the 2008 ACM SIGMOD international conference on management of data. ACM, pp 1099–1110
Özcan F, Hoa D, Beyer KS, Balmin A, Liu CJ, Li Y (2011) Emerging trends in the enterprise data analytics: connecting hadoop and db2 warehouse. In: Proceedings of the 2011 ACM SIGMOD international conference on management of data. ACM, pp 1161–1164
Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: Parallel analysis with Sawzall. Sci Program 13(4):277–298
Simitsis A, Vassiliadis P (2017) Extraction, transformation, and loading. Springer, New York, pp 1–9. ISBN 978-1-4899-7993-3. https://doi.org/10.1007/978-1-4899-7993-3_158-3
Stonebraker M, Bruckner D, Ilyas IF, Beskales G, Cherniack M, Zdonik SB, Pagan A, Xu S (2013) Data curation at scale: the data tamer system. In: CIDR
Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endow 2(2):1626–1629
Tigani J, Naidu S (2014) Google BigQuery analytics. Wiley, Indianapolis
Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this entry
Cite this entry
Thomsen, C. (2018). ETL. In: Sakr, S., Zomaya, A. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-63962-8_11-1
DOI: https://doi.org/10.1007/978-3-319-63962-8_11-1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-63962-8
Online ISBN: 978-3-319-63962-8
eBook Packages: Springer Reference Mathematics; Reference Module Computer Science and Engineering