ETL

Thomsen, Christian

doi:10.1007/978-3-319-63962-8_11-1

Synonyms

ELT; Extract-Transform-Load

Definition

ETL is short for Extract-Transform-Load. The ETL process extracts data from operational source systems, transforms the data, and loads it into a target. The transformations can involve many different activities, e.g., filtering, normalization or denormalization to a desired form, joins, conversion, and cleansing to remove bad or dirty data. In the ELT variant, the data is extracted from the source systems, loaded in its raw form into the target, and then transformed there.
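The difference between the two variants can be illustrated with a minimal sketch. It is not taken from the entry: the source rows, the table names (`sales`, `raw_sales`), and the function names are hypothetical, and an in-memory SQLite database stands in for the target. In the ETL path the rows are filtered, cleansed, and converted before loading; in the ELT path the raw rows are loaded first and reshaped with SQL inside the target.

```python
import sqlite3

# Hypothetical operational source rows: (customer, country, amount as text).
SOURCE_ROWS = [
    ("alice", "dk", "10.50"),
    ("bob", "DK", "n/a"),    # dirty amount, dropped by cleansing
    ("carol", "se", "7.25"),
    ("", "se", "3.00"),      # missing customer, dropped by filtering
]

QUERY = "SELECT customer, country, amount FROM sales ORDER BY customer"


def extract():
    """Extract: read the raw rows from the (here in-memory) source."""
    return list(SOURCE_ROWS)


def transform(rows):
    """Transform: filter, cleanse, and convert rows before loading."""
    clean = []
    for customer, country, amount in rows:
        if not customer:
            continue  # filtering: skip rows without a customer
        try:
            value = float(amount)  # conversion: text to number
        except ValueError:
            continue  # cleansing: skip rows with an unparsable amount
        clean.append((customer, country.upper(), value))
    return clean


def load(conn, rows):
    """Load: write the already transformed rows into the target table."""
    conn.execute("CREATE TABLE sales (customer TEXT, country TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl():
    """ETL: transform outside the target, then load the result."""
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract()))
    return conn


def run_elt():
    """ELT: load the raw rows, then transform them inside the target."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_sales (customer TEXT, country TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", extract())
    conn.execute(
        """
        CREATE TABLE sales AS
        SELECT customer,
               UPPER(country) AS country,
               CAST(amount AS REAL) AS amount
        FROM raw_sales
        WHERE customer <> '' AND amount GLOB '[0-9]*.[0-9]*'
        """
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    print(run_etl().execute(QUERY).fetchall())
    print(run_elt().execute(QUERY).fetchall())
```

Both paths end with the same two cleaned rows in `sales`. The sketch only shows where the transformation runs; a real ETL process would also deal with incremental loads, error handling, and much larger data volumes.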

Overview

The term ETL process has traditionally been used for a process that populates a data warehouse (DW) managed by a relational database management system (RDBMS). As pointed out by Simitsis and Vassiliadis (2017), the basic concept of populating a data store with data reshaped from another data store is, however, older than data warehousing. The ETL process can be hand-coded or made with a designated ETL tool where the developer...

References

Akidau T, Bradshaw R, Chambers C, Chernyak S, Fernández-Moctezuma RJ, Lax R, McVeety S, Mills D, Perry F, Schmidt E et al (2015) The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. Proc VLDB Endow 8(12):1792–1803
Armbrust M, Xin RS, Lian C, Huai Y, Liu D, Bradley JK, Meng X, Kaftan T, Franklin MJ, Ghodsi A et al (2015) Spark SQL: relational data processing in Spark. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data. ACM, pp 1383–1394
Dageville B, Cruanes T, Zukowski M, Antonov V, Avanes A, Bock J, Claybaugh J, Engovatov D, Hentschel M, Huang J, Lee AW, Motivala A, Munir AQ, Pelley S, Povinec P, Rahn G, Triantafyllis S, Unterbrunner P (2016) The Snowflake elastic data warehouse. In: Proceedings of the 2016 international conference on management of data, SIGMOD’16, New York. ACM, pp 215–226. ISBN:978-1-4503-3531-7. http://doi.acm.org/10.1145/2882903.2903741
Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: 6th symposium on operating system design and implementation (OSDI 2004), San Francisco, 6–8 Dec 2004, pp 137–150. http://www.usenix.org/events/osdi04/tech/dean.html
Gupta A, Agarwal D, Tan D, Kulesza J, Pathak R, Stefani S, Srinivasan V (2015) Amazon Redshift and the case for simpler data warehouses. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data, SIGMOD’15, New York. ACM, pp 1917–1923. ISBN:978-1-4503-2758-9. http://doi.acm.org/10.1145/2723372.2742795
Kandel S, Paepcke A, Hellerstein J, Heer J (2011) Wrangler: interactive visual specification of data transformation scripts. In: Proceedings of the SIGCHI conference on human factors in computing systems. ACM, pp 3363–3372
Khayyat Z, Ilyas IF, Jindal A, Madden S, Ouzzani M, Papotti P, Quiané-Ruiz J-A, Tang N, Yin S (2015) BigDansing: a system for big data cleansing. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data. ACM, pp 1215–1230
Kimball R (2008) The data warehouse lifecycle toolkit. Wiley, Hoboken
Liu X, Thomsen C, Pedersen TB (2013) ETLMR: a highly scalable dimensional ETL framework based on MapReduce. In: Hameurlain A, Küng J, Wagner RR (eds) Transactions on large-scale data- and knowledge-centered systems VIII. Springer, Heidelberg/New York, pp 1–31
Liu X, Thomsen C, Pedersen TB (2014) CloudETL: scalable dimensional ETL for Hive. In: Proceedings of the 18th international database engineering & applications symposium. ACM, pp 195–206
Olston C, Reed B, Srivastava U, Kumar R, Tomkins A (2008) Pig Latin: a not-so-foreign language for data processing. In: Proceedings of the 2008 ACM SIGMOD international conference on management of data. ACM, pp 1099–1110
Özcan F, Hoa D, Beyer KS, Balmin A, Liu CJ, Li Y (2011) Emerging trends in the enterprise data analytics: connecting Hadoop and DB2 warehouse. In: Proceedings of the 2011 ACM SIGMOD international conference on management of data. ACM, pp 1161–1164
Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with Sawzall. Sci Program 13(4):277–298
Simitsis A, Vassiliadis P (2017) Extraction, transformation, and loading. Springer, New York, pp 1–9. ISBN 978-1-4899-7993-3. https://doi.org/10.1007/978-1-4899-7993-3_158-3
Stonebraker M, Bruckner D, Ilyas IF, Beskales G, Cherniack M, Zdonik SB, Pagan A, Xu S (2013) Data curation at scale: the Data Tamer system. In: CIDR
Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endow 2(2):1626–1629
Tigani J, Naidu S (2014) Google BigQuery analytics. Wiley, Indianapolis
Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2

Author information

Authors and Affiliations

Department of Computer Science, Aalborg University, Aalborg, Denmark
Christian Thomsen

Corresponding author

Correspondence to Christian Thomsen.

Editor information

Editors and Affiliations

School of Comp. Sci. and Engineering, University of New South Wales, Eveleigh, New South Wales, Australia
Sherif Sakr
Sch of Info Techno, Building J12, University of Sydney, Sydney, Australia
Albert Zomaya

Section Editor information

Database Systems Group, Technische Universität Dresden, 01062, Dresden, Saxony, Deutschland
Maik Thiele

About this entry

Cite this entry

Thomsen, C. (2018). ETL. In: Sakr, S., Zomaya, A. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-63962-8_11-1

DOI: https://doi.org/10.1007/978-3-319-63962-8_11-1
Published: 19 April 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-63962-8
Online ISBN: 978-3-319-63962-8
eBook Packages: Springer Reference Mathematics; Reference Module Computer Science and Engineering
