Abstract
ETL (extract, transform, load) is the standard process for creating and maintaining a data warehouse (DW), and it is the most resource-, cost- and time-demanding part of DW implementation and maintenance. Many graphical user interface (GUI)-based solutions are now available to facilitate ETL processes. Despite the popularity of GUI-based tools, this approach still has some drawbacks. This paper focuses on an alternative approach to ETL development: hand coding. In some contexts, such as research and academic work, a custom-coded solution is appropriate and can be cheaper, faster and more maintainable than GUI-based tools. This article studies several well-known code-based open-source ETL tools developed in academia, addressing their architecture and implementation details. The aim of the paper is to present a comparative evaluation of these code-based ETL tools. Finally, an efficient ETL model is designed to meet present-day near real-time requirements.
Cite this article
Biswas, N., Sarkar, A. & Mondal, K.C. Efficient incremental loading in ETL processing for real-time data integration. Innovations Syst Softw Eng 16, 53–61 (2020). https://doi.org/10.1007/s11334-019-00344-4