Efficient incremental loading in ETL processing for real-time data integration

  • S.I. : CICBA 2018
Innovations in Systems and Software Engineering

Abstract

ETL (extract, transform, load) is the standard, widely used process for creating and maintaining a data warehouse (DW). It is also the most resource-, cost- and time-intensive part of DW implementation and maintenance. Many graphical user interface (GUI)-based solutions are now available to facilitate ETL processes. Despite the popularity of GUI-based tools, the approach still has some downsides. This paper focuses on an alternative approach to ETL development: hand coding. In some contexts, such as research and academic work, a custom-coded solution is appropriate and can be cheaper, faster and more maintainable than GUI-based tools. This article studies some well-known code-based open-source ETL tools developed in academia, addressing their architecture and implementation details. The aim of the paper is to present a comparative evaluation of these code-based ETL tools. Finally, an efficient ETL model is designed to meet present-day near real-time requirements.

Author information

Corresponding author

Correspondence to Kartick Chandra Mondal.

About this article

Cite this article

Biswas, N., Sarkar, A. & Mondal, K.C. Efficient incremental loading in ETL processing for real-time data integration. Innovations Syst Softw Eng 16, 53–61 (2020). https://doi.org/10.1007/s11334-019-00344-4

  • DOI: https://doi.org/10.1007/s11334-019-00344-4
