Data integration from traditional to big data: main features and comparisons of ETL approaches

The Journal of Supercomputing

Abstract

Data integration combines information from different sources to provide a comprehensive view for making informed business decisions. The ETL (Extract, Transform, and Load) process is essential in data integration. In the past two decades, modeling the ETL process has become a priority for effectively managing information. This paper aims to explore ETL approaches to help researchers and organizational stakeholders overcome challenges, especially in Big Data integration. It offers a comprehensive overview of ETL methods, from traditional to Big Data, and discusses their advantages, limitations, and the primary trends in Big Data integration. The study emphasizes that many technologies have been integrated into ETL steps for data collection, storage, processing, querying, and analysis without proper modeling. Therefore, more generic and customized design modeling of the ETL steps should be carried out to ensure reusability and flexibility. The paper summarizes the exploration of ETL modeling, focusing on Big Data scalability and processing trends. It also identifies critical dilemmas, such as ensuring compatibility across multiple sources and dealing with large volumes of Big Data. Furthermore, it suggests future directions in Big Data integration by leveraging advanced artificial intelligence processing and storage systems to ensure consistency, efficiency, and data integrity.

Data availability

No datasets were generated or analyzed during the current study.

Author information

Contributions

The authors collaborated in writing the manuscript.

Corresponding author

Correspondence to Afef Walha.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Walha, A., Ghozzi, F. & Gargouri, F. Data integration from traditional to big data: main features and comparisons of ETL approaches. J Supercomput 80, 26687–26725 (2024). https://doi.org/10.1007/s11227-024-06413-1

  • DOI: https://doi.org/10.1007/s11227-024-06413-1