Abstract
Data integration combines information from different sources to provide a comprehensive view for informed business decisions, and the ETL (Extract, Transform, and Load) process is central to it. Over the past two decades, modeling the ETL process has become a priority for managing information effectively. This paper explores ETL approaches to help researchers and organizational stakeholders overcome challenges, especially in Big Data integration. It offers a comprehensive overview of ETL methods, from traditional to Big Data, and discusses their advantages, limitations, and the primary trends in Big Data integration. The study emphasizes that many technologies have been integrated into the ETL steps for data collection, storage, processing, querying, and analysis without proper modeling. More generic and customized design modeling of the ETL steps is therefore needed to ensure reusability and flexibility. The paper summarizes the exploration of ETL modeling, focusing on Big Data scalability and processing trends. It also identifies critical challenges, such as ensuring compatibility across multiple sources and handling large volumes of data. Finally, it suggests future directions for Big Data integration that leverage advanced artificial intelligence processing and storage systems to ensure consistency, efficiency, and data integrity.








Data availability
No datasets were generated or analyzed during the current study.
Author information
Contributions
Authors collaborated in writing the manuscript.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
About this article
Cite this article
Walha, A., Ghozzi, F. & Gargouri, F. Data integration from traditional to big data: main features and comparisons of ETL approaches. J Supercomput 80, 26687–26725 (2024). https://doi.org/10.1007/s11227-024-06413-1
DOI: https://doi.org/10.1007/s11227-024-06413-1