Abstract
We live in an open and connected world in which small, medium, and large companies seek to integrate data from various sources to satisfy the requirements of new applications, such as delivering real-time alerts, triggering automated actions, detecting complex system failures, and detecting anomalies. The process of moving these data from their sources to the home system in an efficient and correct manner is known as data ingestion; it is usually referred to as Extract, Transform, Load (ETL) and has been widely studied in the context of data warehouses. With rapidly changing technology and the explosion of data sources, ETL processes have to address two main issues: (a) the variety of data sources, which span traditional, XML, semantic, and graph databases, among others, and (b) the variety of storage platforms, since the home system may comprise several stores (known as a polystore), each hosting a particular type of data. These issues directly affect the efficiency and deployment flexibility of ETL. In this paper, we address both. First, using Model Driven Engineering, we make the different types of data sources generic. This genericity allows the ETL operators to be overloaded for each type of source, and we illustrate it with three of the most popular types of data sources: relational, semantic, and graph databases. Second, we show the impact of operator genericity on the ETL workflow, for which we give a Web-service-driven approach to orchestrating the ETL flows. Third, the extracted and merged data produced by the ETL workflow are deployed to their preferred stores. Finally, our findings are validated through a proof-of-concept tool using the LUBM semantic database and the Yago graph deployed in Oracle RDF Semantic Graph 12c.
Copyright information
© 2018 Springer-Verlag GmbH Germany, part of Springer Nature
About this chapter
Cite this chapter
Berkani, N., Bellatreche, L., Guittet, L. (2018). ETL Processes in the Era of Variety. In: Hameurlain, A., Wagner, R., Benslimane, D., Damiani, E., Grosky, W. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXXIX. Lecture Notes in Computer Science(), vol 11310. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-58415-6_4
DOI: https://doi.org/10.1007/978-3-662-58415-6_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-58414-9
Online ISBN: 978-3-662-58415-6
eBook Packages: Computer Science, Computer Science (R0)