Abstract
An ETL process is used to extract data from various sources, transform it and load it into a Data Warehouse. In this paper, we analyse an ETL flow and observe that only some of the dependencies in an ETL flow are essential while others are basically represents the flow of data. For the linear flows, we exploit the underlying dependency graph and develop a greedy heuristic technique to determine a reordering that significantly improves the quality of the flow. Rather than adopting a state-space search approach, we use the cost functions and selectivities to determine the best option at each position in a right-to-left manner. To deal with complex flows, we identify activities that can be transferred between linear segments in it and position those activities appropriately. We then use the re-orderings of the linear segments to obtain a cost-optimal semantically equivalent flow for a given complex flow. Experimental evaluation has shown that by using the proposed techniques, ETL flows can be better optimized and with much less effort compared to existing methods.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Inmon, W.: Building the Data Warehouse, 3rd edn. Wiley & Sons, New York (2002)
Vassiliadis, P., Simitsis, A., Skiadopoulos, S.: Conceptual Modeling for ETL Processes. In: Proceedings of the 5th ACM International Workshop on Data Warehousing and OLAP (DOLAP 2002), pp. 14–21. ACM, New York (2002)
Trujillo, J., Luján-Mora, S.: A UML Based Approach for Modeling ETL Processes in Data Warehouses. In: Song, I.-Y., Liddle, S.W., Ling, T.-W., Scheuermann, P. (eds.) ER 2003. LNCS, vol. 2813, pp. 307–320. Springer, Heidelberg (2003)
Eckerson, W., White, C.: http://www.dw-institute.com/etlreport (2003)
IBM: IBM data warehouse manager, www3.ibm.com/software/data/db2/datawarehouse
Oracle: Oracle warehouse builder 11g, http://www.oracle.com/technology/products/warehouse/
Informatica: PowerCenter, http://www.informatica.com/products/data+integration/powercenter/default.htm
Simitsis, A., Vassiliadis, P., Sellis, T.: State-Space Optimization of ETL Workflows. IEEE Trans. on Knowl. and Data Eng. 17(10), 1404–1419 (2005)
Vassiliadis, P., Simitsis, A., Spiros, S.: Modeling ETL Activities as Graphs. In: 4th International Workshop on the Design and Management of Data Warehouses (DMDW 2002), pp. 52–61. IEEE Computer Society, Toronto (2002)
Vassiliadis, P., Simitsis, A., Baikousi, E.: A taxonomy of etl activities. In: DOLAP 2009: Proceeding of the ACM Twelfth International Workshop on Data Warehousing and OLAP, pp. 25–32. ACM, New York (2009)
Tziovara, V., Vassiliadis, P., Simitsis, A.: Deciding the Physical Implementation of ETL Workflows. In: Proceedings of the ACM Tenth International Workshop on Data Warehousing and OLAP (DOLAP 2007), pp. 49–56. ACM, New York (2007)
Vassiliadis, P., Karagiannis, A., Tziovara, V., Simitsis, A.: Towards a Benchmark for ETL Workflows. In: Proceedings of the 5th International Workshop on Quality in Databases (QDB 2007), in Conjunction with the 33rd International Conference on Very Large Data Bases (VLDB 2007), pp. 117–137 (2007)
Elmasri, R., Navathe, S.: Fundamentals of Database Systems. Addison-Wesley Pubs., Reading (2000)
Simitsis, A., Vassiliadis, P., Sellis, T.: Optimizing ETL Processes in Data Warehouses. In: Proceedings of the 21st International Conference on Data Engineering (ICDE 2005), pp. 564–575. IEEE Computer Society, Washington, DC (2005)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kumar, N., Kumar, P.S. (2011). An Efficient Heuristic for Logical Optimization of ETL Workflows. In: Castellanos, M., Dayal, U., Markl, V. (eds) Enabling Real-Time Business Intelligence. BIRTE 2010. Lecture Notes in Business Information Processing, vol 84. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22970-1_6
Download citation
DOI: https://doi.org/10.1007/978-3-642-22970-1_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-22969-5
Online ISBN: 978-3-642-22970-1
eBook Packages: Computer ScienceComputer Science (R0)