Optimization of Analytic Data Flows for Next Generation Business Intelligence Applications

  • Conference paper
Topics in Performance Evaluation, Measurement and Characterization (TPCTC 2011)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 7144)

Abstract

This paper addresses the challenge of optimizing analytic data flows for modern business intelligence (BI) applications. We first describe how BI in today’s enterprises has evolved from batch-based processes, in which the back-end extract-transform-load (ETL) stage was separate from the front-end query and analytics stages, to near real-time data flows that fuse the back-end and front-end stages. We describe industry trends that force new BI architectures, such as mobile and cloud computing, semi-structured content, event and content streams, and a variety of execution engine architectures. For execution engines, the consequence of “one size does not fit all” is that BI queries and analytic applications now require complicated information flows, as data is moved among data engines and queries span systems. In addition, new quality-of-service objectives are desired that incorporate measures beyond performance, such as freshness (latency), reliability, and accuracy. Existing approaches that optimize data flows solely for performance on a single system or a homogeneous cluster are insufficient. This paper describes our research to address the challenge of optimizing this new type of flow. We leverage concepts from earlier work in federated databases, but we face a much larger search space due to new objectives and a larger set of operators. We describe our initial optimizer, which supports multiple objectives over a single processing engine. We then describe our research in optimizing flows for multiple engines and objectives, and the challenges that remain.

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Dayal, U., Wilkinson, K., Simitsis, A., Castellanos, M., Paz, L. (2012). Optimization of Analytic Data Flows for Next Generation Business Intelligence Applications. In: Nambiar, R., Poess, M. (eds) Topics in Performance Evaluation, Measurement and Characterization. TPCTC 2011. Lecture Notes in Computer Science, vol 7144. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32627-1_4

  • DOI: https://doi.org/10.1007/978-3-642-32627-1_4

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-32626-4

  • Online ISBN: 978-3-642-32627-1

  • eBook Packages: Computer Science, Computer Science (R0)
