Multi-engine Analytics with IReS

  • Conference paper
Real-Time Business Intelligence and Analytics (BIRTE 2015, BIRTE 2016, BIRTE 2017)

Abstract

We present IReS, the Intelligent Resource Scheduler, which can abstractly describe, optimize and execute any batch analytics workflow with respect to a multi-objective policy. Relying on cost and performance models of the required tasks over the available platforms, IReS assigns each part of a workflow to the most advantageous execution and/or storage engine among those available and decides on the exact amount of resources to provision. Moreover, IReS adapts efficiently to the current cluster and engine conditions and recovers from failures by monitoring the workflow execution in real time. Our current prototype has been tested on a wide range of business-driven and synthetic workflows, demonstrating its potential to yield significant gains in cost and performance compared to statically scheduled, single-engine executions. IReS incurs only marginal overhead on workflow execution performance, discovering an approximate Pareto-optimal set of execution plans within a few seconds.
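To make the notion of a Pareto-optimal set of execution plans concrete, the following is a minimal sketch, not the IReS planner itself. It assumes a two-objective policy (execution time and monetary cost, both minimized) and uses hypothetical plan names and made-up estimates. The `Plan` class and the `dominates` and `pareto_front` helpers are illustrative names that do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """A candidate execution plan with estimated objectives (hypothetical values)."""
    name: str
    exec_time_s: float  # estimated execution time, lower is better
    cost_usd: float     # estimated monetary cost, lower is better


def dominates(a: Plan, b: Plan) -> bool:
    """True if plan a is at least as good as b in both objectives and strictly better in one."""
    no_worse = a.exec_time_s <= b.exec_time_s and a.cost_usd <= b.cost_usd
    strictly_better = a.exec_time_s < b.exec_time_s or a.cost_usd < b.cost_usd
    return no_worse and strictly_better


def pareto_front(plans: list[Plan]) -> list[Plan]:
    """Keep only the plans that no other plan dominates (brute force, O(n^2))."""
    return [p for p in plans if not any(dominates(q, p) for q in plans)]


# Illustrative candidates only; the engine combinations and numbers are assumptions.
candidates = [
    Plan("all-spark", 120.0, 4.0),
    Plan("all-hadoop", 300.0, 2.5),
    Plan("spark+monetdb", 90.0, 5.0),
    Plan("hadoop+spark", 310.0, 3.0),  # slower and costlier than all-hadoop
]

front = pareto_front(candidates)
for plan in front:
    print(plan.name, plan.exec_time_s, plan.cost_usd)
```

Here `hadoop+spark` is discarded because `all-hadoop` is both faster and cheaper, while the remaining three plans each represent a different trade-off between time and cost. An exhaustive filter like this only scales to small candidate sets; a real planner would need a search strategy over a much larger plan space, which this sketch does not attempt to model.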

Notes

  1. https://github.com/project-asap/IReS-Platform

  2. ASAP (Adaptive Scalable Analytics Platform) envisions a unified execution framework for scalable data analytics. www.asap-fp7.eu/

Author information

Corresponding author

Correspondence to Katerina Doka.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Doka, K., Mytilinis, I., Papailiou, N., Giannakouris, V., Tsoumakos, D., Koziris, N. (2019). Multi-engine Analytics with IReS. In: Castellanos, M., Chrysanthis, P., Pelechrinis, K. (eds) Real-Time Business Intelligence and Analytics. BIRTE 2015, BIRTE 2016, BIRTE 2017. Lecture Notes in Business Information Processing, vol 337. Springer, Cham. https://doi.org/10.1007/978-3-030-24124-7_9

  • DOI: https://doi.org/10.1007/978-3-030-24124-7_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-24123-0

  • Online ISBN: 978-3-030-24124-7

  • eBook Packages: Computer Science, Computer Science (R0)
