Finding the right cloud configuration for analytics clusters

Authors: Muhammad Bilal, Rodrigo Rodrigues

SoCC '20: Proceedings of the 11th ACM Symposium on Cloud Computing

Pages 208 - 222

https://doi.org/10.1145/3419111.3421305

Published: 12 October 2020

Abstract

Finding good cloud configurations for deploying a single distributed system is already a challenging task, and it becomes substantially harder when a data analytics cluster is formed by multiple distributed systems, since the search space becomes exponentially larger. In particular, recent proposals for single-system deployments rely on benchmarking runs that become prohibitively expensive under joint optimization of multiple systems, because users have to wait until the end of a long optimization run before starting the production run of their job.
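
The growth of the search space can be illustrated with a small calculation. The sketch below assumes, purely for illustration, that each system in the cluster is configured independently and offers a fixed number of candidate configurations; the function name and the numbers are hypothetical and do not come from the paper.

```python
import math


def joint_space_size(options_per_system):
    """Count joint configurations when every system is configured independently.

    options_per_system: number of candidate configurations for each system
    (hypothetical values, chosen only to show the multiplicative growth).
    """
    return math.prod(options_per_system)


# One system with 20 candidate configurations.
print(joint_space_size([20]))          # 20
# Three such systems optimized jointly: 20 * 20 * 20 combinations.
print(joint_space_size([20, 20, 20]))  # 8000
```

With a fixed number of options per system, the joint space grows as a power of the number of systems, which is why exhaustively benchmarking every combination quickly becomes impractical.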

We propose Vanir, an optimization framework designed to operate in an ecosystem of multiple distributed systems forming an analytics cluster. To deal with this large search space, Vanir quickly finds a good enough configuration and then attempts to optimize it further during production runs. This is achieved by combining a series of techniques in a novel way, namely a metrics-based optimizer for the benchmarking runs, and a Mondrian forest-based performance model and transfer learning during production runs. Our results show that Vanir can find deployments that perform comparably to the ones found by state-of-the-art single-system cloud configuration optimizers while using 2× fewer benchmarking runs. This leads to an overall search cost that is 1.3× to 24× lower than the state of the art. Additionally, when transfer learning can be used, Vanir can reduce the number of benchmarking runs even further and use online optimization to achieve performance comparable to the deployments found by today's single-system frameworks.
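
The two-phase idea (a short benchmarking phase, then continued refinement during production runs) can be sketched roughly as follows. This is not Vanir's algorithm: random sampling stands in for the metrics-based optimizer, a table of running means stands in for the Mondrian forest model, transfer learning is omitted, and occasional exploration of untried configurations is an assumed strategy. All names and parameters (`two_phase_search`, `run_job`, `explore_prob`) are hypothetical.

```python
import random


def two_phase_search(configs, run_job, benchmark_budget, production_runs,
                     explore_prob=0.2, seed=0):
    """Return (config chosen after benchmarking, config chosen after production)."""
    rng = random.Random(seed)
    observed = {}  # config -> list of measured runtimes

    def record(cfg):
        observed.setdefault(cfg, []).append(run_job(cfg))

    def best_known():
        # Stand-in "model": pick the config with the lowest mean observed runtime.
        return min(observed, key=lambda c: sum(observed[c]) / len(observed[c]))

    # Phase 1: a small number of benchmarking runs (random sampling here).
    for cfg in rng.sample(configs, benchmark_budget):
        record(cfg)
    initial = best_known()

    # Phase 2: production runs keep feeding measurements back into the table.
    for _ in range(production_runs):
        untried = [c for c in configs if c not in observed]
        if untried and rng.random() < explore_prob:
            record(rng.choice(untried))   # occasionally try something new
        else:
            record(best_known())          # otherwise run the best known config
    return initial, best_known()


# Toy usage: configurations are integers, runtime is lowest at config 7.
def toy_runtime(cfg):
    return abs(cfg - 7) + 1


first, final = two_phase_search(list(range(10)), toy_runtime,
                                benchmark_budget=3, production_runs=30)
```

With a deterministic runtime function such as the toy one above, the configuration selected after the production phase is never slower than the one selected after benchmarking, because the second choice is made over a superset of the observations.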

Supplementary Material

MP4 File (p208-bilal-presentation.mp4), 224.37 MB

Index Terms

Finding the right cloud configuration for analytics clusters
1. Computer systems organization
  1. Architectures
    1. Distributed architectures
      1. Cloud computing
2. Social and professional topics
  1. Professional topics
    1. Management of computing and information systems

Published In

SoCC '20: Proceedings of the 11th ACM Symposium on Cloud Computing

October 2020

535 pages

ISBN: 9781450381376

DOI: 10.1145/3419111

General Chair: Rodrigo Fonseca (Microsoft and Brown University)

Program Chairs: Christina Delimitrou (Cornell University), Beng Chin Ooi (National University of Singapore)

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Fundação para a Ciência e a Tecnologia

Conference

SoCC '20

SoCC '20: ACM Symposium on Cloud Computing

October 19 - 21, 2020

Virtual Event, USA

Acceptance Rates

SoCC '20 Paper Acceptance Rate 35 of 143 submissions, 24%;

Overall Acceptance Rate 169 of 722 submissions, 23%

Article Metrics

Total Citations: 24
Total Downloads: 469
Downloads (Last 12 months): 39
Downloads (Last 6 weeks): 4

Reflects downloads up to 18 Feb 2025

Cited By

Dou H, Wang Y, Zhang Y, Chen P, Zheng Z (2024). DeepCAT⁺: A Low-Cost and Transferrable Online Configuration Auto-Tuning Approach for Big Data Frameworks. IEEE Transactions on Parallel and Distributed Systems 35:11 (2114-2131). DOI: 10.1109/TPDS.2024.3459889. Online publication date: 1-Nov-2024.
https://dl.acm.org/doi/10.1109/TPDS.2024.3459889
Dou H, Zhu S, Zhou Y, Zhang Y, Mei J, Wu Y, Dai J (2024). COTuner: Joint Optimization of Resource Configuration and Software Parameters for Recurring Streaming Jobs on the Cloud. 2024 IEEE 24th International Symposium on Cluster, Cloud and Internet Computing (CCGrid) (87-96). DOI: 10.1109/CCGrid59990.2024.00019. Online publication date: 6-May-2024.
https://doi.org/10.1109/CCGrid59990.2024.00019
Scheinert D, Guttenberger A, Will J, Kao O (2024). Challenges and Future Directions in Similarity Assessment of Big Data Analytics Workloads. 2024 IEEE International Conference on Big Data (BigData) (3774-3779). DOI: 10.1109/BigData62323.2024.10825426. Online publication date: 15-Dec-2024.
https://doi.org/10.1109/BigData62323.2024.10825426
Marino J, Risso F, Bighi M (2023). Dynamic Optimization of Provider-Based Scheduling for HPC Workloads. 2023 International Conference on Software, Telecommunications and Computer Networks (SoftCOM) (1-6). DOI: 10.23919/SoftCOM58365.2023.10271608. Online publication date: 21-Sep-2023.
https://doi.org/10.23919/SoftCOM58365.2023.10271608
Mohapatra A, Oh K (2023). Smartpick. Proceedings of the 24th International Middleware Conference (29-42). DOI: 10.1145/3590140.3592850. Online publication date: 27-Nov-2023.
https://dl.acm.org/doi/10.1145/3590140.3592850
Park H, Ganger G, Amvrosiadis G, Gilad Y, Kostic D, Moatti Y, Biran O (2023). Mimir: Finding Cost-efficient Storage Configurations in the Public Cloud. Proceedings of the 16th ACM International Conference on Systems and Storage (22-34). DOI: 10.1145/3579370.3594776. Online publication date: 5-Jun-2023.
https://dl.acm.org/doi/10.1145/3579370.3594776
Bilal M, Canini M, Fonseca R, Rodrigues R, Fedorova A, Narayanan D, Di Luna G, Querzoni L (2023). With Great Freedom Comes Great Opportunity: Rethinking Resource Allocation for Serverless Functions. Proceedings of the Eighteenth European Conference on Computer Systems (381-397). DOI: 10.1145/3552326.3567506. Online publication date: 8-May-2023.
https://dl.acm.org/doi/10.1145/3552326.3567506
Li Y, Lin Y, Wang Y, Ye K, Xu C (2023). Serverless Computing: State-of-the-Art, Challenges and Opportunities. IEEE Transactions on Services Computing 16:2 (1522-1539). DOI: 10.1109/TSC.2022.3166553. Online publication date: 1-Mar-2023.
https://doi.org/10.1109/TSC.2022.3166553
Scheinert D, Wiesner P, Wittkopp T, Thamsen L, Will J, Kao O (2023). Karasu: A Collaborative Approach to Efficient Cluster Configuration for Big Data Analytics. 2023 IEEE International Performance, Computing, and Communications Conference (IPCCC) (403-412). DOI: 10.1109/IPCCC59175.2023.10253884. Online publication date: 17-Nov-2023.
https://doi.org/10.1109/IPCCC59175.2023.10253884
Nassereldine A, Diab S, Baydoun M, Leach K, Alt M, Milojicic D, El Hajj I (2023). Predicting the Performance-Cost Trade-off of Applications Across Multiple Systems. 2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid) (216-228). DOI: 10.1109/CCGrid57682.2023.00029. Online publication date: May-2023.
https://doi.org/10.1109/CCGrid57682.2023.00029