DOI: 10.1145/3419111.3421305
research-article

Finding the right cloud configuration for analytics clusters

Published: 12 October 2020

Abstract

Finding good cloud configurations for deploying a single distributed system is already a challenging task, and it becomes substantially harder when a data analytics cluster is formed by multiple distributed systems since the search space becomes exponentially larger. In particular, recent proposals for single system deployments rely on benchmarking runs that become prohibitively expensive as we shift to joint optimization of multiple systems, as users have to wait until the end of a long optimization run to start the production run of their job.
We propose Vanir, an optimization framework designed to operate in an ecosystem of multiple distributed systems forming an analytics cluster. To deal with this large search space, Vanir takes the approach of quickly finding a good enough configuration and then attempts to further optimize the configuration during production runs. This is achieved by combining a series of techniques in a novel way, namely a metrics-based optimizer for the benchmarking runs, and a Mondrian forest-based performance model and transfer learning during production runs. Our results show that Vanir can find deployments that perform comparably to the ones found by state-of-the-art single-system cloud configuration optimizers while spending 2X fewer benchmarking runs. This leads to an overall search cost that is 1.3--24X lower compared to the state-of-the-art. Additionally, when transfer learning can be used, Vanir can minimize the benchmarking runs even further, and use online optimization to achieve a performance comparable to the deployments found by today's single-system frameworks.
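The two-phase idea in the abstract (stop benchmarking early at a good-enough configuration, then keep refining during production runs) can be illustrated with a minimal sketch. This is not Vanir's algorithm: the metrics-based optimizer is replaced here by random sampling, and the Mondrian forest model by a running mean of observed costs with occasional exploration of nearby configurations. The cost function `run_job`, the configuration grid `CONFIGS`, and every parameter name are hypothetical stand-ins chosen only for illustration.

```python
import random
from statistics import mean


def run_job(config, rng):
    """Hypothetical stand-in for one job run; returns a noisy cost."""
    vm_count, cores = config
    runtime = 100.0 / (vm_count * cores) + 2.0 * vm_count
    return runtime * rng.uniform(0.95, 1.05)


def neighbours(config, configs):
    """Configurations that share one dimension with `config`."""
    return [c for c in configs
            if c != config and (c[0] == config[0] or c[1] == config[1])]


def best_so_far(observed):
    """Configuration with the lowest mean observed cost."""
    return min(observed, key=lambda c: mean(observed[c]))


def two_phase_search(configs, n_benchmark, n_production, explore=0.3, seed=0):
    rng = random.Random(seed)
    observed = {}  # config -> list of measured costs

    # Phase 1: a small number of dedicated benchmarking runs.
    for cfg in rng.sample(configs, n_benchmark):
        observed.setdefault(cfg, []).append(run_job(cfg, rng))
    benchmark_best = best_so_far(observed)

    # Phase 2: production runs double as measurements. Mostly reuse the
    # current best, but sometimes try a neighbouring configuration.
    best = benchmark_best
    for _ in range(n_production):
        candidate = best
        if rng.random() < explore:
            candidate = rng.choice(neighbours(best, configs))
        observed.setdefault(candidate, []).append(run_job(candidate, rng))
        best = best_so_far(observed)
    return benchmark_best, best, observed


# Assumed grid of (VM count, cores per VM) pairs.
CONFIGS = [(vms, cores) for vms in (2, 4, 8, 16) for cores in (2, 4, 8)]
start, final, history = two_phase_search(CONFIGS, n_benchmark=3, n_production=20)
print("after benchmarking:", start, "after production runs:", final)
```

In this sketch the configuration chosen after the production phase never has a higher mean observed cost than the one chosen after benchmarking, because both remain in the same set of observations. Whether the extra exploration pays off in practice depends on the workload and the cost of a poor production run.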

Supplementary Material

MP4 File (p208-bilal-presentation.mp4)




Published In

SoCC '20: Proceedings of the 11th ACM Symposium on Cloud Computing
October 2020
535 pages
ISBN:9781450381376
DOI:10.1145/3419111
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Funding Sources

  • Fundação para a Ciência e a Tecnologia

Conference

SoCC '20: ACM Symposium on Cloud Computing
October 19 - 21, 2020
Virtual Event, USA

Acceptance Rates

SoCC '20 Paper Acceptance Rate: 35 of 143 submissions, 24%
Overall Acceptance Rate: 169 of 722 submissions, 23%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 39
  • Downloads (last 6 weeks): 4
Reflects downloads up to 18 Feb 2025.

Cited By

  • (2024) DeepCAT+: A Low-Cost and Transferrable Online Configuration Auto-Tuning Approach for Big Data Frameworks. IEEE Transactions on Parallel and Distributed Systems 35:11 (2114-2131). DOI: 10.1109/TPDS.2024.3459889. Online publication date: 1-Nov-2024
  • (2024) COTuner: Joint Optimization of Resource Configuration and Software Parameters for Recurring Streaming Jobs on the Cloud. 2024 IEEE 24th International Symposium on Cluster, Cloud and Internet Computing (CCGrid) (87-96). DOI: 10.1109/CCGrid59990.2024.00019. Online publication date: 6-May-2024
  • (2024) Challenges and Future Directions in Similarity Assessment of Big Data Analytics Workloads. 2024 IEEE International Conference on Big Data (BigData) (3774-3779). DOI: 10.1109/BigData62323.2024.10825426. Online publication date: 15-Dec-2024
  • (2023) Dynamic Optimization of Provider-Based Scheduling for HPC Workloads. 2023 International Conference on Software, Telecommunications and Computer Networks (SoftCOM) (1-6). DOI: 10.23919/SoftCOM58365.2023.10271608. Online publication date: 21-Sep-2023
  • (2023) Smartpick. Proceedings of the 24th International Middleware Conference (29-42). DOI: 10.1145/3590140.3592850. Online publication date: 27-Nov-2023
  • (2023) Mimir: Finding Cost-efficient Storage Configurations in the Public Cloud. Proceedings of the 16th ACM International Conference on Systems and Storage (22-34). DOI: 10.1145/3579370.3594776. Online publication date: 5-Jun-2023
  • (2023) With Great Freedom Comes Great Opportunity: Rethinking Resource Allocation for Serverless Functions. Proceedings of the Eighteenth European Conference on Computer Systems (381-397). DOI: 10.1145/3552326.3567506. Online publication date: 8-May-2023
  • (2023) Serverless Computing: State-of-the-Art, Challenges and Opportunities. IEEE Transactions on Services Computing 16:2 (1522-1539). DOI: 10.1109/TSC.2022.3166553. Online publication date: 1-Mar-2023
  • (2023) Karasu: A Collaborative Approach to Efficient Cluster Configuration for Big Data Analytics. 2023 IEEE International Performance, Computing, and Communications Conference (IPCCC) (403-412). DOI: 10.1109/IPCCC59175.2023.10253884. Online publication date: 17-Nov-2023
  • (2023) Predicting the Performance-Cost Trade-off of Applications Across Multiple Systems. 2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid) (216-228). DOI: 10.1109/CCGrid57682.2023.00029. Online publication date: May-2023
