Benchmarking and Performance Modelling of Dataflow with Cycles

Authors:

Sheriffo Ceesay,

Adam Barker

BDCAT '21: Proceedings of the 2021 IEEE/ACM 8th International Conference on Big Data Computing, Applications and Technologies

Pages 91 - 100

https://doi.org/10.1145/3492324.3494159

Published: 13 January 2022

Abstract

Over the years, the popularity of iterative data-intensive applications, such as machine learning applications, has grown immensely. Unlike batch applications, iterative applications such as k-means, regression, or classification algorithms require multiple passes over the input data to train a model until it converges. In the context of big data, these applications are executed on distributed computing frameworks such as Apache Spark. These frameworks are simple to deploy and use; under the hood, however, they are complex and highly configurable. An exhaustive study of the impact of their many configuration parameters on application performance would be cumbersome, because the number of parameter combinations grows exponentially.

In this paper, we group applications based on a common dataflow and communication pattern. We then present a multi-objective performance prediction framework to model the performance of these applications. The models predict the execution time of a given application with high accuracy, and the framework can be used to infer optimal configuration parameters that meet application execution deadlines. Based on these optimal configuration values, we recommend the most cost-effective EC2 instances. On average, the predicted execution time is within ±14% of the measured value.
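As an illustration of the deadline-driven recommendation step described above, the following minimal sketch picks the cheapest candidate configuration whose predicted run time meets a deadline. It is not the paper's model: `toy_predict_seconds` is a hypothetical stand-in for a trained predictor, and the instance labels, core counts, and hourly prices are made-up values rather than real EC2 data.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Config:
    instance: str          # hypothetical instance label, not a real EC2 type
    nodes: int             # number of worker nodes
    cores_per_node: int    # cores available on each node
    price_per_hour: float  # made-up hourly price per node, in dollars


def toy_predict_seconds(cfg: Config, iterations: int) -> float:
    """Stand-in predictor (assumption): fixed start-up overhead plus
    per-iteration work that shrinks as the total core count grows."""
    total_cores = cfg.nodes * cfg.cores_per_node
    return 30.0 + iterations * (4.0 + 960.0 / total_cores)


def cheapest_within_deadline(
    configs: Iterable[Config],
    predict: Callable[[Config, int], float],
    iterations: int,
    deadline_s: float,
) -> Optional[Config]:
    """Return the lowest-cost config predicted to finish by the deadline,
    or None when no candidate is predicted to meet it."""
    best: Optional[Config] = None
    best_cost = float("inf")
    for cfg in configs:
        seconds = predict(cfg, iterations)
        if seconds > deadline_s:
            continue  # predicted to miss the deadline
        # Cost of the run = nodes * hourly price * predicted hours.
        cost = cfg.nodes * cfg.price_per_hour * seconds / 3600.0
        if cost < best_cost:
            best, best_cost = cfg, cost
    return best


# Illustrative candidates only; all values are invented for this sketch.
CANDIDATES = [
    Config("small", nodes=4, cores_per_node=2, price_per_hour=0.10),
    Config("medium", nodes=4, cores_per_node=4, price_per_hour=0.20),
    Config("large", nodes=4, cores_per_node=8, price_per_hour=0.40),
]

if __name__ == "__main__":
    choice = cheapest_within_deadline(CANDIDATES, toy_predict_seconds, 10, 900.0)
    print(choice.instance if choice else "no configuration meets the deadline")
```

In a real setting, the stand-in predictor would be replaced by a model fitted to benchmark measurements, and a safety margin on the deadline could account for the prediction error.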

Published In

BDCAT '21: Proceedings of the 2021 IEEE/ACM 8th International Conference on Big Data Computing, Applications and Technologies

December 2021

133 pages

ISBN:9781450391641

DOI:10.1145/3492324

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

BDCAT '21

Sponsor:

SIGARCH

BDCAT '21: 2021 IEEE/ACM 8th International Conference on Big Data Computing, Applications and Technologies

December 6 - 9, 2021

Leicester, United Kingdom

Acceptance Rates

Overall Acceptance Rate 27 of 93 submissions, 29%
