DOI: 10.1145/3492324.3494159
research-article

Benchmarking and Performance Modelling of Dataflow with Cycles

Published: 13 January 2022

Abstract

Iterative data-intensive applications, such as machine learning applications, have grown immensely in popularity over the years. Unlike batch applications, iterative applications such as k-means, regression, or classification algorithms require multiple passes over the input data before the model converges. In the context of big data, these applications are executed on distributed computing frameworks such as Apache Spark. These frameworks are simple to deploy and use; under the hood, however, they are complex and highly configurable. An exhaustive study of how their many configuration parameters affect application performance would be cumbersome, because the number of parameter combinations grows exponentially.
In this paper, we group applications based on a common dataflow and communication pattern. We then present a multi-objective performance prediction framework to model the performance of these applications. The models predict the execution time of a given application with high accuracy. The framework can be used to infer optimal configuration parameters that meet application execution deadlines. Based on these optimal configuration values, we recommend the best EC2 instances in terms of cost. On average, the predicted values are within ±14% of the measured values.
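The selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' framework: the `Candidate` type, the `predict_seconds` callback, the instance names, and the prices are all hypothetical placeholders, and the toy model in the usage note stands in for whatever trained predictor is available. The sketch shows only the general idea of filtering candidate configurations by a predicted execution time against a deadline and then choosing the cheapest, plus one common way to compute an average percentage error between measured and predicted times.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    """A hypothetical cluster configuration (names and prices are made up)."""
    instance_type: str
    nodes: int
    price_per_node_hour: float  # illustrative figure, not a real EC2 price


def mean_abs_pct_error(measured: Sequence[float],
                       predicted: Sequence[float]) -> float:
    """Average of |predicted - measured| / measured, expressed in percent."""
    pairs = list(zip(measured, predicted))
    return 100.0 * sum(abs(p - m) / m for m, p in pairs) / len(pairs)


def cheapest_within_deadline(
    candidates: Iterable[Candidate],
    predict_seconds: Callable[[Candidate], float],
    deadline_seconds: float,
) -> Tuple[Optional[Candidate], float]:
    """Return the lowest-cost candidate whose predicted time meets the deadline.

    Returns (None, inf) when no candidate is predicted to finish in time.
    """
    best: Optional[Candidate] = None
    best_cost = float("inf")
    for cand in candidates:
        seconds = predict_seconds(cand)
        if seconds > deadline_seconds:
            continue  # predicted to miss the deadline
        # Cost = nodes * hourly price * predicted runtime in hours.
        cost = cand.nodes * cand.price_per_node_hour * seconds / 3600.0
        if cost < best_cost:
            best, best_cost = cand, cost
    return best, best_cost


def toy_model(cand: Candidate) -> float:
    """Stand-in predictor (assumption): perfect scaling plus fixed overhead."""
    return 7200.0 / cand.nodes + 120.0


CANDIDATES = [
    Candidate("small", 2, 0.10),
    Candidate("small", 8, 0.10),
    Candidate("large", 4, 0.40),
]
```

For example, `cheapest_within_deadline(CANDIDATES, toy_model, 2000.0)` selects the eight-node "small" configuration under this toy model, since the two-node one is predicted to miss the deadline and the "large" one costs more. In practice `toy_model` would be replaced by a trained regression model.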

Published In

BDCAT '21: Proceedings of the 2021 IEEE/ACM 8th International Conference on Big Data Computing, Applications and Technologies
December 2021, 133 pages
ISBN: 9781450391641
DOI: 10.1145/3492324
Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. Big Data
2. Communication Patterns
3. Dataflow With Cycles
4. Machine Learning
5. Modelling

Qualifiers

• Research-article
• Research
• Refereed limited

Conference

BDCAT '21

Acceptance Rates

Overall acceptance rate: 27 of 93 submissions (29%)
