Analyzing Performance of Apache Spark MLlib with Multinode Clusters on Azure HDInsight: Spark-Perf Case Study

  • Conference paper
Lecture Notes in Computational Intelligence and Decision Making (ISDMCI 2020)

Abstract

In this paper, we present an experimental analysis of the performance of machine learning workloads run with the Apache Spark MLlib library on Microsoft Azure HDInsight, using the Spark-perf benchmark suite. To this end, we developed software and a supporting methodology based on the monitoring tools SparkMeasure and Ambari. We propose metrics for analyzing the performance of Apache Spark computations; these metrics use statistical characteristics of the training and testing processes observed while the Spark-perf benchmark tests run. We also propose formulas for setting Apache Spark parameters. Compared with the standard parameter values, these settings reduce the execution time of sets of machine learning test tasks on both heterogeneous and homogeneous Apache Spark cluster configurations in Azure HDInsight. To assess the computing performance of the machine learning methods in Spark-perf, we propose a metric calculated as the ratio of the average testing time to the average training time. The results of computational experiments confirm the effectiveness of the proposed algorithms for choosing Apache Spark settings, relative to the standard values, for heterogeneous and homogeneous clusters deployed on Azure HDInsight when running the machine learning methods of the Spark-perf test set.

Author information

Corresponding author

Correspondence to Sergii Minukhin.

Copyright information

© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Minukhin, S., Brynza, N., Sitnikov, D. (2021). Analyzing Performance of Apache Spark MLlib with Multinode Clusters on Azure HDInsight: Spark-Perf Case Study. In: Babichev, S., Lytvynenko, V., Wójcik, W., Vyshemyrskaya, S. (eds) Lecture Notes in Computational Intelligence and Decision Making. ISDMCI 2020. Advances in Intelligent Systems and Computing, vol 1246. Springer, Cham. https://doi.org/10.1007/978-3-030-54215-3_8
