ABSTRACT
Distributed machine learning (DML) environments are widely used in many application domains to build decision-making systems. However, the complexity of these environments is overwhelming for novice users. On the one hand, data scientists are more familiar with hyper-parameter tuning and typically lack an understanding of the trade-offs and challenges of parameterizing DML platforms to achieve good performance. On the other hand, system administrators focus on tuning distributed platforms, unaware of the possible implications of the platform on the quality of the learning models. To shed light on such parameter configuration interplay, we run multiple DML workloads on the widely used Apache Spark distributed platform, leveraging 13 popular learning methods and 6 real-world datasets on two distinct clusters. We collect and perform an in-depth analysis of workload execution traces to compare the efficiency of different configuration strategies. We consider tuning only hyper-parameters, tuning only platform parameters, and jointly tuning both hyper-parameters and platform parameters. We publicly release our collected traces and derive key takeaways on DML workloads. Counter-intuitively, platform parameters have a higher impact on the model quality than hyper-parameters. More generally, we show that multi-level parameter configuration can provide better results in terms of model quality and execution time while also optimizing resource costs.
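The three configuration strategies compared above can be pictured as three search spaces of different sizes. The sketch below is illustrative only: it assumes an exhaustive grid over each space, which the abstract does not state, and the value ranges and defaults are hypothetical. The names `maxIter` and `regParam` are common Spark MLlib hyper-parameters, and `spark.executor.cores` and `spark.executor.memory` are standard Spark configuration properties; none of these choices is taken from the paper.

```python
from itertools import product

# Hypothetical search spaces; the values are illustrative, not the paper's.
HYPER_PARAMS = {"maxIter": [10, 50], "regParam": [0.0, 0.1]}
PLATFORM_PARAMS = {
    "spark.executor.cores": [2, 4],
    "spark.executor.memory": ["4g", "8g"],
}
# Assumed defaults for the level that a strategy leaves untuned.
HYPER_DEFAULTS = {"maxIter": 10, "regParam": 0.0}
PLATFORM_DEFAULTS = {"spark.executor.cores": 2, "spark.executor.memory": "4g"}


def grid(space):
    """Return every combination of values in a {name: [values]} space."""
    names = list(space)
    return [dict(zip(names, combo)) for combo in product(*space.values())]


def candidates(strategy):
    """List the (hyper, platform) configurations a strategy would explore."""
    if strategy == "hyper_only":
        return [(h, dict(PLATFORM_DEFAULTS)) for h in grid(HYPER_PARAMS)]
    if strategy == "platform_only":
        return [(dict(HYPER_DEFAULTS), p) for p in grid(PLATFORM_PARAMS)]
    if strategy == "joint":
        return [(h, p) for h in grid(HYPER_PARAMS) for p in grid(PLATFORM_PARAMS)]
    raise ValueError(f"unknown strategy: {strategy}")


# Print how many configurations each strategy enumerates.
for name in ("hyper_only", "platform_only", "joint"):
    print(name, len(candidates(name)))
```

Under these assumptions the joint space is the product of the two single-level spaces, so it grows multiplicatively. This is one reason the cost of multi-level tuning has to be weighed against the gains in model quality and execution time.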
Index Terms
- Characterizing Distributed Machine Learning Workloads on Apache Spark: (Experimentation and Deployment Paper)