research-article

Characterizing Distributed Machine Learning Workloads on Apache Spark: (Experimentation and Deployment Paper)

Authors:
Yasmine Djebrouni, University of Grenoble Alps, France (ORCID: 0009-0007-4391-8021)
Isabelly Rocha, University of Neuchâtel, Switzerland (ORCID: 0000-0001-7636-9320)
Sara Bouchenak, INSA Lyon, France (ORCID: 0000-0003-0558-0123)
Lydia Chen, University of Neuchâtel, Switzerland, and TU Delft, Netherlands (ORCID: 0000-0002-4228-6735)
Pascal Felber, University of Neuchâtel, Switzerland (ORCID: 0000-0003-1574-6721)
Vania Marangozova, University of Grenoble Alps, France (ORCID: 0000-0002-7042-0161)
Valerio Schiavoni, University of Neuchâtel, Switzerland (ORCID: 0000-0003-1493-6603)

Middleware '23: Proceedings of the 24th International Middleware Conference, November 2023, Pages 151–164. https://doi.org/10.1145/3590140.3629112

Published: 27 November 2023

ABSTRACT

Distributed machine learning (DML) environments are widely used in many application domains to build decision-making systems. However, the complexity of these environments is overwhelming for novice users. On the one hand, data scientists are more familiar with hyper-parameter tuning and typically lack an understanding of the trade-offs and challenges of parameterizing DML platforms to achieve good performance. On the other hand, system administrators focus on tuning distributed platforms, unaware of the possible implications of the platform on the quality of the learning models. To shed light on such parameter configuration interplay, we run multiple DML workloads on the widely used Apache Spark distributed platform, leveraging 13 popular learning methods and 6 real-world datasets on two distinct clusters. We collect and perform an in-depth analysis of workload execution traces to compare the efficiency of different configuration strategies. We consider tuning only hyper-parameters, tuning only platform parameters, and jointly tuning both hyper-parameters and platform parameters. We publicly release our collected traces and derive key takeaways on DML workloads. Counter-intuitively, platform parameters have a higher impact on the model quality than hyper-parameters. More generally, we show that multi-level parameter configuration can provide better results in terms of model quality and execution time while also optimizing resource costs.
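To make the three configuration strategies concrete, the following is a minimal sketch, not the authors' tooling, of how their search spaces differ. The hyper-parameter names (`maxIter`, `regParam`), all candidate and default values, and the helpers `grid` and `search_space` are illustrative assumptions; `spark.executor.cores`, `spark.executor.memory`, and `spark.default.parallelism` are standard Spark configuration properties, but the values listed for them here are hypothetical.

```python
from itertools import product

# Hypothetical candidate values; the grids used in the paper are not given here.
HYPER_PARAMS = {
    "maxIter": [10, 50, 100],
    "regParam": [0.0, 0.1],
}
PLATFORM_PARAMS = {
    "spark.executor.cores": [2, 4],
    "spark.executor.memory": ["4g", "8g"],
    "spark.default.parallelism": [16, 64],
}

# Assumed fixed values for whichever level is NOT being tuned.
HYPER_DEFAULTS = {"maxIter": 100, "regParam": 0.0}
PLATFORM_DEFAULTS = {
    "spark.executor.cores": 1,
    "spark.executor.memory": "1g",
    "spark.default.parallelism": 8,
}


def grid(space):
    """Enumerate every combination of the candidate values in `space`."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*space.values())]


def search_space(strategy):
    """Return the (hyper, platform) configuration pairs for one strategy."""
    if strategy == "hyper_only":
        return [(h, PLATFORM_DEFAULTS) for h in grid(HYPER_PARAMS)]
    if strategy == "platform_only":
        return [(HYPER_DEFAULTS, p) for p in grid(PLATFORM_PARAMS)]
    if strategy == "joint":
        return [(h, p) for h in grid(HYPER_PARAMS) for p in grid(PLATFORM_PARAMS)]
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    for name in ("hyper_only", "platform_only", "joint"):
        print(name, len(search_space(name)))
```

Under these assumed grids, the joint strategy enumerates 6 × 8 = 48 configurations, against 6 and 8 for the two single-level strategies. The multiplicative growth illustrates why the resource cost of joint tuning has to be weighed against its benefits.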

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Distributed artificial intelligence
2. General and reference
  1. Cross-computing tools and techniques
    1. Experimentation
    2. Measurement

Published in

Middleware '23: Proceedings of the 24th International Middleware Conference
November 2023
334 pages
ISBN:9798400701771
DOI:10.1145/3590140

Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Distributed Deep Learning
Distributed Machine Learning
Multi-level Configuration
Trace Collection
Workload Characterization
Qualifiers
- research-article
- Research
- Refereed limited

Acceptance Rates
Overall Acceptance Rate: 203 of 948 submissions, 21%