Screening hardware and volume factors in distributed machine learning algorithms on spark

A design of experiments (DoE) based approach

  • Regular Paper
  • Published in: Computing

Abstract

This paper presents an approach to investigate distributed machine learning workloads on Spark. The work analyzes hardware and data volume factors with respect to time and cost performance when applying machine learning (ML) techniques in big data scenarios. The method is based on the Design of Experiments (DoE) approach and applies a randomized two-level fractional factorial design with replications to screen the most relevant factors. A Web corpus was built from 16 million webpages from Portuguese-speaking countries. The application was binary text classification to distinguish Brazilian Portuguese from other variants of the language. Five machine learning algorithms were examined: Logistic Regression, Random Forest, Support Vector Machines, Naïve Bayes, and Multilayer Perceptron. The data was processed using real clusters having up to 28 nodes, each composed of 12 or 32 cores, 1 or 7 SSD disks, and 3x or 6x RAM per core, totaling a maximum computational power of 896 cores and 5.25 TB of RAM. Linear models were applied to identify, analyze, and rank the influence of the factors. A total of 240 experiments were carefully organized to maximize the detection of non-confounded effects up to the second order while minimizing the experimental effort. Our results include linear models to estimate time and cost performance, statistical inferences about effects, and a visualization tool based on parallel coordinates to aid decision making about cluster configuration.

Notes

  1. An experimental unit refers to the complete execution of the ML technique from start to finish, encompassing training, testing, and evaluation tasks.

  2. Clusters were created solely to process the workload and were destroyed after execution to minimize interference.

  3. If \(\lambda \approx -1 \rightarrow y^* = 1/y\); \(\lambda \approx -0.5 \rightarrow y^* = \sqrt{1/y}\); \(\lambda \approx 0 \rightarrow y^* = \log(y)\); \(\lambda \approx 0.5 \rightarrow y^* = \sqrt{y}\); and \(\lambda \approx 1 \rightarrow\) no transformation.

  4. Based on the R language and the Plotly library [62] for web-based data visualization.
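
The transformation ladder in note 3 can be selected automatically by maximizing the Box-Cox profile log-likelihood over candidate values of \(\lambda\). A minimal pure-Python sketch, with illustrative data (in practice a library routine such as scipy.stats.boxcox does this over a continuous range):

```python
from math import log

def boxcox_loglik(y, lam):
    """Profile log-likelihood of the Box-Cox transform for a fixed lambda:
    -(n/2) ln(sigma^2 of transformed data) + (lambda - 1) sum(ln y)."""
    n = len(y)
    if abs(lam) < 1e-12:
        z = [log(v) for v in y]                 # lambda = 0 -> log transform
    else:
        z = [(v ** lam - 1.0) / lam for v in y]  # standard Box-Cox form
    mean = sum(z) / n
    var = sum((v - mean) ** 2 for v in z) / n    # MLE (population) variance
    return -0.5 * n * log(var) + (lam - 1.0) * sum(log(v) for v in y)

# Pick the best lambda from the ladder listed in note 3.
ladder = [-1.0, -0.5, 0.0, 0.5, 1.0]
y = [2.71828, 4.48169, 7.38906, 12.18249, 20.08554]  # roughly exp(1)..exp(3)
best = max(ladder, key=lambda lam: boxcox_loglik(y, lam))
```

For data that is exponential in an underlying linear scale, as here, the log transform (\(\lambda \approx 0\)) wins, matching the ladder's reading.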
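
The core of the parallel-coordinates tool in note 4 is a per-axis min-max normalization, so every cluster configuration becomes one polyline across shared [0, 1] vertical axes. A minimal sketch with hypothetical configurations (the factor values are illustrative, not the paper's data; the paper's tool renders these lines with Plotly):

```python
def parallel_coordinates(rows, columns):
    """Min-max normalize each column to [0, 1]; each row becomes the
    list of vertical positions of one polyline, one per axis."""
    lo = {c: min(r[c] for r in rows) for c in columns}
    hi = {c: max(r[c] for r in rows) for c in columns}
    return [
        [(r[c] - lo[c]) / (hi[c] - lo[c]) if hi[c] > lo[c] else 0.5
         for c in columns]
        for r in rows
    ]

# Hypothetical cluster configurations and observed runtimes.
configs = [
    {"cores": 12, "ram_x": 3, "disks": 1, "time_s": 410.0},
    {"cores": 32, "ram_x": 6, "disks": 7, "time_s": 150.0},
    {"cores": 32, "ram_x": 3, "disks": 1, "time_s": 230.0},
]
lines = parallel_coordinates(configs, ["cores", "ram_x", "disks", "time_s"])
```

Reading the plot, a decision maker can trace which factor settings lead to low positions on the time (or cost) axis, which is how the visualization supports configuration choices.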

References

  1. Tsai CW, Lai CF, Chao HC, Vasilakos AV (2015) Big data analytics: a survey. J Big Data 2(1):21. https://doi.org/10.1186/s40537-015-0030-3

  2. Pospelova M (2015) Real time autotuning for MapReduce on Hadoop/YARN. PhD thesis, Carleton University, Ottawa

  3. Piatetsky-Shapiro G (1991) Knowledge discovery in real databases: a report on the IJCAI-89 workshop. AI Magazine 11(5):68

  4. Cox M, Ellsworth D (1997) Managing big data for scientific visualization. ACM Siggraph 97:146–162

  5. Luvizan S, Meirelles F, Diniz EH (2014) Big Data: publication evolution and research opportunities. In: Anais da 11a Conferência Internacional sobre Sistemas de Informação e Gestão de Tecnologia, São Paulo, SP

  6. Miller H (2013) Big-data in cloud computing: a taxonomy of risks. Information Research 18(1), paper 571. Available at http://InformationR.net/ir/18-1/paper571.html

  7. Ghemawat S, Gobioff H, Leung ST (2003) The Google file system. In: Proceedings of the nineteenth ACM symposium on operating systems principles, pp. 29–43

  8. Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation (OSDI'04), vol 6. USENIX Association, pp 10–10

  9. Zaharia M, Chowdhury M, Das T, Dave A, et al (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. Technical Report UCB/EECS-2011-82, EECS Department, University of California, Berkeley. https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf

  10. Li K, Deolalikar V, Pradhan N (2015) Big data gathering and mining pipelines for CRM using open-source. In: 2015 IEEE International Conference on Big Data (Big Data), IEEE, pp 2936–2938

  11. Dharsandiya AN, Patel MR (2016) A review on frequent itemset mining algorithms in social network data. In: 2016 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET), IEEE, pp 1046–1048

  12. Maillo J, Ramírez S, Triguero I, Herrera F (2017) kNN-IS: an iterative Spark-based design of the k-nearest neighbors classifier for big data. Knowl-Based Syst 117:3

  13. Poggi N, Berral JL, Carrera D, Call A, Gagliardi F, Reinauer R, Vujic N, Green D, Blakeley J (2015) From performance profiling to predictive analytics while evaluating hadoop cost-efficiency in aloja. In: IEEE International Conference on Big Data (IEEE, 2015), pp. 1220–1229. https://doi.org/10.1109/BigData.2015.7363876. http://ieeexplore.ieee.org/document/7363876/

  14. Baldacci L, Golfarelli M (2018) A cost model for Spark SQL. IEEE Trans Knowl Data Eng 31(5):819

  15. Munir RF, Abelló A (2019) Automatically configuring parallelism for hybrid layouts. European conference on advances in databases and information systems. Springer, Cham, pp 120–125

  16. Borthakur D (2007) The hadoop distributed file system: architecture and design. Hadoop Project Website 11(2007):21

  17. Iqbal MH, Soomro TR (2015) Big data analysis: apache storm perspective. Int J Comput Trends Technol 19(1):9–14

  18. Fischer L, Gao S, Bernstein A (2015) Machines tuning machines: Configuring distributed stream processors with bayesian optimization. In: 2015 IEEE International conference on cluster computing (IEEE), pp. 22–31. https://doi.org/10.1109/CLUSTER.2015.13

  19. Ruan J, Zheng Q, Dong B (2015) Optimal resource provisioning approach based on cost modeling for Spark applications in public clouds. In: Proceedings of the Doctoral Symposium of the 16th International Middleware Conference. ACM Press, New York, pp 1–4. https://doi.org/10.1145/2843966.2843972

  20. Marsland S (2014) Machine learning: an algorithmic perspective. CRC Press, Boca Raton

  21. Fisher RA, Wishart J (1945) The arrangement of field experiments and the statistical reduction of the results, vol 10. HM Stationery Office

  22. Li M, Tan J, Wang Y, Zhang L, Salapura V (2015) SparkBench: a comprehensive benchmarking suite for in memory data analytic platform Spark. In: Proceedings of the 12th ACM International Conference on Computing Frontiers (CF '15). ACM Press, New York, pp 1–8. https://doi.org/10.1145/2742854.2747283

  23. Zaharia M, Konwinski A, Joseph AD, Katz R, Stoica I (2008) Improving MapReduce performance in heterogeneous environments. In: Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (USENIX Association, Berkeley, CA, USA), OSDI’08, pp. 29–42. http://dl.acm.org/citation.cfm?id=1855741.1855744

  24. Arasanal RM, Rumani DU (2013) Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). https://doi.org/10.1007/978-3-642-36071-8-8

  25. Herodotou H, Lim H, Luo G, Borisov N, Dong L, Cetin FB, Babu S (2011) Starfish: a self-tuning system for big data analytics. Cidr 11:261–272

  26. Lin X, Wang P, Wu B (2013) Log analysis in cloud computing environment with Hadoop and Spark. In: 5th IEEE International Conference on Broadband Network & Multimedia Technology, IEEE, pp 273–276. https://doi.org/10.1109/ICBNMT.2013.6823956

  27. Ardagna D, Bernardi S, Gianniti E, Aliabadi SK, Perez-Palacin D, Requeno JI (2016) Modeling performance of hadoop applications: a journey from queueing networks to stochastic well formed nets. International conference on algorithms and architectures for parallel processing. Springer, Cham, pp 599–613

  28. Sidhanta S, Golab W, Mukhopadhyay S (2016) Optex: a deadline-aware cost optimization model for spark. In: 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid) (IEEE), pp. 193–202

  29. Venkataraman S, Yang Z, Franklin M, Recht B, Stoica I (2016) Ernest: efficient performance prediction for large-scale advanced analytics. In: NSDI'16: Proceedings of the 13th USENIX Conference on Networked Systems Design and Implementation, pp 363–378

  30. Barr RS, Golden BL, Kelly JP, Resende MG, Stewart WR (1995) Designing and reporting on computational experiments with heuristic methods. J Heuristics 1(1):9–32. https://doi.org/10.1007/BF02430363

  31. Hooker J (1995) Testing heuristics: we have it all wrong. J Heuristics 1:33–42. https://doi.org/10.1007/BF02430364

  32. Wineberg M, Christensen S (2004) An introduction to statistics for EC experimental analysis. Tutorial at the IEEE Congress on Evolutionary Computation

  33. Rathod M, Suthar D, Patel H, Shelat P, Parejiya P (2019) Microemulsion based nasal spray: a systemic approach for non-CNS drug, its optimization, characterization and statistical modelling using QbD principles. J Drug Deliv Sci Technol 49:286

  34. Kuo CC, Liu HA, Chang CM (2020) Optimization of vacuum casting process parameters to enhance tensile strength of components using design of experiments approach. Int J Adv Manuf Technol 106(9):3775–3785

  35. Amin MM, Kiani A (2020) Multi-disciplinary analysis of a strip stabilizer using body-fluid-structure interaction simulation and design of experiments (DOE). J Appl Fluid Mech 13(1):261

  36. Packianather M, Drake P, Rowlands H (2000) Optimizing the parameters of multilayered feedforward neural networks through Taguchi design of experiments. Qual Reliab Eng Int 16(6):461

  37. Staelin C (2003) Parameter selection for support vector machines. Tech Rep HPL-2002-354R1, Hewlett-Packard Company

  38. Bates S, Sienz J, Toropov V (2004) Formulation of the optimal Latin hypercube design of experiments using a permutation genetic algorithm. In: 45th AIAA/ASME/ASCE/AHS/ASC structures, structural dynamics & materials conference, p. 2011

  39. Ridge E (2007) Design of experiments for the tuning of optimisation algorithms (Citeseer)

  40. Balestrassi PP, Popova E, Paiva Ad, Lima JM (2009) Design of experiments on neural network’s training for nonlinear time series forecasting. Neurocomputing 72(4–6):1160

  41. Pais MS, Peretta IS, Yamanaka K, Pinto ER (2014) Factorial design analysis applied to the performance of parallel evolutionary algorithms. J Brazil Comput Soc 20(1):6

  42. Durakovic B (2017) Design of experiments application, concepts, examples: state of the art. Period Eng Nat Sci. https://doi.org/10.21533/pen.v5i3.145

  43. Laney D (2001) 3D data management: controlling data volume, velocity and variety. META Group Res Note 6(70):1

  44. Chang WL, Grady N et al (2015) NIST big data interoperability framework: volume 1, big data definitions. Tech Rep. https://doi.org/10.6028/NIST.SP.1500-1

  45. Huai Y, Lee R, Zhang S, Xia CH, Zhang X (2011) DOT: a matrix model for analyzing, optimizing and deploying software for big data analytics in distributed systems. In: Proceedings of the 2nd ACM Symposium on Cloud Computing, ACM, p 4

  46. Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107. https://doi.org/10.1145/1327452.1327492

  47. Maitrey S, Jha CK (2015) MapReduce: simplified data analysis of big data. Proc Comput Sci 57:563–571. https://doi.org/10.1016/j.procs.2015.07.392

  48. Shi J, Qiu Y, Minhas UF, Jiao L, Wang C, Reinwald B (2015) Clash of the titans: Mapreduce vs. spark for large scale data analytics. Proc VLDB Endowment 8(13):2110–2121

  49. Landset S, Khoshgoftaar TM, Richter AN, Hasanin T (2015) A survey of open source tools for machine learning with big data in the Hadoop ecosystem. J Big Data 2(1):24

  50. Lawson J (2014) Design and analysis of experiments with R. CRC Press, Boca Raton

  51. Jain R (1990) The art of computer systems performance analysis: techniques for experimental design, measurement, simulation, and modeling. John Wiley & Sons, Hoboken

  52. Montgomery D (2017) Design and analysis of experiments. John Wiley & Sons, Hoboken

  53. Montgomery DC (2017) Design and analysis of experiments. John Wiley & Sons, Hoboken

  54. Hosmer DW Jr, Lemeshow S, Sturdivant RX (2013) Applied logistic regression, vol 398. John Wiley & Sons, Hoboken

  55. Genuer R, Poggi JM, Tuleau-Malot C, Villa-Vialaneix N (2017) Random forests for big data. Big Data Res 9:28

  56. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297

  57. Rosenblatt F (1958) The perceptron: a probabilistic model for information storage and organization in the brain. Psychol Rev 65(6):386

  58. McCallum A, Nigam K (1998) A comparison of event models for naive Bayes text classification. In: AAAI-98 Workshop on Learning for Text Categorization, vol 752, pp 41–48

  59. Rodrigues J, Vasconcelos G, Maciel P (2020) Pt7 web, an annotated Portuguese language corpus. IEEE DataPort. https://doi.org/10.21227/fhrm-n966

  60. Wenzek G, Lachaux MA, Conneau A, Chaudhary V, Guzman F, Joulin A, Grave E (2019) CCNet: extracting high quality monolingual datasets from web crawl data. arXiv preprint arXiv:1911.00359

  61. Box GE, Cox DR (1964) An analysis of transformations. J Royal Stat Soc: Series B (Methodological) 26(2):211

  62. Sievert C (2020) Interactive web-based data visualization with R, plotly, and shiny. CRC Press, Boca Raton

  63. Inselberg A (1985) The plane with parallel coordinates. Vis Comput 1(2):69

  64. Inselberg A (2009) Parallel coordinates: visual multidimensional geometry and its applications, vol 20. Springer, Berlin

Author information

Correspondence to Jairson B. Rodrigues.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Rodrigues, J.B., Vasconcelos, G.C. & Maciel, P.R.M. Screening hardware and volume factors in distributed machine learning algorithms on spark. Computing 103, 2203–2225 (2021). https://doi.org/10.1007/s00607-021-00965-3
