Abstract
This paper presents an approach to investigating distributed machine learning workloads on Spark. The work analyzes hardware and data volume factors and their impact on time and cost performance when applying machine learning (ML) techniques in big data scenarios. The method is based on the Design of Experiments (DoE) approach and applies a randomized two-level fractional factorial design with replications to screen the most relevant factors. A web corpus was built from 16 million webpages from Portuguese-speaking countries. The application is a binary text classification task that distinguishes Brazilian Portuguese from other variants of the language. Five machine learning algorithms were examined: Logistic Regression, Random Forest, Support Vector Machines, Naïve Bayes, and Multilayer Perceptron. The data was processed on real clusters with up to 28 nodes, each composed of 12 or 32 cores, 1 or 7 SSD disks, and 3 or 6 GB of RAM per core, totaling a maximum computational power of 896 cores and 5.25 TB of RAM. Linear models were applied to identify, analyze, and rank the influence of the factors. A total of 240 experiments was carefully organized to maximize the detection of non-confounded effects up to second order while minimizing experimental effort. Our results include linear models to estimate time and cost performance, statistical inferences about the effects, and a visualization tool based on parallel coordinates to aid decision making about cluster configuration.
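To make the screening workflow concrete, the sketch below shows how a randomized two-level fractional factorial design with replications could be generated and analyzed in R with the FrF2 package and a base lm() fit. The factor names, run counts, and simulated response are illustrative placeholders, not the actual factors or measurements of this study.

# Minimal sketch (not the study's actual design): generate a replicated,
# randomized two-level fractional factorial design and rank factor effects.
library(FrF2)

set.seed(42)
design <- FrF2(nruns = 16, nfactors = 5,
               factor.names = c("Nodes", "Cores", "Disks", "RAM", "Volume"),
               replications = 3,   # replicated runs to estimate pure error
               randomize = TRUE)   # randomized run order

# Attach a response (e.g., the elapsed time of one experimental unit).
# Simulated here; in practice it comes from the cluster executions.
runs <- as.data.frame(design)
runs$time <- rnorm(nrow(runs), mean = 100, sd = 10)

# Linear model with main effects and two-factor interactions,
# used to identify and rank the most influential factors.
fit <- lm(time ~ (Nodes + Cores + Disks + RAM + Volume)^2, data = runs)
summary(fit)

When residual diagnostics of such a model look poor, a Box-Cox transformation of the response (see the notes below) is the usual remedy before refitting.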






Notes
An experimental unit refers to the complete execution of the ML technique from start to finish, encompassing the training, testing, and evaluation tasks.
The cluster is created only to process the workload and is destroyed after execution to minimize interference.
If \(\lambda \approx -1\), \(y^* = 1/y\); if \(\lambda \approx -0.5\), \(y^* = 1/\sqrt{y}\); if \(\lambda \approx 0\), \(y^* = \log(y)\); if \(\lambda \approx 0.5\), \(y^* = \sqrt{y}\); and if \(\lambda \approx 1\), no transformation is applied (see the sketch after these notes).
Based on the R language and the Plotly library [62] for web-based data visualization.
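The two notes above refer to the Box-Cox family \(y^{(\lambda)} = (y^{\lambda}-1)/\lambda\) for \(\lambda \ne 0\) and \(y^{(\lambda)} = \log(y)\) for \(\lambda = 0\) (Box and Cox 1964), and to a parallel-coordinates view built with Plotly. A minimal, hedged sketch in R follows; the data, column names, and values are placeholders rather than results from the paper.

# Illustrative sketch only: Box-Cox lambda estimation and a parallel-coordinates view.
library(MASS)    # boxcox()
library(plotly)  # plot_ly() with the "parcoords" trace type

# Toy data standing in for measured cluster configurations and responses.
configs <- data.frame(nodes  = c(4, 8, 12, 20, 28),
                      cores  = c(12, 12, 32, 32, 32),
                      time_s = c(910, 640, 430, 320, 260),
                      cost   = c(1.8, 2.1, 2.9, 3.6, 4.1))

# Profile the Box-Cox log-likelihood over lambda, take the maximizer,
# and round it to the closest simple transformation listed in the note.
bc <- boxcox(lm(time_s ~ nodes + cores, data = configs),
             lambda = seq(-2, 2, by = 0.1), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]

# Parallel-coordinates visualization of configurations versus time and cost.
plot_ly(type = "parcoords",
        dimensions = list(
          list(label = "Nodes",    values = configs$nodes),
          list(label = "Cores",    values = configs$cores),
          list(label = "Time (s)", values = configs$time_s),
          list(label = "Cost",     values = configs$cost)))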
References
Tsai CW, Lai CF, Chao HC, Vasilakos AV (2015) Big data analytics: a survey. J Big Data 2(1):21. https://doi.org/10.1186/s40537-015-0030-3
Pospelova M (2015) Real time autotuning for mapreduce on hadoop/yarn. Ph.D. thesis, Carleton University Ottawa
Piatetsky-Shapiro G (1991) Knowledge discovery in real databases: a report on the IJCAI-89 workshop. AI Magazine 11(5):68
Cox M, Ellsworth D (1997) Managing big data for scientific visualization. ACM Siggraph 97:146–162
Luvizan S, Meirelles F, Diniz EH (2014) Big data: publication evolution and research opportunities. In: Anais da 11a Conferência Internacional sobre Sistemas de Informação e Gestão de Tecnologia, São Paulo, SP
Miller H (2013) Big-data in cloud computing: a taxonomy of risks. Information Research 18(1), paper 571. http://InformationR.net/ir/18-1/paper571.html
Ghemawat S, Gobioff H, Leung ST (2003) The Google file system. In: Proceedings of the nineteenth ACM symposium on operating systems principles, pp. 29–43
Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6 (USENIX Association), pp. 10–10
Zaharia M, Chowdhury M, Das T, Dave A, et al (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. Technical Report UCB/EECS-2011-82, EECS Department, University of California, Berkeley. https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
Li K, Deolalikar V, Pradhan N (2015) Big data gathering and mining pipelines for CRM using open-source. In: IEEE International Conference on Big Data (Big Data) (IEEE), pp. 2936–2938
Dharsandiya AN, Patel MR (2016) A review on frequent itemset mining algorithms in social network data. In: International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET) (IEEE), pp. 1046–1048
Maillo J, Ramírez S, Triguero I, Herrera F (2017) kNN-IS: an iterative Spark-based design of the k-nearest neighbors classifier for big data. Knowl-Based Syst 117:3
Poggi N, Berral JL, Carrera D, Call A, Gagliardi F, Reinauer R, Vujic N, Green D, Blakeley J (2015) From performance profiling to predictive analytics while evaluating hadoop cost-efficiency in aloja. In: IEEE International Conference on Big Data (IEEE, 2015), pp. 1220–1229. https://doi.org/10.1109/BigData.2015.7363876. http://ieeexplore.ieee.org/document/7363876/
Baldacci L, Golfarelli M (2018) A cost model for Spark SQL. IEEE Trans Knowl Data Eng 31(5):819
Munir RF, Abelló A (2019) Automatically configuring parallelism for hybrid layouts. European conference on advances in databases and information systems. Springer, Cham, pp 120–125
Borthakur D (2007) The hadoop distributed file system: architecture and design. Hadoop Project Website 11(2007):21
Iqbal MH, Soomro TR (2015) Big data analysis: apache storm perspective. Int J Comput Trends Technol 19(1):9–14
Fischer L, Gao S, Bernstein A (2015) Machines tuning machines: Configuring distributed stream processors with bayesian optimization. In: 2015 IEEE International conference on cluster computing (IEEE), pp. 22–31. https://doi.org/10.1109/CLUSTER.2015.13
Ruan J, Zheng Q, Dong B (2015) Optimal resource provisioning approach based on cost modeling for spark applications in public clouds. In: Proceedings of the Doctoral Symposium of the 16th International Middleware Conference on - Middleware Doct Symposium ’15. ACM (ACM Press, New York, New York, USA), pp. 1–4. https://doi.org/10.1145/2843966.2843972. http://dl.acm.org/citation.cfm?doid=2843966.2843972
Marsland S (2014) Machine learning: an algorithmic perspective. CRC Press, Boca Raton
Fisher RA, Wishart J (1945) The arrangement of field experiments and the statistical reduction of the results, vol 10. HM Stationery Office
Li M, Tan J, Wang Y, Zhang L, Salapura V (2015) SparkBench: a comprehensive benchmarking suite for in memory data analytic platform Spark. In: Proceedings of the 12th ACM International Conference on Computing Frontiers - CF ’15 (ACM Press, New York, New York, USA), CF ’15, pp. 1–8. https://doi.org/10.1145/2742854.2747283. http://dl.acm.org/citation.cfm?doid=2742854.2747283
Zaharia M, Konwinski A, Joseph AD, Katz R, Stoica I (2008) Improving MapReduce performance in heterogeneous environments. In: Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (USENIX Association, Berkeley, CA, USA), OSDI’08, pp. 29–42. http://dl.acm.org/citation.cfm?id=1855741.1855744
Arasanal RM, Rumani DU (2013) Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). https://doi.org/10.1007/978-3-642-36071-8_8
Herodotou H, Lim H, Luo G, Borisov N, Dong L, Cetin FB, Babu S (2011) Starfish: a self-tuning system for big data analytics. Cidr 11:261–272
Lin X, Wang P, Wu B (2013) Log analysis in cloud computing environment with Hadoop and Spark. In: 5th IEEE International Conference on Broadband Network & Multimedia Technology (IEEE), pp. 273–276. https://doi.org/10.1109/ICBNMT.2013.6823956
Ardagna D, Bernardi S, Gianniti E, Aliabadi SK, Perez-Palacin D, Requeno JI (2016) Modeling performance of hadoop applications: a journey from queueing networks to stochastic well formed nets. International conference on algorithms and architectures for parallel processing. Springer, Cham, pp 599–613
Sidhanta S, Golab W, Mukhopadhyay S (2016) Optex: a deadline-aware cost optimization model for spark. In: 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid) (IEEE), pp. 193–202
Venkataraman S, Yang Z, Franklin M, Recht B, Nsdi I (2016) Ernest : efficient performance prediction for large-scale advanced analytics. In: NSDI’16 Proceedings of the 13th USENIX conference on networked systems design and implementation, pp. 363–378
Barr RS, Golden BL, Kelly JP, Resende MG, Stewart WR (1995) Designing and reporting on computational experiments with heuristic methods. J Heuristics 1(1):9–32. https://doi.org/10.1007/BF02430363
Hooker J (1995) Testing heuristics: we have it all wrong. J Heuristics 1:33–42. https://doi.org/10.1007/BF02430364
Wineberg M, Christensen S (2004) An introduction to statistics for EC experimental analysis. Tutorial at the IEEE Congress on Evolutionary Computation
Rathod M, Suthar D, Patel H, Shelat P, Parejiya P (2019) Microemulsion based nasal spray: a systemic approach for non-CNS drug, its optimization, characterization and statistical modelling using QbD principles. J Drug Deliv Sci Technol 49:286
Kuo CC, Liu HA, Chang CM (2020) Optimization of vacuum casting process parameters to enhance tensile strength of components using design of experiments approach. Int J Adv Manuf Technol 106(9):3775–3785
Amin MM, Kiani A (2020) Multi-disciplinary analysis of a strip stabilizer using body-fluid-structure interaction simulation and design of experiments (DOE). J Appl Fluid Mech 13(1):261
Packianather M, Drake P, Rowlands H (2000) Optimizing the parameters of multilayered feedforward neural networks through Taguchi design of experiments. Qual Reliab Eng Int 16(6):461
Staelin C (2003) Parameter selection for support vector machines, Hewlett-Packard Company, Tech. Rep. HPL-2002-354R1 1
Bates S, Sienz J, Toropov V (2004) Formulation of the optimal Latin hypercube design of experiments using a permutation genetic algorithm. In: 45th AIAA/ASME/ASCE/AHS/ASC structures, structural dynamics & materials conference, p. 2011
Ridge E (2007) Design of experiments for the tuning of optimisation algorithms (Citeseer)
Balestrassi PP, Popova E, Paiva Ad, Lima JM (2009) Design of experiments on neural network’s training for nonlinear time series forecasting. Neurocomputing 72(4–6):1160
Pais MS, Peretta IS, Yamanaka K, Pinto ER (2014) Factorial design analysis applied to the performance of parallel evolutionary algorithms. J Brazil Comput Soc 20(1):6
Durakovic B (2017) Design of experiments application, concepts, examples: state of the art. Period Eng Nat Sci. https://doi.org/10.21533/pen.v5i3.145
Laney D (2001) 3D data management: controlling data volume, velocity and variety. META Group Res Note 6(70):1
Chang WL, Grady N et al (2015) Nist big data interoperability framework: Volume 1, big data definitions. Tech Rep. https://doi.org/10.6028/NIST.SP.1500-1
Huai Y, Lee R, Zhang S, Xia CH, Zhang X (2011) In: Proceedings of the 2nd ACM Symposium on Cloud Computing (ACM), p. 4
Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107. https://doi.org/10.1145/1327452.1327492
Maitrey S, Jha CK (2015) MapReduce: simplified data analysis of big data. Proc Comput Sci 57:563–571. https://doi.org/10.1016/j.procs.2015.07.392
Shi J, Qiu Y, Minhas UF, Jiao L, Wang C, Reinwald B (2015) Clash of the titans: Mapreduce vs. spark for large scale data analytics. Proc VLDB Endowment 8(13):2110–2121
Landset S, Khoshgoftaar TM, Richter AN, Hasanin T (2015) A survey of open source tools for machine learning with big data in the Hadoop ecosystem. J Big Data 2(1):24
Lawson J (2014) Design and analysis of experiments with R. CRC Press, Boca Raton
Jain R (1990) The art of computer systems performance analysis: techniques for experimental design, measurement, simulation, and modeling. Wiley & Sons, Hoboken
Montgomery D (2017) Design and analysis of experiments. John Wiley & Sons, Hoboken
Montgomery DC (2017) Design and analysis of experiments. John Wiley & Sons, Hoboken
Hosmer DW Jr, Lemeshow S, Sturdivant RX (2013) Applied logistic regression, vol 398. John Wiley & Sons, Hoboken
Genuer R, Poggi JM, Tuleau-Malot C, Villa-Vialaneix N (2017) Random forests for big data. Big Data Res 9:28
Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–97
Rosenblatt F (1958) The perceptron: a probabilistic model for information storage and organization in the brain. Psychol Rev 65(6):386
McCallum A, Nigam K (1998) A comparison of event models for naive Bayes text classification. In: AAAI-98 Workshop on Learning for Text Categorization 752:41–48
Maciel JRGVP (2020) Pt7 web, an annotated Portuguese language corpus. IEEE DataPort. https://doi.org/10.21227/fhrm-n966
Wenzek G, Lachaux MA, Conneau A, Chaudhary V, Guzman F, Joulin A, Grave E (2019) CCNet: extracting high quality monolingual datasets from web crawl data. arXiv preprint arXiv:1911.00359
Box GE, Cox DR (1964) An analysis of transformations. J Royal Stat Soc: Series B (Methodological) 26(2):211
Sievert C (2020) Interactive web-based data visualization with R, plotly, and shiny. CRC Press, Boca Raton
Inselberg A (1985) The plane with parallel coordinates. Vis Comput 1(2):69
Inselberg A (2009) Parallel coordinates: visual multidimensional geometry and its applications, vol 20. Springer, Berlin
Cite this article
Rodrigues, J.B., Vasconcelos, G.C. & Maciel, P.R.M. Screening hardware and volume factors in distributed machine learning algorithms on spark. Computing 103, 2203–2225 (2021). https://doi.org/10.1007/s00607-021-00965-3