Screening hardware and volume factors in distributed machine learning algorithms on spark

A design of experiments (DoE) based approach

  • Regular Paper
  • Published in: Computing

Abstract

This paper presents an approach to investigate distributed machine learning workloads on Spark. The work analyzes hardware and data volume factors with respect to time and cost performance when applying machine learning (ML) techniques in big data scenarios. The method is based on the Design of Experiments (DoE) approach and applies a randomized two-level fractional factorial design with replications to screen the most relevant factors. A Web corpus was built from 16 million webpages from Portuguese-speaking countries. The application was binary text classification to distinguish Brazilian Portuguese from other variants of the language. Five machine learning algorithms were examined: Logistic Regression, Random Forest, Support Vector Machines, Naïve Bayes, and Multilayer Perceptron. The data was processed using real clusters having up to 28 nodes, each composed of 12 or 32 cores, 1 or 7 SSD disks, and 3x or 6x RAM per core, totaling a maximum computational power of 896 cores and 5.25 TB of RAM. Linear models were applied to identify, analyze, and rank the influence of the factors. A total of 240 experiments were carefully organized to maximize the detection of non-confounded effects up to the second order while minimizing the experimental effort. Our results include linear models to estimate time and cost performance, statistical inferences about effects, and a visualization tool based on parallel coordinates to aid decision making about cluster configuration.

Notes

  1. An experimental unit refers to the complete execution of the ML technique from start to finish, encompassing training, testing, and evaluation tasks.

  2. Clusters were created solely to process the workload and were destroyed after execution to minimize interference.

  3. If \(\lambda \approx -1 \rightarrow y^* = 1/y\); \(\lambda \approx -0.5 \rightarrow y^* = \sqrt{1/y}\); \(\lambda \approx 0 \rightarrow y^* = \log(y)\); \(\lambda \approx 0.5 \rightarrow y^* = \sqrt{y}\); and \(\lambda \approx 1 \rightarrow\) no transformation.

  4. Based on the R language and the Plotly library [62] for web-based data visualization.
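
The transformation ladder in note 3 can be selected automatically by maximizing the Box-Cox profile log-likelihood over candidate values of \(\lambda\). A minimal pure-Python sketch, with illustrative data (in practice a library routine such as scipy.stats.boxcox does this over a continuous range):

```python
from math import log

def boxcox_loglik(y, lam):
    """Profile log-likelihood of the Box-Cox transform for a fixed lambda:
    -(n/2) ln(sigma^2 of transformed data) + (lambda - 1) sum(ln y)."""
    n = len(y)
    if abs(lam) < 1e-12:
        z = [log(v) for v in y]                 # lambda = 0 -> log transform
    else:
        z = [(v ** lam - 1.0) / lam for v in y]  # standard Box-Cox form
    mean = sum(z) / n
    var = sum((v - mean) ** 2 for v in z) / n    # MLE (population) variance
    return -0.5 * n * log(var) + (lam - 1.0) * sum(log(v) for v in y)

# Pick the best lambda from the ladder listed in note 3.
ladder = [-1.0, -0.5, 0.0, 0.5, 1.0]
y = [2.71828, 4.48169, 7.38906, 12.18249, 20.08554]  # roughly exp(1)..exp(3)
best = max(ladder, key=lambda lam: boxcox_loglik(y, lam))
```

For data that is exponential in an underlying linear scale, as here, the log transform (\(\lambda \approx 0\)) wins, matching the ladder's reading.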
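
The core of the parallel-coordinates tool in note 4 is a per-axis min-max normalization, so every cluster configuration becomes one polyline across shared [0, 1] vertical axes. A minimal sketch with hypothetical configurations (the factor values are illustrative, not the paper's data; the paper's tool renders these lines with Plotly):

```python
def parallel_coordinates(rows, columns):
    """Min-max normalize each column to [0, 1]; each row becomes the
    list of vertical positions of one polyline, one per axis."""
    lo = {c: min(r[c] for r in rows) for c in columns}
    hi = {c: max(r[c] for r in rows) for c in columns}
    return [
        [(r[c] - lo[c]) / (hi[c] - lo[c]) if hi[c] > lo[c] else 0.5
         for c in columns]
        for r in rows
    ]

# Hypothetical cluster configurations and observed runtimes.
configs = [
    {"cores": 12, "ram_x": 3, "disks": 1, "time_s": 410.0},
    {"cores": 32, "ram_x": 6, "disks": 7, "time_s": 150.0},
    {"cores": 32, "ram_x": 3, "disks": 1, "time_s": 230.0},
]
lines = parallel_coordinates(configs, ["cores", "ram_x", "disks", "time_s"])
```

Reading the plot, a decision maker can trace which factor settings lead to low positions on the time (or cost) axis, which is how the visualization supports configuration choices.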

References

  1. Tsai CW, Lai CF, Chao HC, Vasilakos AV (2015) Big data analytics: a survey. J Big Data 2(1):21. https://doi.org/10.1186/s40537-015-0030-3

  2. Pospelova M (2015) Real time autotuning for MapReduce on Hadoop/YARN. PhD thesis, Carleton University, Ottawa

  3. Piatetsky-Shapiro G (1991) Knowledge discovery in real databases: a report on the IJCAI-89 workshop. AI Magazine 11(5):68

  4. Cox M, Ellsworth D (1997) Managing big data for scientific visualization. ACM Siggraph 97:146–162

  5. Luvizan S, Meirelles F, Diniz EH (2014) Big Data: publication evolution and research opportunities. In: Anais da 11a Conferência Internacional sobre Sistemas de Informação e Gestão de Tecnologia, São Paulo, SP

  6. Miller H (2013) Big-data in cloud computing: a taxonomy of risks. Information Research 18(1), paper 571. Available at http://InformationR.net/ir/18-1/paper571.html

  7. Ghemawat S, Gobioff H, Leung ST (2003) The Google file system. In: Proceedings of the nineteenth ACM symposium on operating systems principles, pp. 29–43

  8. Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation (OSDI'04), vol 6. USENIX Association, pp 10–10

  9. Zaharia M, Chowdhury M, Das T, Dave A, et al (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. Technical Report UCB/EECS-2011-82, EECS Department, University of California, Berkeley. https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf

  10. Li K, Deolalikar V, Pradhan N (2015) Big data gathering and mining pipelines for CRM using open-source. In: 2015 IEEE International Conference on Big Data (Big Data), IEEE, pp 2936–2938

  11. Dharsandiya AN, Patel MR (2016) A review on frequent itemset mining algorithms in social network data. In: 2016 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET), IEEE, pp 1046–1048

  12. Maillo J, Ramírez S, Triguero I, Herrera F (2017) kNN-IS: an iterative Spark-based design of the k-nearest neighbors classifier for big data. Knowl-Based Syst 117:3

  13. Poggi N, Berral JL, Carrera D, Call A, Gagliardi F, Reinauer R, Vujic N, Green D, Blakeley J (2015) From performance profiling to predictive analytics while evaluating hadoop cost-efficiency in aloja. In: IEEE International Conference on Big Data (IEEE, 2015), pp. 1220–1229. https://doi.org/10.1109/BigData.2015.7363876. http://ieeexplore.ieee.org/document/7363876/

  14. Baldacci L, Golfarelli M (2018) A cost model for Spark SQL. IEEE Trans Knowl Data Eng 31(5):819

  15. Munir RF, Abelló A (2019) Automatically configuring parallelism for hybrid layouts. European conference on advances in databases and information systems. Springer, Cham, pp 120–125

  16. Borthakur D (2007) The hadoop distributed file system: architecture and design. Hadoop Project Website 11(2007):21

  17. Iqbal MH, Soomro TR (2015) Big data analysis: apache storm perspective. Int J Comput Trends Technol 19(1):9–14

  18. Fischer L, Gao S, Bernstein A (2015) Machines tuning machines: Configuring distributed stream processors with bayesian optimization. In: 2015 IEEE International conference on cluster computing (IEEE), pp. 22–31. https://doi.org/10.1109/CLUSTER.2015.13

  19. Ruan J, Zheng Q, Dong B (2015) Optimal resource provisioning approach based on cost modeling for Spark applications in public clouds. In: Proceedings of the Doctoral Symposium of the 16th International Middleware Conference. ACM Press, New York, pp 1–4. https://doi.org/10.1145/2843966.2843972

  20. Marsland S (2014) Machine learning: an algorithmic perspective. CRC Press, Boca Raton

  21. Fisher RA, Wishart J (1945) The arrangement of field experiments and the statistical reduction of the results, vol 10. HM Stationery Office

  22. Li M, Tan J, Wang Y, Zhang L, Salapura V (2015) SparkBench: a comprehensive benchmarking suite for in memory data analytic platform Spark. In: Proceedings of the 12th ACM International Conference on Computing Frontiers (CF '15). ACM Press, New York, pp 1–8. https://doi.org/10.1145/2742854.2747283

  23. Zaharia M, Konwinski A, Joseph AD, Katz R, Stoica I (2008) Improving MapReduce performance in heterogeneous environments. In: Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (USENIX Association, Berkeley, CA, USA), OSDI’08, pp. 29–42. http://dl.acm.org/citation.cfm?id=1855741.1855744

  24. Arasanal RM, Rumani DU (2013) Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). https://doi.org/10.1007/978-3-642-36071-8-8

  25. Herodotou H, Lim H, Luo G, Borisov N, Dong L, Cetin FB, Babu S (2011) Starfish: a self-tuning system for big data analytics. Cidr 11:261–272

  26. Lin X, Wang P, Wu B (2013) Log analysis in cloud computing environment with Hadoop and Spark. In: 5th IEEE International Conference on Broadband Network & Multimedia Technology, IEEE, pp 273–276. https://doi.org/10.1109/ICBNMT.2013.6823956

  27. Ardagna D, Bernardi S, Gianniti E, Aliabadi SK, Perez-Palacin D, Requeno JI (2016) Modeling performance of hadoop applications: a journey from queueing networks to stochastic well formed nets. International conference on algorithms and architectures for parallel processing. Springer, Cham, pp 599–613

  28. Sidhanta S, Golab W, Mukhopadhyay S (2016) Optex: a deadline-aware cost optimization model for spark. In: 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid) (IEEE), pp. 193–202

  29. Venkataraman S, Yang Z, Franklin M, Recht B, Stoica I (2016) Ernest: efficient performance prediction for large-scale advanced analytics. In: NSDI'16: Proceedings of the 13th USENIX Conference on Networked Systems Design and Implementation, pp 363–378

  30. Barr RS, Golden BL, Kelly JP, Resende MG, Stewart WR (1995) Designing and reporting on computational experiments with heuristic methods. J Heuristics 1(1):9–32. https://doi.org/10.1007/BF02430363

  31. Hooker J (1995) Testing heuristics: we have it all wrong. J Heuristics 1:33–42. https://doi.org/10.1007/BF02430364

  32. Wineberg M, Christensen S (2004) An introduction to statistics for EC experimental analysis. Tutorial at the IEEE Congress on Evolutionary Computation

  33. Rathod M, Suthar D, Patel H, Shelat P, Parejiya P (2019) Microemulsion based nasal spray: a systemic approach for non-CNS drug, its optimization, characterization and statistical modelling using QbD principles. J Drug Deliv Sci Technol 49:286

  34. Kuo CC, Liu HA, Chang CM (2020) Optimization of vacuum casting process parameters to enhance tensile strength of components using design of experiments approach. Int J Adv Manuf Technol 106(9):3775–3785

  35. Amin MM, Kiani A (2020) Multi-disciplinary analysis of a strip stabilizer using body-fluid-structure interaction simulation and design of experiments (DOE). J Appl Fluid Mech 13(1):261

  36. Packianather M, Drake P, Rowlands H (2000) Optimizing the parameters of multilayered feedforward neural networks through Taguchi design of experiments. Qual Reliab Eng Int 16(6):461

  37. Staelin C (2003) Parameter selection for support vector machines. Tech Rep HPL-2002-354R1, Hewlett-Packard Company

  38. Bates S, Sienz J, Toropov V (2004) Formulation of the optimal Latin hypercube design of experiments using a permutation genetic algorithm. In: 45th AIAA/ASME/ASCE/AHS/ASC structures, structural dynamics & materials conference, p. 2011

  39. Ridge E (2007) Design of experiments for the tuning of optimisation algorithms (Citeseer)

  40. Balestrassi PP, Popova E, Paiva Ad, Lima JM (2009) Design of experiments on neural network’s training for nonlinear time series forecasting. Neurocomputing 72(4–6):1160

  41. Pais MS, Peretta IS, Yamanaka K, Pinto ER (2014) Factorial design analysis applied to the performance of parallel evolutionary algorithms. J Brazil Comput Soc 20(1):6

  42. Durakovic B (2017) Design of experiments application, concepts, examples: state of the art. Period Eng Nat Sci. https://doi.org/10.21533/pen.v5i3.145

  43. Laney D (2001) 3D data management: controlling data volume, velocity and variety. META Group Res Note 6(70):1

  44. Chang WL, Grady N et al (2015) NIST big data interoperability framework: volume 1, big data definitions. Tech Rep. https://doi.org/10.6028/NIST.SP.1500-1

  45. Huai Y, Lee R, Zhang S, Xia CH, Zhang X (2011) DOT: a matrix model for analyzing, optimizing and deploying software for big data analytics in distributed systems. In: Proceedings of the 2nd ACM Symposium on Cloud Computing, ACM, p 4

  46. Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107. https://doi.org/10.1145/1327452.1327492

  47. Maitrey S, Jha CK (2015) MapReduce: simplified data analysis of big data. Proc Comput Sci 57:563–571. https://doi.org/10.1016/j.procs.2015.07.392

  48. Shi J, Qiu Y, Minhas UF, Jiao L, Wang C, Reinwald B (2015) Clash of the titans: Mapreduce vs. spark for large scale data analytics. Proc VLDB Endowment 8(13):2110–2121

  49. Landset S, Khoshgoftaar TM, Richter AN, Hasanin T (2015) A survey of open source tools for machine learning with big data in the Hadoop ecosystem. J Big Data 2(1):24

  50. Lawson J (2014) Design and analysis of experiments with R. CRC Press, Boca Raton

  51. Jain R (1990) The art of computer systems performance analysis: techniques for experimental design, measurement, simulation, and modeling. John Wiley & Sons, Hoboken

  52. Montgomery D (2017) Design and analysis of experiments. John Wiley & Sons, Hoboken

  53. Montgomery DC (2017) Design and analysis of experiments. John Wiley & Sons, Hoboken

  54. Hosmer DW Jr, Lemeshow S, Sturdivant RX (2013) Applied logistic regression, vol 398. John Wiley & Sons, Hoboken

  55. Genuer R, Poggi JM, Tuleau-Malot C, Villa-Vialaneix N (2017) Random forests for big data. Big Data Res 9:28

  56. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297

  57. Rosenblatt F (1958) The perceptron: a probabilistic model for information storage and organization in the brain. Psychol Rev 65(6):386

  58. McCallum A, Nigam K (1998) A comparison of event models for naive Bayes text classification. In: AAAI-98 Workshop on Learning for Text Categorization, vol 752, pp 41–48

  59. Rodrigues J, Vasconcelos G, Maciel P (2020) Pt7 web, an annotated Portuguese language corpus. IEEE DataPort. https://doi.org/10.21227/fhrm-n966

  60. Wenzek G, Lachaux MA, Conneau A, Chaudhary V, Guzman F, Joulin A, Grave E (2019) CCNet: extracting high quality monolingual datasets from web crawl data. arXiv preprint arXiv:1911.00359

  61. Box GE, Cox DR (1964) An analysis of transformations. J Royal Stat Soc: Series B (Methodological) 26(2):211

  62. Sievert C (2020) Interactive web-based data visualization with R, plotly, and shiny. CRC Press, Boca Raton

  63. Inselberg A (1985) The plane with parallel coordinates. Vis Comput 1(2):69

  64. Inselberg A (2009) Parallel coordinates: visual multidimensional geometry and its applications, vol 20. Springer, Berlin

Author information

Correspondence to Jairson B. Rodrigues.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Rodrigues, J.B., Vasconcelos, G.C. & Maciel, P.R.M. Screening hardware and volume factors in distributed machine learning algorithms on spark. Computing 103, 2203–2225 (2021). https://doi.org/10.1007/s00607-021-00965-3
