Abstract
The emergence of Machine Learning (ML) has altered how researchers and business professionals value data. Although ML is applicable to almost every industry, considerable time is wasted building bespoke applications and repetitively hand-tuning models to reach optimal performance, and for many practitioners the complexity of the field and a lack of specialist knowledge remain a hindrance. This has driven increasing demand for automation of the complete ML workflow, from data preprocessing to model selection, known as Automated Machine Learning (AutoML). Although AutoML solutions have been developed, Big Data remains an impediment for large organisations with massive data outputs: current methods cannot extract value from large volumes of data because they are tightly coupled to centralised ML libraries, which limits their scaling potential. This paper introduces Hyper-Stacked, a novel AutoML component built natively on Apache Spark. Hyper-Stacked combines multi-fidelity hyperparameter optimisation with the Super Learner stacking technique to produce a strong and diverse ensemble. Integration with Spark allows for a parallelised and distributed approach, capable of handling the volume and complexity associated with Big Data. Scalability is demonstrated through an in-depth analysis of speedup, sizeup and scaleup.
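As a rough illustration of the Super Learner stacking idea mentioned in the abstract (a sketch only, not the paper's Spark-based implementation), the following NumPy code builds a matrix of k-fold out-of-fold base-learner predictions and fits a meta-learner on it. The toy least-squares base learners and the `super_learner` helper are illustrative assumptions; the original method additionally constrains the meta-learner's weights, which is omitted here for brevity.

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    # Shuffle row indices and split them into k roughly equal folds.
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

def fit_ls(X, y):
    # Ordinary least squares with a bias column.
    Xb = np.c_[np.ones(len(X)), X]
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def predict_ls(w, X):
    return np.c_[np.ones(len(X)), X] @ w

# Two toy "base learners": OLS on all features, OLS on the first feature only.
base_learners = [
    (lambda X, y: fit_ls(X, y),        lambda m, X: predict_ls(m, X)),
    (lambda X, y: fit_ls(X[:, :1], y), lambda m, X: predict_ls(m, X[:, :1])),
]

def super_learner(X, y, k=5):
    n = len(y)
    folds = kfold_indices(n, k)
    Z = np.zeros((n, len(base_learners)))  # out-of-fold predictions
    for j, (fit, pred) in enumerate(base_learners):
        for f in range(k):
            test = folds[f]
            train = np.concatenate([folds[g] for g in range(k) if g != f])
            model = fit(X[train], y[train])
            Z[test, j] = pred(model, X[test])
    # Meta-learner fit on the out-of-fold predictions (plain OLS here;
    # the original Super Learner uses a constrained weighted combination).
    meta = fit_ls(Z, y)
    # Refit each base learner on the full data for deployment.
    full = [fit(X, y) for fit, _ in base_learners]
    def predict(Xnew):
        Znew = np.column_stack(
            [pred(m, Xnew) for m, (_, pred) in zip(full, base_learners)]
        )
        return predict_ls(meta, Znew)
    return predict
```

Because each base learner is scored on folds it never saw, the meta-learner combines the models according to their generalisation behaviour rather than their training fit, which is what makes the ensemble "strong and diverse"; the per-fold training loop is also the part that parallelises naturally over a Spark cluster.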
Acknowledgement
This work is supported by projects A-TIC-434-UGR20 and PID2020-119478GB-I00.
Copyright information
© 2023 IFIP International Federation for Information Processing
Cite this paper
Dave, R., Angarita-Zapata, J.S., Triguero, I. (2023). Hyper-Stacked: Scalable and Distributed Approach to AutoML for Big Data. In: Holzinger, A., Kieseberg, P., Cabitza, F., Campagner, A., Tjoa, A.M., Weippl, E. (eds) Machine Learning and Knowledge Extraction. CD-MAKE 2023. Lecture Notes in Computer Science, vol 14065. Springer, Cham. https://doi.org/10.1007/978-3-031-40837-3_6
Print ISBN: 978-3-031-40836-6
Online ISBN: 978-3-031-40837-3