Abstract
The emergence of Machine Learning (ML) has altered how researchers and business professionals value data. Although ML is applicable to almost every industry, considerable time is wasted building bespoke applications and repetitively hand-tuning models to reach optimal performance, and for many practitioners the complexity of the field and a lack of specialist knowledge remain a hindrance. This has driven increasing demand for automation of the complete ML workflow, from data preprocessing to model selection, known as Automated Machine Learning (AutoML). Although AutoML solutions have been developed, Big Data remains an impediment for large organisations with massive data outputs: current methods cannot extract value from large volumes of data because they are tightly coupled to centralised ML libraries, which limits their scaling potential. This paper introduces Hyper-Stacked, a novel AutoML component built natively on Apache Spark. Hyper-Stacked combines multi-fidelity hyperparameter optimisation with the Super Learner stacking technique to produce a strong and diverse ensemble. Integration with Spark allows for a parallelised and distributed approach, capable of handling the volume and complexity associated with Big Data. Scalability is demonstrated through an in-depth analysis of speedup, sizeup and scaleup.
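As a rough illustration of the Super Learner stacking idea mentioned in the abstract (a sketch only, not the paper's Spark-based implementation), the following NumPy code builds a matrix of k-fold out-of-fold base-learner predictions and fits a meta-learner on it. The toy least-squares base learners and the `super_learner` helper are illustrative assumptions; the original method additionally constrains the meta-learner's weights, which is omitted here for brevity.

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    # Shuffle row indices and split them into k roughly equal folds.
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

def fit_ls(X, y):
    # Ordinary least squares with a bias column.
    Xb = np.c_[np.ones(len(X)), X]
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def predict_ls(w, X):
    return np.c_[np.ones(len(X)), X] @ w

# Two toy "base learners": OLS on all features, OLS on the first feature only.
base_learners = [
    (lambda X, y: fit_ls(X, y),        lambda m, X: predict_ls(m, X)),
    (lambda X, y: fit_ls(X[:, :1], y), lambda m, X: predict_ls(m, X[:, :1])),
]

def super_learner(X, y, k=5):
    n = len(y)
    folds = kfold_indices(n, k)
    Z = np.zeros((n, len(base_learners)))  # out-of-fold predictions
    for j, (fit, pred) in enumerate(base_learners):
        for f in range(k):
            test = folds[f]
            train = np.concatenate([folds[g] for g in range(k) if g != f])
            model = fit(X[train], y[train])
            Z[test, j] = pred(model, X[test])
    # Meta-learner fit on the out-of-fold predictions (plain OLS here;
    # the original Super Learner uses a constrained weighted combination).
    meta = fit_ls(Z, y)
    # Refit each base learner on the full data for deployment.
    full = [fit(X, y) for fit, _ in base_learners]
    def predict(Xnew):
        Znew = np.column_stack(
            [pred(m, Xnew) for m, (_, pred) in zip(full, base_learners)]
        )
        return predict_ls(meta, Znew)
    return predict
```

Because each base learner is scored on folds it never saw, the meta-learner combines the models according to their generalisation behaviour rather than their training fit, which is what makes the ensemble "strong and diverse"; the per-fold training loop is also the part that parallelises naturally over a Spark cluster.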
Acknowledgement
This work is supported by projects A-TIC-434-UGR20 and PID2020-119478GB-I00.
Copyright information
© 2023 IFIP International Federation for Information Processing
Cite this paper
Dave, R., Angarita-Zapata, J.S., Triguero, I. (2023). Hyper-Stacked: Scalable and Distributed Approach to AutoML for Big Data. In: Holzinger, A., Kieseberg, P., Cabitza, F., Campagner, A., Tjoa, A.M., Weippl, E. (eds) Machine Learning and Knowledge Extraction. CD-MAKE 2023. Lecture Notes in Computer Science, vol 14065. Springer, Cham. https://doi.org/10.1007/978-3-031-40837-3_6
Print ISBN: 978-3-031-40836-6
Online ISBN: 978-3-031-40837-3