
Hyper-Stacked: Scalable and Distributed Approach to AutoML for Big Data

  • Conference paper
  • In: Machine Learning and Knowledge Extraction (CD-MAKE 2023)

Abstract

The emergence of Machine Learning (ML) has altered how researchers and business professionals value data. Although ML is applicable to almost every industry, considerable amounts of time are wasted creating bespoke applications and repetitively hand-tuning models to reach optimal performance. For some, the outcome justifies the effort; for many others, the complexity of ML and a lack of expertise in the field become a hindrance. This has driven increasing demand for automation of the complete ML workflow, from data preprocessing to model selection, known as Automated Machine Learning (AutoML). Although AutoML solutions have been developed, Big Data remains an impediment for large organisations with massive data outputs: current methods cannot extract value from large volumes of data because of their tight coupling with centralised ML libraries, which limits their scaling potential. This paper introduces Hyper-Stacked, a novel AutoML component built natively on Apache Spark. Hyper-Stacked combines multi-fidelity hyperparameter optimisation with the Super Learner stacking technique to produce a strong and diverse ensemble. Integration with Spark allows a parallelised and distributed approach capable of handling the volume and complexity associated with Big Data. Scalability is demonstrated through an in-depth analysis of speedup, sizeup and scaleup.
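The multi-fidelity hyperparameter optimisation mentioned in the abstract is typified by successive halving, the rung mechanism underlying Hyperband [12]: many configurations are evaluated cheaply, and only the best fraction is promoted to a larger budget. The sketch below is a minimal, self-contained illustration of that idea, not Hyper-Stacked's actual implementation; the toy loss function and candidate encoding are hypothetical, and a real system would train Spark ML models at each budget.

```python
import random

def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Evaluate surviving configurations on a growing budget,
    keeping the best 1/eta fraction at each rung."""
    budget = min_budget
    survivors = list(configs)
    while len(survivors) > 1:
        scores = [(evaluate(c, budget), c) for c in survivors]
        scores.sort(key=lambda t: t[0])                    # lower loss is better
        survivors = [c for _, c in scores[:max(1, len(scores) // eta)]]
        budget *= eta                                      # promote with more resources
    return survivors[0]

# Hypothetical objective: the observed loss approaches the config's "true"
# quality as the budget grows (noise shrinks with more resources).
def toy_loss(config, budget):
    noise = random.Random(hash((config, budget))).uniform(0, 1.0 / budget)
    return config + noise                                  # config value = true loss

random.seed(0)
candidates = [round(random.uniform(0, 1), 3) for _ in range(16)]
best = successive_halving(candidates, toy_loss)
print("selected:", best)
```

With 16 candidates and eta = 2, the search spends most of its evaluations at the cheapest budget and only two configurations ever reach the largest one, which is what makes the approach attractive when each full training run on Big Data is expensive.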



Notes

  1. https://github.com/jsebanaz90/Hyper-Stacked.
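The Super Learner stacking scheme named in the abstract [10, 21] builds out-of-fold predictions from several base learners via K-fold cross-validation and feeds them to a meta-learner. The following pure-Python sketch illustrates that mechanism on toy 1-D data under loud assumptions: the two threshold "learners" and the accuracy-weighted-vote meta-learner are stand-ins chosen for brevity, whereas Hyper-Stacked trains Spark ML models in parallel and fits a proper meta-model.

```python
from statistics import mean, median

def fit_mean(train):      # base learner 1: threshold at the mean of x
    t = mean(x for x, _ in train)
    return lambda x: 1 if x > t else 0

def fit_median(train):    # base learner 2: threshold at the median of x
    t = median(x for x, _ in train)
    return lambda x: 1 if x > t else 0

def super_learner(data, base_fits, k=5):
    folds = [data[i::k] for i in range(k)]
    oof = {b: [] for b in range(len(base_fits))}   # out-of-fold predictions
    truth = []
    for i, fold in enumerate(folds):
        train = [p for j, f in enumerate(folds) if j != i for p in f]
        models = [fit(train) for fit in base_fits]
        for x, y in fold:
            truth.append(y)
            for b, m in enumerate(models):
                oof[b].append(m(x))
    # Meta-learner (simplified here): weight each base learner by its
    # out-of-fold accuracy, then take a weighted majority vote.
    weights = [sum(p == t for p, t in zip(oof[b], truth)) / len(truth)
               for b in range(len(base_fits))]
    final_models = [fit(data) for fit in base_fits]  # refit on the full data
    def predict(x):
        score = sum(w * m(x) for w, m in zip(weights, final_models))
        return 1 if score > sum(weights) / 2 else 0
    return predict, weights

data = [(x / 10, int(x >= 5)) for x in range(10)]    # separable toy data
predict, weights = super_learner(data, [fit_mean, fit_median], k=5)
print([predict(x / 10) for x in range(10)])
```

The key property preserved from the Super Learner is that the meta-learner only ever sees predictions made on data the base model was not trained on, which guards the ensemble against overfitting to strong but overconfident base learners.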

References

  1. Abd Elrahman, A., El Helw, M., Elshawi, R., Sakr, S.: D-SmartML: a distributed automated machine learning framework. In: 2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS), pp. 1215–1218 (2020). https://doi.org/10.1109/ICDCS47774.2020.00115

  2. Bergstra, J., Bengio, Y.: Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13(10), 281–305 (2012). https://jmlr.org/papers/v13/bergstra12a.html

  3. Christiansen, B.: Ensemble averaging and the curse of dimensionality. J. Clim. 31(4), 1587–1596 (2018)


  4. Erickson, N., et al.: AutoGluon-Tabular: robust and accurate AutoML for structured data. arXiv preprint arXiv:2003.06505 (2020)

  5. Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., Hutter, F.: Efficient and robust automated machine learning. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems, pp. 2962–2970. Curran Associates, Inc. (2015)


  6. Guo, Z., Fox, G., Zhou, M.: Investigation of data locality in MapReduce. In: 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID 2012), pp. 419–426 (2012). https://doi.org/10.1109/CCGrid.2012.42

  7. Hutter, F., Kotthoff, L., Vanschoren, J. (eds.): Automated Machine Learning: Methods, Systems, Challenges. Springer, Heidelberg (2018). https://doi.org/10.1007/978-3-030-05318-5


  8. Karnin, Z., Koren, T., Somekh, O.: Almost optimal exploration in multi-armed bandits. In: Proceedings of the 30th International Conference on International Conference on Machine Learning, pp. III-1238–III-1246. JMLR.org (2013)


  9. Kumar, K.A., Gluck, J., Deshpande, A., Lin, J.: Hone: “scaling down’’ hadoop on shared-memory systems. Proc. VLDB Endow. 6(12), 1354–1357 (2013). https://doi.org/10.14778/2536274.2536314


  10. van der Laan, M.J., Polley, E.C., Hubbard, A.E.: Super learner. Stat. Appl. Genet. Mol. Biol. 6(1), 1–23 (2007)


  11. LeDell, E., Poirier, S.: H2O AutoML: scalable automatic machine learning. In: 7th ICML Workshop on Automated Machine Learning (AutoML) (2020)


  12. Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., Talwalkar, A.: Hyperband: a novel bandit-based approach to hyperparameter optimization. J. Mach. Learn. Res. 18, 1–52 (2018)


  13. Liashchynskyi, P.B., Liashchynskyi, P.B.: Grid search, random search, genetic algorithm: a big comparison for NAS. ArXiv abs/1912.06059 (2019)


  14. March, A., Willcox, K.: Constrained multifidelity optimization using model calibration. Struct. Multidisc. Optim. 46, 93–109 (2012). https://doi.org/10.1007/s00158-011-0749-1


  15. Moore, K., et al.: TransmogrifAI (2017). https://github.com/salesforce/TransmogrifAI

  16. Moriconi, R., Deisenroth, M.P., Sesh Kumar, K.S.: High-dimensional Bayesian optimization using low-dimensional feature spaces. Mach. Learn. 109(9–10), 1925–1943 (2020). https://doi.org/10.1007/s10994-020-05899-z


  17. Olson, R.S., Bartley, N., Urbanowicz, R.J., Moore, J.H.: Evaluation of a tree-based pipeline optimization tool for automating data science. In: Proceedings of the Genetic and Evolutionary Computation Conference 2016, pp. 485–492 (2016)


  18. Parker, C.: Unexpected challenges in large scale machine learning. In: Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, BigMine 2012, pp. 1–6. Association for Computing Machinery, New York (2012). https://doi.org/10.1145/2351316.2351317

  19. Pavlyshenko, B.: Using stacking approaches for machine learning models. In: 2018 IEEE Second International Conference on Data Stream Mining and Processing (DSMP), pp. 255–258 (2018). https://doi.org/10.1109/DSMP.2018.8478522

  20. Pei, S., Kim, M.S., Gaudiot, J.L.: Extending Amdahl’s law for heterogeneous multicore processor with consideration of the overhead of data preparation. IEEE Embed. Syst. Lett. 8(1), 26–29 (2016). https://doi.org/10.1109/LES.2016.2519521


  21. Polley, E.C., van der Laan, M.J.: Super learner in prediction. U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 266 (2010). https://biostats.bepress.com/ucbbiostat/paper266

  22. Sharma, S.R., Singh, B., Kaur, M.: A novel approach of ensemble methods using the stacked generalization for high-dimensional datasets. IETE J. Res. 1–16 (2022). https://doi.org/10.1080/03772063.2022.2028582

  23. Song, H., Triguero, I., Özcan, E.: A review on the self and dual interactions between machine learning and optimisation. Progr. Artif. Intell. 8, 1–23 (2019)


  24. Soper, D.S.: Greed is good: rapid hyperparameter optimization and model selection using greedy k-fold cross validation. Electronics 10(16), 1973 (2021). https://doi.org/10.3390/electronics10161973


  25. Swersky, K., Snoek, J., Adams, R.P.: Multi-task Bayesian optimization. In: Burges, C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K. (eds.) Advances in Neural Information Processing Systems, vol. 26. Curran Associates, Inc. (2013)


  26. Thornton, C., Hutter, F., Hoos, H.H., Leyton-Brown, K.: Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 847–855 (2013)


  27. Vanschoren, J.: Meta-learning. In: Hutter et al. [7], pp. 39–68 (2018)


  28. Waring, J., Lindvall, C., Umeton, R.: Automated machine learning: review of the state-of-the-art and opportunities for healthcare. Artif. Intell. Med. 104, 101822 (2020). https://doi.org/10.1016/j.artmed.2020.101822

  29. Yao, Q., et al.: Taking human out of learning applications: a survey on automated machine learning. CoRR (2019)


  30. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, HotCloud 2010, p. 10. USENIX Association, USA (2010)



Acknowledgement

This work is supported by projects A-TIC-434-UGR20 and PID2020-119478GB-I00.

Author information


Corresponding author

Correspondence to Isaac Triguero.


Copyright information

© 2023 IFIP International Federation for Information Processing

About this paper


Cite this paper

Dave, R., Angarita-Zapata, J.S., Triguero, I. (2023). Hyper-Stacked: Scalable and Distributed Approach to AutoML for Big Data. In: Holzinger, A., Kieseberg, P., Cabitza, F., Campagner, A., Tjoa, A.M., Weippl, E. (eds) Machine Learning and Knowledge Extraction. CD-MAKE 2023. Lecture Notes in Computer Science, vol 14065. Springer, Cham. https://doi.org/10.1007/978-3-031-40837-3_6


  • DOI: https://doi.org/10.1007/978-3-031-40837-3_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-40836-6

  • Online ISBN: 978-3-031-40837-3

  • eBook Packages: Computer Science; Computer Science (R0)
