
Learning dataset representation for automatic machine learning algorithm selection

  • Regular paper
  • Published in: Knowledge and Information Systems

Abstract

The algorithm selection problem is defined as identifying the best-performing machine learning (ML) algorithm for a given combination of dataset, task, and evaluation measure. The human expertise required to evaluate the growing number of available ML algorithms has created a need to automate the algorithm selection task. Various approaches have emerged to address this challenge, including meta-learning, a popular approach that leverages accumulated experience for future learning and typically involves dataset characterization. Existing meta-learning methods often represent a dataset using predefined features and thus cannot generalize across different ML tasks, or alternatively learn a dataset’s representation in a supervised manner and are therefore unable to handle unsupervised tasks. In this study, we propose a novel learning-based, task-agnostic method for producing dataset representations. We then introduce TRIO, a meta-learning approach that utilizes the proposed dataset representations to accurately recommend top-performing algorithms for previously unseen datasets. TRIO first learns graphical representations of the datasets, using four tools to capture the latent interactions among dataset instances, and then applies a graph convolutional neural network to extract embedding representations from the resulting graphs. We extensively evaluate the effectiveness of our approach on 337 datasets and 195 ML algorithms, demonstrating that TRIO significantly outperforms state-of-the-art methods for algorithm selection in both supervised (classification and regression) and unsupervised (clustering) tasks.
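
A minimal sketch of the pipeline the abstract describes, assuming Python with scikit-learn and NumPy; it is not the authors' implementation. A tree ensemble's leaf co-occurrence captures latent interactions among instances, the resulting affinity graph is embedded (SpectralEmbedding stands in for the paper's graph convolutional network), and a nearest-neighbor lookup over a hypothetical performance table, meta_scores, mimics the meta-level recommendation:

    import numpy as np
    from sklearn.datasets import load_iris, load_wine
    from sklearn.ensemble import RandomTreesEmbedding
    from sklearn.manifold import SpectralEmbedding

    def instance_affinity(X, n_trees=100, seed=0):
        """Fraction of trees in which two instances share a leaf."""
        leaves = RandomTreesEmbedding(n_estimators=n_trees,
                                      random_state=seed).fit(X).apply(X)
        A = np.zeros((len(X), len(X)))
        for t in range(n_trees):
            A += leaves[:, t][:, None] == leaves[:, t][None, :]
        np.fill_diagonal(A, 0.0)
        return A / n_trees

    def dataset_embedding(X, dim=8):
        """Mean-pool instance-level graph embeddings into one dataset vector.
        The paper uses a GCN; spectral embedding is a lightweight stand-in."""
        Z = SpectralEmbedding(n_components=dim,
                              affinity="precomputed").fit_transform(instance_affinity(X))
        return Z.mean(axis=0)

    # Meta level: rank algorithms by their (hypothetical) recorded performance
    # on the most similar previously seen dataset.
    meta_scores = {"iris": {"rf": 0.95, "svm": 0.97, "knn": 0.96}}
    seen = {"iris": dataset_embedding(load_iris(return_X_y=True)[0])}

    X_new = load_wine(return_X_y=True)[0]          # the "unseen" dataset
    e_new = dataset_embedding(X_new)
    nearest = min(seen, key=lambda n: np.linalg.norm(seen[n] - e_new))
    ranking = sorted(meta_scores[nearest].items(), key=lambda kv: -kv[1])
    print(nearest, ranking)                        # top performers first

In the paper itself, a GCN produces the embeddings and a meta-model trained on many (dataset, algorithm, performance) records replaces this single-neighbor lookup; the sketch only mirrors the task-agnostic structure, in which labels are never used to build the representation.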



Notes

  1. https://www.h2o.ai/.

  2. Graph convolutional layer counts of {2, 3, 4, 5, 6} and embedding dimensions of {50, 100, 200, 300, 400, 500} were tested; 4 layers and a dimension of 300 were found to produce the best results with reasonable efficiency across all models (a sketch of this sweep follows these notes).

  3. https://archive.ics.uci.edu/ml/datasets.php.

  4. https://www.openml.org.

  5. https://sci2s.ugr.es/keel/datasets.php.

  6. https://www.kaggle.com.

  7. https://cutt.ly/hnIUjmP.
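
A hedged sketch of the hyperparameter sweep described in note 2; evaluate below is a toy stand-in for training the GCN with a given configuration and scoring it on validation data, rigged only so the example reproduces the reported outcome:

    from itertools import product

    def evaluate(n_layers: int, dim: int) -> float:
        # Toy stand-in: the real pipeline trains a GCN with this configuration
        # and measures validation recommendation quality.
        return -abs(n_layers - 4) - abs(dim - 300) / 1000.0

    grid = product([2, 3, 4, 5, 6], [50, 100, 200, 300, 400, 500])
    best_layers, best_dim = max(grid, key=lambda cfg: evaluate(*cfg))
    print(best_layers, best_dim)  # 4 300, the configuration the note reports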


Author information

Authors

Noy Cohen-Shapira and Lior Rokach

Corresponding author

Correspondence to Noy Cohen-Shapira.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (PDF 523 KB)

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Cohen-Shapira, N., Rokach, L. Learning dataset representation for automatic machine learning algorithm selection. Knowl Inf Syst 64, 2599–2635 (2022). https://doi.org/10.1007/s10115-022-01716-2

