Abstract
The algorithm selection problem is defined as identifying the best-performing machine learning (ML) algorithm for a given combination of dataset, task, and evaluation measure. The human expertise required to evaluate the growing number of available ML algorithms has created a need to automate the algorithm selection task. Various approaches have emerged to address this challenge, among them meta-learning, a popular approach that leverages accumulated experience for future learning and typically involves dataset characterization. Existing meta-learning methods often represent a dataset using predefined features and thus cannot be generalized across different ML tasks, or, alternatively, learn a dataset's representation in a supervised manner and are therefore unable to handle unsupervised tasks. In this study, we propose a novel learning-based, task-agnostic method for producing dataset representations. We then introduce TRIO, a meta-learning approach that utilizes the proposed dataset representations to accurately recommend top-performing algorithms for previously unseen datasets. TRIO first learns a graphical representation of each dataset, using four learners to capture the latent interactions among the dataset's instances, and then applies a graph convolutional neural network to extract an embedding representation from the resulting graph. We extensively evaluate the effectiveness of our approach on 337 datasets and 195 ML algorithms, demonstrating that TRIO significantly outperforms state-of-the-art algorithm selection methods for both supervised (classification and regression) and unsupervised (clustering) tasks.
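The pipeline described above can be sketched compactly. The snippet below is a minimal, illustrative Python sketch, not the paper's implementation: scikit-learn's RandomTreesEmbedding stands in for the four learners TRIO uses to capture instance co-occurrence, the graph convolutional layers are untrained random projections (TRIO trains its network), and recommendation is reduced to a nearest-neighbour lookup over stored dataset embeddings. The function names (dataset_embedding, recommend) and all modelling choices are assumptions made for illustration.

```python
# Illustrative TRIO-style pipeline (a sketch, not the authors' code):
# 1) capture latent instance interactions with an unsupervised tree ensemble,
# 2) build a graph from leaf co-occurrence,
# 3) propagate features through GCN-style layers and mean-pool into one vector,
# 4) recommend algorithms by nearest-neighbour search in embedding space.
import numpy as np
from sklearn.ensemble import RandomTreesEmbedding

def dataset_embedding(X: np.ndarray, dim: int = 300, n_layers: int = 4,
                      seed: int = 0) -> np.ndarray:
    """Embed a dataset (n_samples x n_features) as a single vector."""
    # One-hot leaf indicators per tree; instances sharing a leaf are related.
    leaves = RandomTreesEmbedding(n_estimators=50, random_state=seed).fit_transform(X)
    A = (leaves @ leaves.T).toarray()       # leaf co-occurrence counts (dense
    np.fill_diagonal(A, 0)                  # adjacency: fine for small data)
    A = A + np.eye(A.shape[0])              # add self-loops
    d = A.sum(axis=1)
    A_hat = A / np.sqrt(np.outer(d, d))     # symmetric normalization D^-1/2 A D^-1/2
    rng = np.random.default_rng(seed)
    H = X.astype(float)
    for _ in range(n_layers):               # untrained GCN-style propagation
        W = rng.normal(scale=H.shape[1] ** -0.5, size=(H.shape[1], dim))
        H = np.maximum(A_hat @ H @ W, 0.0)  # ReLU(A_hat @ H @ W)
    return H.mean(axis=0)                   # mean-pool nodes -> dataset vector

def recommend(new_X: np.ndarray, meta_embeddings: np.ndarray,
              best_algorithms: list, k: int = 1) -> list:
    """Return the top algorithm(s) of the most similar known dataset(s)."""
    q = dataset_embedding(new_X)
    dists = np.linalg.norm(meta_embeddings - q, axis=1)
    return [best_algorithms[i] for i in np.argsort(dists)[:k]]
```

Because the tree ensemble and the graph construction never touch labels, the same embedding procedure applies unchanged to classification, regression, and clustering datasets, which is the task-agnostic property the abstract emphasizes.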
Notes
Numbers of graph convolutional layers in {2, 3, 4, 5, 6} and embedding dimensions in {50, 100, 200, 300, 400, 500} were tested; four layers and an embedding dimension of 300 were found to produce the best results with reasonable efficiency across all models, as sketched below.
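The sweep above amounts to a simple grid search over the two hyperparameters. The sketch below assumes a hypothetical evaluate function standing in for training and validating the GCN under each configuration; its placeholder body merely reproduces the reported optimum.

```python
# Sketch of the hyperparameter sweep from the note above. `evaluate` is a
# hypothetical placeholder; in practice it would train the GCN with the given
# configuration and return a validation score for algorithm recommendation.
from itertools import product

def evaluate(n_layers: int, emb_dim: int) -> float:
    # Placeholder scoring function -- replace with real training/validation.
    return -abs(n_layers - 4) - abs(emb_dim - 300) / 100.0

grid = product([2, 3, 4, 5, 6], [50, 100, 200, 300, 400, 500])
best_layers, best_dim = max(grid, key=lambda cfg: evaluate(*cfg))
print(best_layers, best_dim)  # -> 4 300, matching the reported configuration
```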
Cite this article
Cohen-Shapira, N., Rokach, L. Learning dataset representation for automatic machine learning algorithm selection. Knowl Inf Syst 64, 2599–2635 (2022). https://doi.org/10.1007/s10115-022-01716-2