A review on preprocessing algorithm selection with meta-learning

  • Review
  • Published in: Knowledge and Information Systems

Abstract

Several AutoML tools aim to make machine learning algorithms easier to use by automatically recommending algorithms through techniques such as meta-learning, grid search, and genetic programming. However, the preprocessing step is usually not well handled by these tools. In this work, we therefore present a systematic review of preprocessing algorithm selection with meta-learning, aiming to establish the state of the art in this field. To perform this task, we retrieved 450 references, of which we selected 37 to be evaluated and analyzed according to a previously defined set of questions. We thereby identified what has been published on the subject; the topics most often addressed in those works; the most frequently recommended preprocessing algorithms; the features most often used to extract information for meta-learning; the machine learning algorithms employed as meta-learners and base-learners; and the performance metrics chosen as the target of the applications.
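To make the underlying idea concrete, the sketch below (our illustration, not a method taken from any of the reviewed works) shows one way meta-learning can recommend a preprocessing algorithm: each dataset is summarized by a handful of simple meta-features, labeled with whichever candidate preprocessor most improves a fixed base-learner under cross-validation, and a meta-learner is then trained to predict that label for a new dataset. It assumes scikit-learn and NumPy; the meta-features, the two candidate preprocessors (standardization and univariate feature selection), and the kNN base-learner are illustrative choices only.

```python
# Illustrative sketch of meta-learning for preprocessing recommendation.
# All design choices (meta-features, candidates, base-learner) are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

CANDIDATES = {  # candidate preprocessing algorithms to recommend among
    "standard_scaler": StandardScaler(),
    "select_k_best": SelectKBest(f_classif, k=5),
}

def meta_features(X, y):
    """A tiny, hypothetical meta-feature set: size, dimensionality, imbalance, spread."""
    n, d = X.shape
    _, counts = np.unique(y, return_counts=True)
    return [n, d, counts.min() / counts.max(), X.std(axis=0).mean()]

def best_preprocessor(X, y):
    """Label a dataset with the candidate that maximizes CV accuracy of a kNN base-learner."""
    scores = {
        name: cross_val_score(
            Pipeline([("prep", prep), ("knn", KNeighborsClassifier())]), X, y, cv=3
        ).mean()
        for name, prep in CANDIDATES.items()
    }
    return max(scores, key=scores.get)

# Build the meta-dataset from a pool of synthetic classification problems.
rng = np.random.RandomState(0)
meta_X, meta_y = [], []
for i in range(30):
    X, y = make_classification(
        n_samples=rng.randint(100, 400),
        n_features=rng.randint(10, 30),
        n_informative=5,
        weights=[rng.uniform(0.5, 0.9)],
        random_state=i,
    )
    meta_X.append(meta_features(X, y))
    meta_y.append(best_preprocessor(X, y))

# Train the meta-learner and recommend a preprocessor for an unseen dataset.
meta_learner = RandomForestClassifier(random_state=0).fit(meta_X, meta_y)
X_new, y_new = make_classification(n_samples=250, n_features=20, n_informative=5, random_state=99)
print("recommended:", meta_learner.predict([meta_features(X_new, y_new)])[0])
```

The approaches surveyed in this review follow the same overall structure but differ in which meta-features, candidate preprocessing algorithms, meta-learners, base-learners, and performance metrics they adopt.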

Notes

  1. http://www.scopus.com/.

  2. http://www.webofscience.com/.

  3. http://ieeexplore.ieee.org/.

  4. http://dl.acm.org/.

  5. We considered a work replicable if it contained all the information needed to implement the proposed method and obtain similar results.

Acknowledgements

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior—Brasil (CAPES)—Finance Code 001.

Author information

Contributions

P.B.P. conducted the systematic literature review and also wrote and reviewed the manuscript. A.R. helped to write and review the manuscript. A.C.P.L.F.d.C. reviewed the manuscript. L.P.F.G. supervised the systematic literature review and reviewed the manuscript.

Corresponding author

Correspondence to Pedro B. Pio.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A Extended data

See Table 3.

Table 3 How each work was evaluated according to the 11 binary questions presented in the planning phase

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Pio, P.B., Rivolli, A., Carvalho, A.C.P.L.F.d. et al. A review on preprocessing algorithm selection with meta-learning. Knowl Inf Syst 66, 1–28 (2024). https://doi.org/10.1007/s10115-023-01970-y
