A review on preprocessing algorithm selection with meta-learning

  • Review
  • Published in: Knowledge and Information Systems

Abstract

Several AutoML tools aim to make machine learning algorithms easier to use by automatically recommending algorithms through techniques such as meta-learning, grid search, and genetic programming. However, the preprocessing step is usually not well handled by these tools. In this work, we therefore present a systematic review of preprocessing algorithm selection with meta-learning, aiming to establish the state of the art in this field. To perform this task, we retrieved 450 references, of which we selected 37 to be evaluated and analyzed according to a previously defined set of questions. We thereby identified what has been published on the subject; the topics most often addressed in those works; the most frequently recommended preprocessing algorithms; the features most often used to extract information for meta-learning; the machine learning algorithms employed as meta-learners and base-learners; and the performance metrics chosen as the target of the applications.
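To make the underlying idea concrete, the sketch below (our illustration, not a method taken from any of the reviewed works) shows one way meta-learning can recommend a preprocessing algorithm: each dataset is summarized by a handful of simple meta-features, labeled with whichever candidate preprocessor most improves a fixed base-learner under cross-validation, and a meta-learner is then trained to predict that label for a new dataset. It assumes scikit-learn and NumPy; the meta-features, the two candidate preprocessors (standardization and univariate feature selection), and the kNN base-learner are illustrative choices only.

```python
# Illustrative sketch of meta-learning for preprocessing recommendation.
# All design choices (meta-features, candidates, base-learner) are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

CANDIDATES = {  # candidate preprocessing algorithms to recommend among
    "standard_scaler": StandardScaler(),
    "select_k_best": SelectKBest(f_classif, k=5),
}

def meta_features(X, y):
    """A tiny, hypothetical meta-feature set: size, dimensionality, imbalance, spread."""
    n, d = X.shape
    _, counts = np.unique(y, return_counts=True)
    return [n, d, counts.min() / counts.max(), X.std(axis=0).mean()]

def best_preprocessor(X, y):
    """Label a dataset with the candidate that maximizes CV accuracy of a kNN base-learner."""
    scores = {
        name: cross_val_score(
            Pipeline([("prep", prep), ("knn", KNeighborsClassifier())]), X, y, cv=3
        ).mean()
        for name, prep in CANDIDATES.items()
    }
    return max(scores, key=scores.get)

# Build the meta-dataset from a pool of synthetic classification problems.
rng = np.random.RandomState(0)
meta_X, meta_y = [], []
for i in range(30):
    X, y = make_classification(
        n_samples=rng.randint(100, 400),
        n_features=rng.randint(10, 30),
        n_informative=5,
        weights=[rng.uniform(0.5, 0.9)],
        random_state=i,
    )
    meta_X.append(meta_features(X, y))
    meta_y.append(best_preprocessor(X, y))

# Train the meta-learner and recommend a preprocessor for an unseen dataset.
meta_learner = RandomForestClassifier(random_state=0).fit(meta_X, meta_y)
X_new, y_new = make_classification(n_samples=250, n_features=20, n_informative=5, random_state=99)
print("recommended:", meta_learner.predict([meta_features(X_new, y_new)])[0])
```

The approaches surveyed in this review follow the same overall structure but differ in which meta-features, candidate preprocessing algorithms, meta-learners, base-learners, and performance metrics they adopt.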

Notes

  1. http://www.scopus.com/.

  2. http://www.webofscience.com/.

  3. http://ieeexplore.ieee.org/.

  4. http://dl.acm.org/.

  5. We considered a work replicable if it contained all the information needed to implement the proposed method and obtain similar results.

Acknowledgements

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior—Brasil (CAPES)—Finance Code 001.

Author information

Contributions

P.B.P. conducted the systematic literature review and also wrote and reviewed the manuscript. A.R. helped to write and review the manuscript. A.C.P.L.F.d.C. reviewed the manuscript. L.P.F.G. supervised the systematic literature review and reviewed the manuscript.

Corresponding author

Correspondence to Pedro B. Pio.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A Extended data

See Table 3.

Table 3 How each work was evaluated according to the 11 binary questions presented in the planning phase

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Pio, P.B., Rivolli, A., Carvalho, A.C.P.L.F.d. et al. A review on preprocessing algorithm selection with meta-learning. Knowl Inf Syst 66, 1–28 (2024). https://doi.org/10.1007/s10115-023-01970-y
