
Impact of preprocessing on medical data classification

  • Research Article
  • Published in Frontiers of Computer Science

Abstract

The significance of the preprocessing stage in any data mining task is well known. Before attempting medical data classification, characteristics of medical datasets, including noise, incompleteness, and the existence of multiple and possibly irrelevant features, need to be addressed. In this paper, we show that selecting the right combination of preprocessing methods has a considerable impact on the classification potential of a dataset. The preprocessing operations considered include the discretization of numeric attributes, the selection of attribute subset(s), and the handling of missing values. The classification is performed by an ant colony optimization algorithm as a case study. Experimental results on 25 real-world medical datasets show a significant relative improvement in predictive accuracy, exceeding 60% in some cases.
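As a rough illustration of the kind of preprocessing-then-classify pipeline whose combinations the paper evaluates, the following minimal Python sketch chains missing-value imputation, discretization, and attribute-subset selection ahead of a classifier. All component choices are illustrative stand-ins: the paper pairs its own preprocessing methods with an ant colony optimization classifier, which scikit-learn does not provide, so a decision tree takes its place here.

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Illustrative pipeline only; the paper's actual discretizer, feature
# selector, imputation method, and ACO-based classifier all differ.
pipeline = Pipeline([
    # 1) Handle missing values (here: mean imputation).
    ("impute", SimpleImputer(strategy="mean")),
    # 2) Discretize numeric attributes into quantile bins.
    ("discretize", KBinsDiscretizer(n_bins=5, encode="ordinal",
                                    strategy="quantile")),
    # 3) Keep a subset of attributes ranked by an information-based filter.
    ("select", SelectKBest(score_func=mutual_info_classif, k=10)),
    # 4) Classify (stand-in for the paper's ACO rule inducer).
    ("classify", DecisionTreeClassifier(random_state=0)),
])

# With X (a numeric feature matrix, possibly containing NaNs) and y (labels):
# scores = cross_val_score(pipeline, X, y, cv=10)

Swapping concrete methods in and out of each step and measuring the resulting cross-validated accuracy is the experimental pattern the abstract describes.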



Author information

Correspondence to Sarab Almuhaideb.

Additional information

Sarab Almuhaideb is a PhD student in the Department of Computer Science, King Saud University, Saudi Arabia, and a lecturer in the Department of Computer Science, Prince Sultan University, Saudi Arabia. Her research interests include machine learning, evolutionary computation, and hybrid metaheuristics.

Mohamed El Bachir Menai received his PhD degree in computer science from Mentouri University of Constantine, Algeria, and the University of Paris VIII, France, in 2005. He also received a "Habilitation universitaire" in computer science from Mentouri University of Constantine in 2007 (the highest academic qualification in Algeria, France, and Germany). He is currently a professor in the Department of Computer Science at King Saud University, Saudi Arabia. His main research interests include evolutionary computing, data mining, machine learning, natural language processing, and satisfiability problems.


About this article

Cite this article

Almuhaideb, S., Menai, M.E.B. Impact of preprocessing on medical data classification. Front. Comput. Sci. 10, 1082–1102 (2016). https://doi.org/10.1007/s11704-016-5203-5

