
Estimation of incomplete values in heterogeneous attribute large datasets using discretized Bayesian max–min ant colony optimization

  • Regular Paper
Knowledge and Information Systems

Abstract

Datasets continue to grow in size, and missing values in them pose a serious threat to data analysis. Although researchers have developed various techniques for handling missing values in different kinds of datasets, little effort has been devoted to missing values in mixed attributes of large datasets. This paper proposes novel strategies for this problem. The significant attributes (covariates) required for imputation are first selected using the gain ratio measure to reduce computational complexity. Because analyzing continuous attributes during imputation is complex, they are first discretized using a novel methodology called Bayesian classifier-based discretization. Then, the missing values are imputed using the Bayesian max–min ant colony optimization algorithm, which hybridizes ant colony optimization (ACO) with Bayesian principles. A local search technique is also introduced into the ACO implementation to improve its exploitative capability. The proposed methodology is applied to real datasets with missing rates ranging from 5 to 50%, and the experimental results show that the proposed discretization and imputation algorithms produce better results than the existing methods.
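To make the covariate selection step more concrete, the following is a minimal sketch of the standard gain ratio measure (information gain divided by split information) for a categorical attribute. It is not taken from the paper: the function names `entropy` and `gain_ratio`, the use of `None` to mark a missing value, and the choice to skip records where the attribute is missing are all assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (base 2) of a list of categorical values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(values).values())


def gain_ratio(attribute, target):
    """Gain ratio of `attribute` with respect to `target`.

    Assumption: records whose attribute value is None (missing) are
    skipped; the paper may treat them differently.
    """
    pairs = [(a, t) for a, t in zip(attribute, target) if a is not None]
    if not pairs:
        return 0.0
    total = len(pairs)

    # Group target values by attribute value.
    groups = {}
    for a, t in pairs:
        groups.setdefault(a, []).append(t)

    # Information gain = H(target) - H(target | attribute).
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy([t for _, t in pairs]) - conditional

    # Split information = entropy of the attribute itself.
    split_info = entropy([a for a, _ in pairs])
    return info_gain / split_info if split_info > 0 else 0.0
```

Under these assumptions, candidate covariates could be ranked by `gain_ratio` against the attribute to be imputed, keeping only the highest-scoring ones before the imputation step runs.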



Author information


Corresponding author

Correspondence to DeviPriya Rangasamy.


About this article


Cite this article

Rajappan, S., Rangasamy, D. Estimation of incomplete values in heterogeneous attribute large datasets using discretized Bayesian max–min ant colony optimization. Knowl Inf Syst 56, 309–334 (2018). https://doi.org/10.1007/s10115-017-1123-4


  • DOI: https://doi.org/10.1007/s10115-017-1123-4
