Abstract
Datasets keep growing in size, and missing values in them pose a serious problem for data analysts. Although researchers have developed various techniques to handle missing values in different kinds of datasets, little effort has been devoted to missing values in mixed attributes of large datasets. This paper proposes novel strategies for dealing with this issue. To reduce computational complexity, the significant attributes (covariates) required for imputation are first selected using the gain ratio measure. Because continuous attributes are complex to analyze during imputation, they are first discretized using a novel method called Bayesian classifier-based discretization. Missing values are then imputed using the Bayesian max–min ant colony optimization algorithm, which hybridizes ant colony optimization (ACO) with Bayesian principles. A local search technique is also introduced into the ACO implementation to improve its exploitative capability. The proposed methodology is applied to real datasets with missing rates ranging from 5 to 50%, and the experimental results show that the proposed discretization and imputation algorithms produce better results than the existing methods.
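As an illustration of the covariate selection step mentioned in the abstract, the following is a minimal sketch of ranking discrete attributes by gain ratio with respect to the attribute to be imputed. It uses the textbook definition (information gain divided by split information) and is not taken from the paper. The function names, the column-dictionary layout, the `top_k` cutoff, and the choice to skip rows containing `None` are all assumptions made for this example.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(attribute, target):
    """Gain ratio of `attribute` with respect to `target`.

    Assumption: rows where either value is None (missing) are skipped.
    """
    pairs = [(a, t) for a, t in zip(attribute, target)
             if a is not None and t is not None]
    if not pairs:
        return 0.0
    # Group the target labels by attribute value.
    groups = {}
    for a, t in pairs:
        groups.setdefault(a, []).append(t)
    # Expected entropy of the target after splitting on the attribute.
    conditional = sum(len(g) / len(pairs) * entropy(g) for g in groups.values())
    info_gain = entropy([t for _, t in pairs]) - conditional
    # Split information penalizes attributes with many distinct values.
    split_info = entropy([a for a, _ in pairs])
    return info_gain / split_info if split_info > 0 else 0.0


def select_covariates(columns, target_name, top_k):
    """Return the names of the `top_k` attributes with the highest gain ratio."""
    target = columns[target_name]
    scores = {name: gain_ratio(col, target)
              for name, col in columns.items() if name != target_name}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical toy data; None marks a missing value.
columns = {
    "outlook": ["sunny", "sunny", "rain", "rain", None, "rain"],
    "windy": ["y", "n", "y", "n", "y", "n"],
    "play": ["no", "no", "yes", "yes", "no", None],
}
selected = select_covariates(columns, "play", top_k=1)
```

In this toy data `outlook` determines `play` on every complete row, so it ranks above `windy`. Continuous attributes would need to be discretized before a score like this could be computed for them.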
Cite this article
Rajappan, S., Rangasamy, D. Estimation of incomplete values in heterogeneous attribute large datasets using discretized Bayesian max–min ant colony optimization. Knowl Inf Syst 56, 309–334 (2018). https://doi.org/10.1007/s10115-017-1123-4
DOI: https://doi.org/10.1007/s10115-017-1123-4