Abstract
By filling in missing values in datasets, imputation allows these datasets to be used with algorithms that cannot handle missing values by themselves. However, missing values may in principle contribute useful information that is lost through imputation. The missing-indicator approach can be used in combination with imputation to instead represent this information as part of the dataset. There are several theoretical arguments for why missing-indicators may or may not be beneficial, but this question has not previously been tested for machine learning predictions in any large-scale practical experiment on real-life datasets. We perform this experiment for three imputation strategies and a range of different classification algorithms, on the basis of twenty real-life datasets. In a first follow-up experiment, we determine attribute-specific missingness thresholds for each classifier above which missing-indicators are more likely than not to increase classification performance. In a second follow-up experiment, we evaluate numerical imputation of one-hot encoded categorical attributes. We reach the following conclusions. Firstly, missing-indicators generally increase classification performance. Secondly, with missing-indicators, nearest neighbour and iterative imputation do not lead to better performance than simple mean/mode imputation. Thirdly, for decision trees, pruning is necessary to prevent overfitting. Fourthly, the thresholds above which missing-indicators are more likely than not to improve performance are lower for categorical attributes than for numerical attributes. Lastly, mean imputation of numerical attributes preserves some of the information from missing values. Consequently, when not using missing-indicators, it can be advantageous to apply mean imputation to one-hot encoded categorical attributes instead of mode imputation.
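As a minimal illustration of the missing-indicator approach evaluated in this paper (a sketch, not the authors' exact pipeline): in scikit-learn, `SimpleImputer` with `add_indicator=True` appends one binary indicator column per attribute that has missing values, so a downstream classifier sees both the imputed values and the missingness pattern. The toy data below is made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy numerical data with missing values encoded as np.nan.
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 6.0]])

# Mean imputation; add_indicator=True appends one binary
# missing-indicator column per attribute with missing values.
imp = SimpleImputer(strategy="mean", add_indicator=True)
X_out = imp.fit_transform(X)

# First two columns: imputed values (column means 2.0 and 5.0
# fill the gaps); last two columns: the missing-indicators.
print(X_out)
```

The resulting array can be fed to any classifier; dropping `add_indicator=True` recovers plain mean imputation, which is the baseline the indicators are compared against.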
Notes
- 1.
Some authors use the opposite convention, letting the indicator express non-missingness.
- 2.
We are grateful to an anonymous reviewer for this example.
- 3.
This is acknowledged by authors working under the assumption of MAR, e.g. “When data are missing for reasons beyond the investigator’s control, one can never be certain whether MAR holds. The MAR hypothesis in such datasets cannot be formally tested unless the missing values, or at least a sample of them, are available from an external source.” [69].
- 4.
Presumably, they use one-hot encoding for categorical attributes, in which case zero imputation is equivalent to treating missing values as a separate category, but they do not state this explicitly.
- 5.
For categorical values, encoding missing values as a separate category; for numerical values, encoding missing values as an extremely large value that can always be split from the other values.
- 6.
The target column of the echocardiogram dataset (‘alive-at-1’) is supposed to denote whether a patient survived for at least one year, but it does not appear to agree with the columns from which it is derived, which denote how long a patient survived and whether they were alive at the end of that period. The audiology dataset has a large number of small classes with complex labels and should perhaps be analysed with multi-label classification. In addition, it has ordinal attributes where the order of the values is not entirely clear, as well as three different values that potentially denote missingness (‘?’, ‘unmeasured’ and ‘absent’) whose relationship to each other is not completely clear. The house-votes-84 dataset contains ‘?’ values, but its documentation explicitly states that these values are not unknown, but indicate different forms of abstention. The ozone dataset is a time-series problem, while the task associated with the sponge and water-treatment datasets is clustering, with no obvious target for classification among their respective attributes. Finally, the breast-cancer (9), cleveland (7), dermatology (8), lung-cancer (5), post-operative (3) and wisconsin (16) datasets contain only very few missing values, and any performance difference between missing-value approaches on these datasets may to a large extent be coincidental.
- 7.
For the nomao dataset, iterative imputation diverged, so we had to restrict imputation to the interval \([-100, 100]\).
- 8.
Setting aside 10% of the data for validation, stopping when validation loss has not decreased by at least 0.0001 for ten iterations, with a maximum of 10 000 iterations.
- 9.
LR is an exception here; we have no explanation for this.
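The equivalence claimed in note 4 above, that zero imputation of one-hot encoded categorical attributes amounts to treating missingness as a separate category, can be checked directly. The sketch below uses pandas; the attribute and its values are illustrative.

```python
import numpy as np
import pandas as pd

# An illustrative categorical attribute with one missing value.
s = pd.Series(["red", "blue", np.nan, "red"], name="colour")

# One-hot encoding; by default pd.get_dummies gives a missing
# record all zeros, which amounts to zero imputation.
zero_imputed = pd.get_dummies(s)

# Encoding missingness as its own category instead: dummy_na=True
# appends an extra indicator column for NaN.
separate = pd.get_dummies(s, dummy_na=True)

# The all-zero row already identifies the missing record uniquely:
# the extra NaN column equals one minus the row sum of the other
# columns, so the two encodings carry the same information.
recovered = 1 - zero_imputed.sum(axis=1)
assert (recovered.values == separate.iloc[:, -1].values).all()
```

Mean imputation of these one-hot columns, by contrast, replaces the all-zero row with the column-wise category frequencies, which is the variant evaluated in the second follow-up experiment.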
References
Akaike, H.: Information theory and an extension of the maximum likelihood principle. In: Proceedings of the 2nd International Symposium on Information Theory, pp. 267–281. Akadémiai Kiadó (1971)
Allison, P.D.: Missing Data. Sage Publications, Thousand Oaks (2001)
Allison, P.D.: Missing data. In: Marsden, P.V., Wright, J.D. (eds.) Handbook of Survey Research, 2nd edn., chap. 20, pp. 631–657. Emerald Group Publishing, Bingley (2010)
Anderson, A.B., Basilevsky, A., Hum, D.P.J.: Missing data: a review of the literature. In: Rossi, P.H., Wright, J.D., Anderson, A.B. (eds.) Handbook of Survey Research. Quantitive Studies in Social Relations, chap. 12, pp. 415–494. Academic Press, New York (1983)
Aste, M., Boninsegna, M., Freno, A., Trentin, E.: Techniques for dealing with incomplete data: a tutorial and survey. Pattern Anal. Appl. 18(1), 1–29 (2015)
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. The Wadsworth Statistics/Probability Series. Wadsworth, Monterey, California (1984)
van Buuren, S., Groothuis-Oudshoorn, K.: MICE: multivariate imputation by chained equations in R. J. Stat. Softw. 45(3), 1–67 (2011)
Candillier, L., Lemaire, V.: Design and analysis of the Nomao challenge: active learning in the real-world. In: ECML-PKDD 2012: Active Learning in Real-world Applications Workshop (2012)
Cestnik, B., Kononenko, I., Bratko, I.: ASSISTANT 86: a knowledge-elicitation tool for sophisticated users. In: EWSL 87: Proceedings of the 2nd European Working Session on Learning, pp. 31–45. Sigma Press (1987)
Chow, W.K.: A look at various estimators in logistic models in the presence of missing values. Technical report, N-1324-HEW. Rand Corporation, Santa Monica (1979)
Cohen, J.: Multiple regression as a general data-analytic system. Psychol. Bull. 70(6), 426–443 (1968)
Cohen, J., Cohen, P.: Missing data. In: Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences, pp. 265–290. Lawrence Erlbaum Associates, Hillsdale (1975)
Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
Cox, D.R.: Some procedures connected with the logistic qualitative response curve. In: David, F.N. (ed.) Research Papers in Statistics: Festschrift for J. Neyman, pp. 55–71. Wiley, London (1966)
Das, S., Datta, S., Chaudhuri, B.B.: Handling data irregularities in classification: foundations, trends, and future challenges. Pattern Recogn. 81, 674–693 (2018)
Detrano, R., et al.: International application of a new probability algorithm for the diagnosis of coronary artery disease. Am. J. Cardiol. 64(5), 304–310 (1989)
Ding, Y., Simonoff, J.S.: An investigation of missing data methods for classification trees applied to binary response data. J. Mach. Learn. Res. 11(1), 131–170 (2010)
Dixon, J.K.: Pattern recognition with partly missing data. IEEE Trans. Syst. Man Cybern. 9(10), 617–621 (1979)
Dua, D., Graff, C.: UCI machine learning repository (2019). http://archive.ics.uci.edu/ml
Dudani, S.A.: The distance-weighted k-nearest-neighbor rule. IEEE Trans. Syst. Man Cybern. 6(4), 325–327 (1976)
Efron, B., Gong, G.: Statistical theory and the computer. In: Eddy, W.F. (ed.) Computer Science and Statistics: Proceedings of the 13th Symposium on the Interface, pp. 3–7. Springer, New York (1981). https://doi.org/10.1007/978-1-4613-9464-8_1
Eirola, E.: Machine learning methods for incomplete data and variable selection. Ph.D. thesis, Aalto University, Espoo (2014)
Elter, M., Schulz-Wendtland, R., Wittenberg, T.: The prediction of breast cancer biopsy outcomes using two CAD approaches that both emphasize an intelligible decision process. Med. Phys. 34(11), 4164–4172 (2007)
Enders, C.K.: Applied Missing Data Analysis. Methodology in the Social Sciences. The Guilford Press, New York (2010)
Evans, B., Fisher, D.: Overcoming process delays with decision tree induction. IEEE Expert 9(1), 60–66 (1994)
Costa, C.F., Nascimento, M.A.: IDA 2016 industrial challenge: using machine learning for predicting failures. In: Boström, H., Knobbe, A., Soares, C., Papapetrou, P. (eds.) IDA 2016. LNCS, vol. 9897, pp. 381–386. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46349-0_33
Fix, E., Hodges Jr., J.: Discriminatory analysis—nonparametric discrimination: consistency properties. Technical report, 21-49-004, USAF School of Aviation Medicine, Randolph Field, Texas (1951)
Freund, Y., Schapire, R.E.: A desicion-theoretic generalization of on-line learning and an application to boosting. In: Vitányi, P. (ed.) EuroCOLT 1995. LNCS, vol. 904, pp. 23–37. Springer, Heidelberg (1995). https://doi.org/10.1007/3-540-59119-2_166
Friedman, J.H.: Greedy function approximation: a gradient boosting machine. Ann. Stat. 29(5), 1189–1232 (2001)
Fukushima, K.: Visual feature extraction by a multilayered network of analog threshold elements. IEEE Trans. Syst. Sci. Cybern. 5(4), 322–333 (1969)
García, S., Luengo, J., Herrera, F.: Dealing with missing values. In: Data Preprocessing in Data Mining. Intelligent Systems Reference Library, vol. 72, chap. 4. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-10247-4
Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Mach. Learn. 63(1), 3–42 (2006)
Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: AISTATS 2010: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research, vol. 9, pp. 249–256. JMLR Workshop and Conference Proceedings (2010)
Golovenkin, S.E., et al.: Trajectories, bifurcations, and pseudo-time in large clinical datasets: applications to myocardial infarction and diabetes data. GigaScience 9(11), giaa128 (2020)
Graham, J.W.: Missing data analysis: making it work in the real world. Annu. Rev. Psychol. 60, 549–576 (2009)
Grzymala-Busse, J.W.: Knowledge acquisition under uncertainty-a rough set approach. J. Intell. Rob. Syst. 1(1), 3–16 (1988)
Grzymala-Busse, J.W., Hu, M.: A comparison of several approaches to missing attribute values in data mining. In: Ziarko, W., Yao, Y. (eds.) RSCTC 2000. LNCS, vol. 2005, pp. 378–385. Springer, Heidelberg (2000). https://doi.org/10.1007/3-540-45554-X_46
Güvenir, H.A., Acar, B., Demiröz, G., Çekin, A.: A supervised machine learning algorithm for arrhythmia analysis. In: Proceedings of the 24th Annual Meeting of Computers in Cardiology. Computers in Cardiology, vol. 24, pp. 433–436. IEEE (1997)
Hand, D.J., Till, R.J.: A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach. Learn. 45(2), 171–186 (2001)
van der Heijden, G.J.M.G., Donders, A.R.T., Stijnen, T., Moons, K.G.M.: Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: a clinical example. J. Clin. Epidemiol. 59(10), 1102–1109 (2006)
Hutcheson, J.D., Jr., Prather, J.E.: Interpreting the effects of missing data in survey research. Southeastern Polit. Rev. 9(2), 129–143 (1981)
Ipsen, N., Mattei, P.A., Frellsen, J.: How to deal with missing data in supervised deep learning? In: Artemiss 2020: First ICML Workshop on the Art of Learning with Missing Values (2020)
Jones, M.P.: Indicator and stratification methods for missing explanatory variables in multiple linear regression. J. Am. Stat. Assoc. 91(433), 222–230 (1996)
Josse, J., Chen, J.M., Prost, N., Varoquaux, G., Scornet, E.: On the consistency of supervised learning with missing values. Stat. Papers 65(9) (2024). https://doi.org/10.1007/s00362-024-01550-4
Kim, J.O., Curry, J.: The treatment of missing data in multivariate analysis. Sociol. Methods Res. 6(2), 215–240 (1977)
Kingma, D.P., Ba, J.L.: Adam: a method for stochastic optimization. In: ICLR 2015: 3rd International Conference on Learning Representations (2015)
Kohavi, R.: Scaling up the accuracy of naive-bayes classifiers: a decision-tree hybrid. In: KDD-96: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 202–207. AAAI Press (1996)
Le Morvan, M., Josse, J., Scornet, E., Varoquaux, G.: What’s a good imputation to predict with missing values? In: NeurIPS 2021: Proceedings of the Thirty-fifth Conference on Neural Information Processing Systems. Advances in Neural Information Processing Systems, vol. 34, pp. 11530–11540. NIPS Foundation (2021)
Luengo, J., García, S., Herrera, F.: On the choice of the best imputation methods for missing values considering three groups of classification methods. Knowl. Inf. Syst. 32(1), 77–108 (2012)
Luengo, J., Sáez, J.A., Herrera, F.: Missing data imputation for fuzzy rule-based classification systems. Soft. Comput. 16(5), 863–881 (2012)
Marlin, B.M.: Missing data problems in machine learning. Ph.D. thesis, University of Toronto (2008)
McCann, M., Li, Y., Maguire, L., Johnston, A.: Causality challenge: benchmarking relevant signal components for effective monitoring and process control. In: NIPS 2008: Proceedings of Workshop on Causality. Proceedings of Machine Learning Research, vol. 6, pp. 277–288. JMLR Workshop and Conference Proceedings (2008)
McLeish, M., Cecile, M.: Enhancing medical expert systems with knowledge obtained from statistical data. Ann. Math. Artif. Intell. 2(1–4), 261–276 (1990)
Michalski, R.S., Chilausky, R.L.: Learning by being told and learning from examples: an experimental comparison of the two methods of knowledge acquisition in the context of developing an expert system for soybean disease diagnosis. Int. J. Policy Anal. Inf. Syst. 4(2), 125–161 (1980)
Ng, C.G., Yusoff, M.S.B.: Missing values in data analysis: ignore or impute? Educ. Med. J. 3(1) (2011)
Orme, J.G., Reis, J.: Multiple regression with missing data. J. Soc. Serv. Res. 15(1–2), 61–91 (1991)
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12(85), 2825–2830 (2011)
Pereira Barata, A., Takes, F.W., van den Herik, H.J., Veenman, C.J.: Imputation methods outperform missing-indicator for data missing completely at random. In: ICDM 2019: Proceedings of the Workshops, pp. 407–414. IEEE (2019)
Perez-Lebel, A., Varoquaux, G., Le Morvan, M., Josse, J., Poline, J.B.: Benchmarking missing-values approaches for predictive models on health databases. GigaScience 11(1), giac013 (2022)
Pigott, T.D.: A review of methods for missing data. Educ. Res. Eval. 7(4), 353–383 (2001)
Quinlan, J.R.: Simplifying decision trees. Int. J. Man Mach. Stud. 27(3), 221–234 (1987)
Quinlan, J.R.: Unknown attribute values in induction. In: Proceedings of the Sixth International Workshop on Machine Learning, pp. 164–168. Morgan Kaufmann (1989)
Quinlan, J.R., Compton, P.J., Horn, K.A., Lazarus, L.: Inductive knowledge acquisition: a case study. In: Proceedings of the Second Australian Conference on Applications of Expert Systems, pp. 157–173. Turing Institute Press (1986)
Rosenblatt, F.: Principles of neurodynamics—perceptrons and the theory of brain mechanisms. Technical report VG-1196-G-8, Cornell Aeronautical Laboratory, Buffalo, New York (1961)
Rubin, D.B.: Inference and missing data. Biometrika 63(3), 581–592 (1976)
Rubini, L.J., Eswaran, P.: Generating comparative analysis of early stage prediction of chronic kidney disease. Int. J. Mod. Eng. Res. 5(7), 49–55 (2015)
Santos, M.S., Abreu, P.H., García-Laencina, P.J., Simão, A., Carvalho, A.: A new cluster-based oversampling method for improving survival prediction of hepatocellular carcinoma patients. J. Biomed. Inform. 58, 49–59 (2015)
Schafer, J.L.: Analysis of Incomplete Multivariate Data, Monographs on Statistics and Applied Probability, vol. 72. Chapman & Hall, London (1997)
Schafer, J.L., Graham, J.W.: Missing data: our view of the state of the art. Psychol. Methods 7(2), 147–177 (2002)
Schlimmer, J.C.: Concept acquisition through representational adjustment. Ph.D. thesis, University of California, Irvine (1987)
Śmieja, M., Struski, Ł., Tabor, J., Marzec, M.: Generalized RBF kernel for incomplete data. Knowl.-Based Syst. 173, 150–162 (2019)
Śmieja, M., Struski, Ł., Tabor, J., Zieliński, B., Spurek, P.: Processing of missing data by neural networks. In: NeurIPS 2018: Proceedings of the Thirty-second Annual Conference on Neural Information Processing Systems. Advances in Neural Information Processing Systems, vol. 31, pp. 689–696. NIPS Foundation (2018)
Soltani Zarrin, P., Röckendorf, N., Wenger, C.: In-vitro classification of saliva samples of COPD patients and healthy controls using machine learning tools. IEEE Access 8, 168053–168060 (2020)
Sperrin, M., Martin, G.P., Sisk, R., Peek, N.: Missing data should be handled differently for prediction than for description or causal explanation. J. Clin. Epidemiol. 125, 183–187 (2020)
Stumpf, S.A.: A note on handling missing data. J. Manag. 4(1), 65–73 (1978)
Tipping, M.E.: Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res. 1, 211–244 (2001)
Tresp, V., Neuneier, R., Ahmad, S.: Efficient methods for dealing with missing data in supervised learning. In: NIPS-94: Proceedings of the Eighth Annual Conference on Neural Information Processing Systems. Advances in Neural Information Processing Systems, vol. 7, pp. 689–696. MIT Press (1994)
Troyanskaya, O., et al.: Missing value estimation methods for DNA microarrays. Bioinformatics 17(6), 520–525 (2001)
Twala, B.E., Jones, M., Hand, D.J.: Good methods for coping with missing data in decision trees. Pattern Recogn. Lett. 29(7), 950–956 (2008)
Vamplew, P., Adams, A.: Missing values in a backpropagation neural net. In: ACNN 1992: Proceedings of the Third Australian Conference on Neural Networks, pp. 64–66. Sydney University Electrical Engineering (1992)
Wilcoxon, F.: Individual comparisons by ranking methods. Biomet. Bull. 1(6), 80–83 (1945)
Zhu, J., Zou, H., Rosset, S., Hastie, T.: Multi-class AdaBoost. Stat. Interface 2(3), 349–360 (2009)
Acknowledgement
The research reported in this paper was conducted with the financial support of the Odysseus programme of the Research Foundation – Flanders (FWO). This publication is part of the project Digital Twin with project number P18-03 of the research programme TTW Perspective, which is (partly) financed by the Dutch Research Council (NWO). We would like to express our thanks to Geert van der Heijden for answering a question about [41].
Ethics declarations
Data and Code
Datasets and the code to reproduce our experiments are available at https://cwi.ugent.be/~oulenz/code/lenz-2024-no.tar.gz.
Disclosure of Interests
The authors have no competing interests to declare that are relevant to the content of this article.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Lenz, O.U., Peralta, D., Cornelis, C. (2025). No Imputation Without Representation. In: Oliehoek, F.A., Kok, M., Verwer, S. (eds) Artificial Intelligence and Machine Learning. BNAIC/Benelearn 2023. Communications in Computer and Information Science, vol 2187. Springer, Cham. https://doi.org/10.1007/978-3-031-74650-5_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-74649-9
Online ISBN: 978-3-031-74650-5