
No Imputation Without Representation

  • Conference paper
  • In: Artificial Intelligence and Machine Learning (BNAIC/Benelearn 2023)

Abstract

By filling in missing values in datasets, imputation allows these datasets to be used with algorithms that cannot handle missing values by themselves. However, missing values may in principle contribute useful information that is lost through imputation. The missing-indicator approach can be used in combination with imputation to instead represent this information as a part of the dataset. There are several theoretical considerations why missing-indicators may or may not be beneficial, but there has not been any large-scale practical experiment on real-life datasets to test this question for machine learning predictions. We perform this experiment for three imputation strategies and a range of different classification algorithms, on the basis of twenty real-life datasets. In a follow-up experiment, we determine attribute-specific missingness thresholds for each classifier above which missing-indicators are more likely than not to increase classification performance. And in a second follow-up experiment, we evaluate numerical imputation of one-hot encoded categorical attributes. We reach the following conclusions. Firstly, missing-indicators generally increase classification performance. Secondly, with missing-indicators, nearest neighbour and iterative imputation do not lead to better performance than simple mean/mode imputation. Thirdly, for decision trees, pruning is necessary to prevent overfitting. Fourthly, the thresholds above which missing-indicators are more likely than not to improve performance are lower for categorical attributes than for numerical attributes. Lastly, mean imputation of numerical attributes preserves some of the information from missing values. Consequently, when not using missing-indicators it can be advantageous to apply mean imputation to one-hot encoded categorical attributes instead of mode imputation.
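As a minimal illustration of the missing-indicator approach combined with mean imputation, the following scikit-learn sketch (illustrative toy data, not the paper's actual pipeline) shows how an indicator column per incomplete attribute preserves the missingness information that imputation alone would discard:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data: both attributes contain missing values.
X = np.array([[1.0, np.nan],
              [2.0, 4.0],
              [np.nan, 6.0]])

# add_indicator=True appends one binary missing-indicator column
# per attribute that contains missing values.
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_out = imputer.fit_transform(X)

print(X_out.shape)  # (3, 4): two imputed columns plus two indicators
```

The indicator columns can then be passed to any classifier alongside the imputed attributes.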


Notes

  1. Some authors use the opposite convention, letting the indicator express non-missingness.

  2. We are grateful to an anonymous reviewer for this example.

  3. This is acknowledged by authors working under the assumption of MAR, e.g. “When data are missing for reasons beyond the investigator’s control, one can never be certain whether MAR holds. The MAR hypothesis in such datasets cannot be formally tested unless the missing values, or at least a sample of them, are available from an external source.” [69].

  4. Presumably, they use one-hot encoding for categorical attributes, in which case zero imputation is equivalent to treating missing values as a separate category, but they do not state this explicitly.

  5. For categorical values, encoding missing values as a separate category; for numerical values, encoding missing values as an extremely large value that can always be split from the other values.

  6. The target column of the echocardiogram dataset (‘alive-at-1’) is supposed to denote whether a patient survived for at least one year, but it does not appear to agree with the columns from which it is derived, which denote how long a patient survived and whether they were alive at the end of that period. The audiology dataset has a large number of small classes with complex labels and should perhaps be analysed with multi-label classification. In addition, it has ordinal attributes where the order of the values is not entirely clear, as well as three different values that potentially denote missingness (‘?’, ‘unmeasured’ and ‘absent’) whose relation to each other is not completely clear. The house-votes-84 dataset contains ‘?’ values, but its documentation explicitly states that these values are not unknown, but indicate different forms of abstention. The ozone dataset is a time-series problem, while the task associated with the sponge and water-treatment datasets is clustering, with no obvious target for classification among their respective attributes. Finally, the breast-cancer (9), cleveland (7), dermatology (8), lung-cancer (5), post-operative (3) and wisconsin (16) datasets contain only very few missing values, and any performance difference between missing value approaches on these datasets may to a large extent be coincidental.

  7. For the nomao dataset, iterative imputation diverged, so we had to restrict imputation to the interval [−100, 100].

  8. Setting aside 10% of the data for validation, stopping when validation loss has not decreased by at least 0.0001 for ten iterations, with a maximum of 10 000 iterations.

  9. LR is an exception here; we have no explanation for this.
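The equivalence observed in note 4 — zero imputation of one-hot encoded categorical attributes amounts to treating missingness as a separate category — can be seen in a small pandas sketch (the values are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series(["red", "blue", np.nan, "red"])

# By default, get_dummies ignores NaN, so a missing entry becomes
# an all-zero row -- exactly what zero imputation of the one-hot
# columns would produce.
print(pd.get_dummies(s).sum(axis=1).tolist())  # [1, 1, 0, 1]

# dummy_na=True instead encodes missingness as its own category,
# which carries the same information as an explicit indicator.
print(pd.get_dummies(s, dummy_na=True).sum(axis=1).tolist())  # [1, 1, 1, 1]
```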
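The restriction described in note 7 can be expressed with scikit-learn's IterativeImputer, whose min_value/max_value parameters clip imputed values to an interval (a sketch on synthetic data, not the nomao dataset itself):

```python
import numpy as np
# IterativeImputer is experimental; this import activates it.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic data with roughly 20% of values missing.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X[rng.random(X.shape) < 0.2] = np.nan

# min_value/max_value clip every imputed value to [-100, 100],
# preventing the chained regressions from diverging.
imputer = IterativeImputer(min_value=-100, max_value=100, random_state=0)
X_imputed = imputer.fit_transform(X)

print(np.isnan(X_imputed).any())  # False
```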
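The early-stopping settings of note 8 roughly correspond to the following scikit-learn configuration (an approximation: MLPClassifier's early stopping monitors validation score rather than validation loss, and the toy data is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Illustrative data; the experiments used real-life datasets.
X, y = make_classification(n_samples=200, random_state=0)

clf = MLPClassifier(
    early_stopping=True,      # hold out part of the training data
    validation_fraction=0.1,  # 10% validation split
    tol=1e-4,                 # minimum improvement of 0.0001
    n_iter_no_change=10,      # patience of ten iterations
    max_iter=10_000,          # hard cap on iterations
    random_state=0,
)
clf.fit(X, y)
```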

References

  1. Akaike, H.: Information theory and an extension of the maximum likelihood principle. In: Proceedings of the 2nd International Symposium on Information Theory, pp. 267–281. Akadémiai Kiadó (1971)


  2. Allison, P.D.: Missing Data. Sage Publications, Thousand Oaks (2001)


  3. Allison, P.D.: Missing data. In: Marsden, P.V., Wright, J.D. (eds.) Handbook of Survey Research, 2nd edn., chap. 20, pp. 631–657. Emerald Group Publishing, Bingley (2010)


  4. Anderson, A.B., Basilevsky, A., Hum, D.P.J.: Missing data: a review of the literature. In: Rossi, P.H., Wright, J.D., Anderson, A.B. (eds.) Handbook of Survey Research. Quantitive Studies in Social Relations, chap. 12, pp. 415–494. Academic Press, New York (1983)


  5. Aste, M., Boninsegna, M., Freno, A., Trentin, E.: Techniques for dealing with incomplete data: a tutorial and survey. Pattern Anal. Appl. 18(1), 1–29 (2015)


  6. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)


  7. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. The Wadsworth Statistics/probability Series. Wadsworth, Monterey, California (1984)


  8. van Buuren, S., Groothuis-Oudshoorn, K.: MICE: multivariate imputation by chained equations in R. J. Stat. Softw. 45(3), 1–67 (2011)


  9. Candillier, L., Lemaire, V.: Design and analysis of the Nomao challenge: active learning in the real-world. In: ECML-PKDD 2012: Active Learning in Real-world Applications Workshop (2012)


  10. Cestnik, B., Kononenko, I., Bratko, I.: ASSISTANT 86: a knowledge-elicitation tool for sophisticated users. In: EWSL 87: Proceedings of the 2nd European Working Session on Learning, pp. 31–45. Sigma Press (1987)


  11. Chow, W.K.: A look at various estimators in logistic models in the presence of missing values. Technical report, N-1324-HEW. Rand Corporation, Santa Monica (1979)


  12. Cohen, J.: Multiple regression as a general data-analytic system. Psychol. Bull. 70(6), 426–443 (1968)


  13. Cohen, J., Cohen, P.: Missing data. In: Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences, pp. 265–290. Lawrence Erlbaum Associates, Hillsdale (1975)


  14. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)


  15. Cox, D.R.: Some procedures connected with the logistic qualitative response curve. In: David, F.N. (ed.) Research Papers in Statistics: Festschrift for J. Neyman, pp. 55–71. Wiley, London (1966)


  16. Das, S., Datta, S., Chaudhuri, B.B.: Handling data irregularities in classification: foundations, trends, and future challenges. Pattern Recogn. 81, 674–693 (2018)


  17. Detrano, R., et al.: International application of a new probability algorithm for the diagnosis of coronary artery disease. Am. J. Cardiol. 64(5), 304–310 (1989)


  18. Ding, Y., Simonoff, J.S.: An investigation of missing data methods for classification trees applied to binary response data. J. Mach. Learn. Res. 11(1), 131–170 (2010)


  19. Dixon, J.K.: Pattern recognition with partly missing data. IEEE Trans. Syst. Man Cybern. 9(10), 617–621 (1979)


  20. Dua, D., Graff, C.: UCI machine learning repository (2019). http://archive.ics.uci.edu/ml

  21. Dudani, S.A.: The distance-weighted k-nearest-neighbor rule. IEEE Trans. Syst. Man Cybern. 6(4), 325–327 (1976)


  22. Efron, B., Gong, G.: Statistical theory and the computer. In: Eddy, W.F. (ed.) Computer Science and Statistics: Proceedings of the 13th Symposium on the Interface, pp. 3–7. Springer, New York (1981). https://doi.org/10.1007/978-1-4613-9464-8_1

  23. Eirola, E.: Machine learning methods for incomplete data and variable selection. Ph.D. thesis, Aalto University, Espoo (2014)


  24. Elter, M., Schulz-Wendtland, R., Wittenberg, T.: The prediction of breast cancer biopsy outcomes using two CAD approaches that both emphasize an intelligible decision process. Med. Phys. 34(11), 4164–4172 (2007)


  25. Enders, C.K.: Applied Missing Data Analysis. Methodology in the Social Sciences. The Guilford Press, New York (2010)


  26. Evans, B., Fisher, D.: Overcoming process delays with decision tree induction. IEEE Expert 9(1), 60–66 (1994)


  27. Costa, C.F., Nascimento, M.A.: IDA 2016 industrial challenge: using machine learning for predicting failures. In: Boström, H., Knobbe, A., Soares, C., Papapetrou, P. (eds.) IDA 2016. LNCS, vol. 9897, pp. 381–386. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46349-0_33


  28. Fix, E., Hodges Jr., J.: Discriminatory analysis—nonparametric discrimination: consistency properties. Technical report, 21-49-004, USAF School of Aviation Medicine, Randolph Field, Texas (1951)


  29. Freund, Y., Schapire, R.E.: A desicion-theoretic generalization of on-line learning and an application to boosting. In: Vitányi, P. (ed.) EuroCOLT 1995. LNCS, vol. 904, pp. 23–37. Springer, Heidelberg (1995). https://doi.org/10.1007/3-540-59119-2_166


  30. Friedman, J.H.: Greedy function approximation: a gradient boosting machine. Ann. Stat. 29(5), 1189–1232 (2001)


  31. Fukushima, K.: Visual feature extraction by a multilayered network of analog threshold elements. IEEE Trans. Syst. Sci. Cybern. 5(4), 322–333 (1969)


  32. García, S., Luengo, J., Herrera, F.: Dealing with missing values. In: Data Preprocessing in Data Mining. Intelligent Systems Reference Library, vol. 72, chap. 4. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-10247-4

  33. Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Mach. Learn. 63(1), 3–42 (2006)


  34. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: AISTATS 2010: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research, vol. 9, pp. 249–256. JMLR Workshop and Conference Proceedings (2010)


  35. Golovenkin, S.E., et al.: Trajectories, bifurcations, and pseudo-time in large clinical datasets: applications to myocardial infarction and diabetes data. GigaScience 9(11), giaa128 (2020)


  36. Graham, J.W.: Missing data analysis: making it work in the real world. Annu. Rev. Psychol. 60, 549–576 (2009)


  37. Grzymala-Busse, J.W.: Knowledge acquisition under uncertainty-a rough set approach. J. Intell. Rob. Syst. 1(1), 3–16 (1988)


  38. Grzymala-Busse, J.W., Hu, M.: A comparison of several approaches to missing attribute values in data mining. In: Ziarko, W., Yao, Y. (eds.) RSCTC 2000. LNCS, vol. 2005, pp. 378–385. Springer, Heidelberg (2000). https://doi.org/10.1007/3-540-45554-X_46


  39. Güvenir, H.A., Acar, B., Demiröz, G., Çekin, A.: A supervised machine learning algorithm for arrhythmia analysis. In: Proceedings of the 24th Annual Meeting of Computers in Cardiology. Computers in Cardiology, vol. 24, pp. 433–436. IEEE (1997)


  40. Hand, D.J., Till, R.J.: A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach. Learn. 45(2), 171–186 (2001)


  41. van der Heijden, G.J.M.G., Donders, A.R.T., Stijnen, T., Moons, K.G.M.: Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: A clinical example. J. Clin. Epidemiol. 59(10), 1102–1109 (2006)


  42. Hutcheson, J.D., Jr., Prather, J.E.: Interpreting the effects of missing data in survey research. Southeastern Polit. Rev. 9(2), 129–143 (1981)


  43. Ipsen, N., Mattei, P.A., Frellsen, J.: How to deal with missing data in supervised deep learning? In: Artemiss 2020: First ICML Workshop on the Art of Learning with Missing Values (2020)


  44. Jones, M.P.: Indicator and stratification methods for missing explanatory variables in multiple linear regression. J. Am. Stat. Assoc. 91(433), 222–230 (1996)


  45. Josse, J., Chen, J.M., Prost, N., Varoquaux, G., Scornet, E.: On the consistency of supervised learning with missing values. Stat. Papers 65(9) (2024). https://doi.org/10.1007/s00362-024-01550-4

  46. Kim, J.O., Curry, J.: The treatment of missing data in multivariate analysis. Sociol. Methods Res. 6(2), 215–240 (1977)


  47. Kingma, D.P., Ba, J.L.: Adam: a method for stochastic optimization. In: ICLR 2015: 3rd International Conference on Learning Representations (2015)


  48. Kohavi, R.: Scaling up the accuracy of naive-bayes classifiers: a decision-tree hybrid. In: KDD-96: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 202–207. AAAI Press (1996)


  49. Le Morvan, M., Josse, J., Scornet, E., Varoquaux, G.: What’s a good imputation to predict with missing values? In: NeurIPS 2021: Proceedings of the Thirty-fifth Conference on Neural Information Processing Systems. Advances in neural information processing systems, vol. 34, pp. 11530–11540. NIPS Foundation (2021)


  50. Luengo, J., García, S., Herrera, F.: On the choice of the best imputation methods for missing values considering three groups of classification methods. Knowl. Inf. Syst. 32(1), 77–108 (2012)


  51. Luengo, J., Sáez, J.A., Herrera, F.: Missing data imputation for fuzzy rule-based classification systems. Soft. Comput. 16(5), 863–881 (2012)


  52. Marlin, B.M.: Missing data problems in machine learning. Ph.D. thesis, University of Toronto (2008)


  53. McCann, M., Li, Y., Maguire, L., Johnston, A.: Causality challenge: benchmarking relevant signal components for effective monitoring and process control. In: NIPS 2008: Proceedings of Workshop on Causality. Proceedings of Machine Learning Research, vol. 6, pp. 277–288. JMLR Workshop and Conference Proceedings (2008)


  54. McLeish, M., Cecile, M.: Enhancing medical expert systems with knowledge obtained from statistical data. Ann. Math. Artif. Intell. 2(1–4), 261–276 (1990)


  55. Michalski, R.S., Chilausky, R.L.: Learning by being told and learning from examples: an experimental comparison of the two methods of knowledge acquisition in the context of developing an expert system for soybean disease diagnosis. Int. J. Policy Anal. Inf. Syst. 4(2), 125–161 (1980)


  56. Ng, C.G., Yusoff, M.S.B.: Missing values in data analysis: ignore or impute? Educ. Med. J. 3(1) (2011)


  57. Orme, J.G., Reis, J.: Multiple regression with missing data. J. Soc. Serv. Res. 15(1–2), 61–91 (1991)


  58. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12(85), 2825–2830 (2011)


  59. Pereira Barata, A., Takes, F.W., van den Herik, H.J., Veenman, C.J.: Imputation methods outperform missing-indicator for data missing completely at random. In: ICDM 2019: Proceedings of the Workshops, pp. 407–414. IEEE (2019)


  60. Perez-Lebel, A., Varoquaux, G., Le Morvan, M., Josse, J., Poline, J.B.: Benchmarking missing-values approaches for predictive models on health databases. GigaScience 11(1), giac013 (2022)


  61. Pigott, T.D.: A review of methods for missing data. Educ. Res. Eval. 7(4), 353–383 (2001)


  62. Quinlan, J.R.: Simplifying decision trees. Int. J. Man Mach. Stud. 27(3), 221–234 (1987)


  63. Quinlan, J.R.: Unknown attribute values in induction. In: Proceedings of the Sixth International Workshop on Machine Learning, pp. 164–168. Morgan Kaufmann (1989)


  64. Quinlan, J.R., Compton, P.J., Horn, K.A., Lazarus, L.: Inductive knowledge acquisition: a case study. In: Proceedings of the Second Australian Conference on Applications of Expert Systems, pp. 157–173. Turing Institute Press (1986)


  65. Rosenblatt, F.: Principles of neurodynamics—perceptrons and the theory of brain mechanisms. Technical report VG-1196-G-8, Cornell Aeronautical Laboratory, Buffalo, New York (1961)


  66. Rubin, D.B.: Inference and missing data. Biometrika 63(3), 581–592 (1976)


  67. Rubini, L.J., Eswaran, P.: Generating comparative analysis of early stage prediction of chronic kidney disease. Int. J. Mod. Eng. Res. 5(7), 49–55 (2015)


  68. Santos, M.S., Abreu, P.H., García-Laencina, P.J., Simão, A., Carvalho, A.: A new cluster-based oversampling method for improving survival prediction of hepatocellular carcinoma patients. J. Biomed. Inform. 58, 49–59 (2015)


  69. Schafer, J.L.: Analysis of Incomplete Multivariate Data, Monographs on Statistics and Applied Probability, vol. 72. Chapman & Hall, London (1997)


  70. Schafer, J.L., Graham, J.W.: Missing data: our view of the state of the art. Psychol. Methods 7(2), 147–177 (2002)


  71. Schlimmer, J.C.: Concept acquisition through representational adjustment. Ph.D. thesis, University of California, Irvine (1987)


  72. Śmieja, M., Struski, Ł., Tabor, J., Marzec, M.: Generalized RBF kernel for incomplete data. Knowl.-Based Syst. 173, 150–162 (2019)


  73. Śmieja, M., Struski, Ł., Tabor, J., Zieliński, B., Spurek, P.: Processing of missing data by neural networks. In: NeurIPS 2018: Proceedings of the Thirty-second Annual Conference on Neural Information Processing Systems. Advances in Neural Information Processing Systems, vol. 31, pp. 689–696. NIPS Foundation (2018)


  74. Soltani Zarrin, P., Röckendorf, N., Wenger, C.: In-vitro classification of saliva samples of COPD patients and healthy controls using machine learning tools. IEEE Access 8, 168053–168060 (2020)


  75. Sperrin, M., Martin, G.P., Sisk, R., Peek, N.: Missing data should be handled differently for prediction than for description or causal explanation. J. Clin. Epidemiol. 125, 183–187 (2020)


  76. Stumpf, S.A.: A note on handling missing data. J. Manag. 4(1), 65–73 (1978)


  77. Tipping, M.E.: Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res. 1, 211–244 (2001)


  78. Tresp, V., Neuneier, R., Ahmad, S.: Efficient methods for dealing with missing data in supervised learning. In: NIPS-94: Proceedings of the Eighth Annual Conference on Neural Information Processing Systems. Advances in neural information processing systems, vol. 7, pp. 689–696. MIT Press (1994)


  79. Troyanskaya, O., et al.: Missing value estimation methods for DNA microarrays. Bioinformatics 17(6), 520–525 (2001)


  80. Twala, B.E., Jones, M., Hand, D.J.: Good methods for coping with missing data in decision trees. Pattern Recogn. Lett. 29(7), 950–956 (2008)


  81. Vamplew, P., Adams, A.: Missing values in a backpropagation neural net. In: ACNN 1992: Proceedings of the Third Australian Conference on Neural Networks, pp. 64–66. Sydney University Electrical Engineering (1992)


  82. Wilcoxon, F.: Individual comparisons by ranking methods. Biomet. Bull. 1(6), 80–83 (1945)


  83. Zhu, J., Zou, H., Rosset, S., Hastie, T.: Multi-class AdaBoost. Stat. Interface 2(3), 349–360 (2009)



Acknowledgement

The research reported in this paper was conducted with the financial support of the Odysseus programme of the Research Foundation – Flanders (FWO). This publication is part of the project Digital Twin with project number P18-03 of the research programme TTW Perspective, which is (partly) financed by the Dutch Research Council (NWO). We would like to express our thanks to Geert van der Heijden for answering a question about [41].

Author information

Correspondence to Oliver Urs Lenz.


Ethics declarations

Data and Code

Datasets and the code to reproduce our experiments are available at https://cwi.ugent.be/~oulenz/code/lenz-2024-no.tar.gz.

Disclosure of Interests

The authors have no competing interests to declare that are relevant to the content of this article.


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Lenz, O.U., Peralta, D., Cornelis, C. (2025). No Imputation Without Representation. In: Oliehoek, F.A., Kok, M., Verwer, S. (eds) Artificial Intelligence and Machine Learning. BNAIC/Benelearn 2023. Communications in Computer and Information Science, vol 2187. Springer, Cham. https://doi.org/10.1007/978-3-031-74650-5_1


  • DOI: https://doi.org/10.1007/978-3-031-74650-5_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-74649-9

  • Online ISBN: 978-3-031-74650-5

