
No Imputation Without Representation

  • Conference paper
  • In: Artificial Intelligence and Machine Learning (BNAIC/Benelearn 2023)

Abstract

By filling in missing values in datasets, imputation allows these datasets to be used with algorithms that cannot handle missing values by themselves. However, missing values may in principle contribute useful information that is lost through imputation. The missing-indicator approach can be used in combination with imputation to instead represent this information as a part of the dataset. There are several theoretical considerations why missing-indicators may or may not be beneficial, but there has not been any large-scale practical experiment on real-life datasets to test this question for machine learning predictions. We perform this experiment for three imputation strategies and a range of different classification algorithms, on the basis of twenty real-life datasets. In a follow-up experiment, we determine attribute-specific missingness thresholds for each classifier above which missing-indicators are more likely than not to increase classification performance. And in a second follow-up experiment, we evaluate numerical imputation of one-hot encoded categorical attributes. We reach the following conclusions. Firstly, missing-indicators generally increase classification performance. Secondly, with missing-indicators, nearest neighbour and iterative imputation do not lead to better performance than simple mean/mode imputation. Thirdly, for decision trees, pruning is necessary to prevent overfitting. Fourthly, the thresholds above which missing-indicators are more likely than not to improve performance are lower for categorical attributes than for numerical attributes. Lastly, mean imputation of numerical attributes preserves some of the information from missing values. Consequently, when not using missing-indicators it can be advantageous to apply mean imputation to one-hot encoded categorical attributes instead of mode imputation.
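As a minimal illustration of the missing-indicator approach combined with mean imputation, the following scikit-learn sketch (illustrative toy data, not the paper's actual pipeline) shows how an indicator column per incomplete attribute preserves the missingness information that imputation alone would discard:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data: both attributes contain missing values.
X = np.array([[1.0, np.nan],
              [2.0, 4.0],
              [np.nan, 6.0]])

# add_indicator=True appends one binary missing-indicator column
# per attribute that contains missing values.
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_out = imputer.fit_transform(X)

print(X_out.shape)  # (3, 4): two imputed columns plus two indicators
```

The indicator columns can then be passed to any classifier alongside the imputed attributes.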


Notes

  1. Some authors use the opposite convention, letting the indicator express non-missingness.

  2. We are grateful to an anonymous reviewer for this example.

  3. This is acknowledged by authors working under the assumption of MAR, e.g. “When data are missing for reasons beyond the investigator’s control, one can never be certain whether MAR holds. The MAR hypothesis in such datasets cannot be formally tested unless the missing values, or at least a sample of them, are available from an external source.” [69].

  4. Presumably, they use one-hot encoding for categorical attributes, in which case zero imputation is equivalent to treating missing values as a separate category, but they do not state this explicitly.

  5. For categorical values, encoding missing values as a separate category; for numerical values, encoding missing values as an extremely large value that can always be split from the other values.

  6. The target column of the echocardiogram dataset (‘alive-at-1’) is supposed to denote whether a patient survived for at least one year, but it does not appear to agree with the columns from which it is derived, which denote how long a patient survived and whether they were alive at the end of that period. The audiology dataset has a large number of small classes with complex labels and should perhaps be analysed with multi-label classification. In addition, it has ordinal attributes where the order of the values is not entirely clear, as well as three different values that potentially denote missingness (‘?’, ‘unmeasured’ and ‘absent’) whose relation to each other is not completely clear. The house-votes-84 dataset contains ‘?’ values, but its documentation explicitly states that these values are not unknown, but indicate different forms of abstention. The ozone dataset is a time-series problem, while the task associated with the sponge and water-treatment datasets is clustering, with no obvious target for classification among their respective attributes. Finally, the breast-cancer (9), cleveland (7), dermatology (8), lung-cancer (5), post-operative (3) and wisconsin (16) datasets contain only very few missing values, and any performance difference between missing value approaches on these datasets may to a large extent be coincidental.

  7. For the nomao dataset, iterative imputation diverged, so we had to restrict imputation to the interval [−100, 100].

  8. Setting aside 10% of the data for validation, stopping when validation loss has not decreased by at least 0.0001 for ten iterations, with a maximum of 10 000 iterations.

  9. LR is an exception here; we have no explanation for this.
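The equivalence observed in note 4 — zero imputation of one-hot encoded categorical attributes amounts to treating missingness as a separate category — can be seen in a small pandas sketch (the values are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series(["red", "blue", np.nan, "red"])

# By default, get_dummies ignores NaN, so a missing entry becomes
# an all-zero row -- exactly what zero imputation of the one-hot
# columns would produce.
print(pd.get_dummies(s).sum(axis=1).tolist())  # [1, 1, 0, 1]

# dummy_na=True instead encodes missingness as its own category,
# which carries the same information as an explicit indicator.
print(pd.get_dummies(s, dummy_na=True).sum(axis=1).tolist())  # [1, 1, 1, 1]
```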
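The restriction described in note 7 can be expressed with scikit-learn's IterativeImputer, whose min_value/max_value parameters clip imputed values to an interval (a sketch on synthetic data, not the nomao dataset itself):

```python
import numpy as np
# IterativeImputer is experimental; this import activates it.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic data with roughly 20% of values missing.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X[rng.random(X.shape) < 0.2] = np.nan

# min_value/max_value clip every imputed value to [-100, 100],
# preventing the chained regressions from diverging.
imputer = IterativeImputer(min_value=-100, max_value=100, random_state=0)
X_imputed = imputer.fit_transform(X)

print(np.isnan(X_imputed).any())  # False
```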
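The early-stopping settings of note 8 roughly correspond to the following scikit-learn configuration (an approximation: MLPClassifier's early stopping monitors validation score rather than validation loss, and the toy data is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Illustrative data; the experiments used real-life datasets.
X, y = make_classification(n_samples=200, random_state=0)

clf = MLPClassifier(
    early_stopping=True,      # hold out part of the training data
    validation_fraction=0.1,  # 10% validation split
    tol=1e-4,                 # minimum improvement of 0.0001
    n_iter_no_change=10,      # patience of ten iterations
    max_iter=10_000,          # hard cap on iterations
    random_state=0,
)
clf.fit(X, y)
```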

References

  1. Akaike, H.: Information theory and an extension of the maximum likelihood principle. In: Proceedings of the 2nd International Symposium on Information Theory, pp. 267–281. Akadémiai Kiadó (1971)


  2. Allison, P.D.: Missing Data. Sage Publications, Thousand Oaks (2001)


  3. Allison, P.D.: Missing data. In: Marsden, P.V., Wright, J.D. (eds.) Handbook of Survey Research, 2nd edn., chap. 20, pp. 631–657. Emerald Group Publishing, Bingley (2010)


  4. Anderson, A.B., Basilevsky, A., Hum, D.P.J.: Missing data: a review of the literature. In: Rossi, P.H., Wright, J.D., Anderson, A.B. (eds.) Handbook of Survey Research. Quantitive Studies in Social Relations, chap. 12, pp. 415–494. Academic Press, New York (1983)


  5. Aste, M., Boninsegna, M., Freno, A., Trentin, E.: Techniques for dealing with incomplete data: a tutorial and survey. Pattern Anal. Appl. 18(1), 1–29 (2015)


  6. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)


  7. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. The Wadsworth Statistics/probability Series. Wadsworth, Monterey, California (1984)


  8. van Buuren, S., Groothuis-Oudshoorn, K.: MICE: multivariate imputation by chained equations in R. J. Stat. Softw. 45(3), 1–67 (2011)


  9. Candillier, L., Lemaire, V.: Design and analysis of the Nomao challenge: active learning in the real-world. In: ECML-PKDD 2012: Active Learning in Real-world Applications Workshop (2012)


  10. Cestnik, B., Kononenko, I., Bratko, I.: ASSISTANT 86: a knowledge-elicitation tool for sophisticated users. In: EWSL 87: Proceedings of the 2nd European Working Session on Learning, pp. 31–45. Sigma Press (1987)


  11. Chow, W.K.: A look at various estimators in logistic models in the presence of missing values. Technical report, N-1324-HEW. Rand Corporation, Santa Monica (1979)


  12. Cohen, J.: Multiple regression as a general data-analytic system. Psychol. Bull. 70(6), 426–443 (1968)


  13. Cohen, J., Cohen, P.: Missing data. In: Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences, pp. 265–290. Lawrence Erlbaum Associates, Hillsdale (1975)


  14. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)


  15. Cox, D.R.: Some procedures connected with the logistic qualitative response curve. In: David, F.N. (ed.) Research Papers in Statistics: Festschrift for J. Neyman, pp. 55–71. Wiley, London (1966)


  16. Das, S., Datta, S., Chaudhuri, B.B.: Handling data irregularities in classification: foundations, trends, and future challenges. Pattern Recogn. 81, 674–693 (2018)


  17. Detrano, R., et al.: International application of a new probability algorithm for the diagnosis of coronary artery disease. Am. J. Cardiol. 64(5), 304–310 (1989)


  18. Ding, Y., Simonoff, J.S.: An investigation of missing data methods for classification trees applied to binary response data. J. Mach. Learn. Res. 11(1), 131–170 (2010)


  19. Dixon, J.K.: Pattern recognition with partly missing data. IEEE Trans. Syst. Man Cybern. 9(10), 617–621 (1979)


  20. Dua, D., Graff, C.: UCI machine learning repository (2019). http://archive.ics.uci.edu/ml

  21. Dudani, S.A.: The distance-weighted k-nearest-neighbor rule. IEEE Trans. Syst. Man Cybern. 6(4), 325–327 (1976)


  22. Efron, B., Gong, G.: Statistical theory and the computer. In: Eddy, W.F. (ed.) Computer Science and Statistics: Proceedings of the 13th Symposium on the Interface, pp. 3–7. Springer, New York (1981). https://doi.org/10.1007/978-1-4613-9464-8_1

  23. Eirola, E.: Machine learning methods for incomplete data and variable selection. Ph.D. thesis, Aalto University, Espoo (2014)


  24. Elter, M., Schulz-Wendtland, R., Wittenberg, T.: The prediction of breast cancer biopsy outcomes using two CAD approaches that both emphasize an intelligible decision process. Med. Phys. 34(11), 4164–4172 (2007)


  25. Enders, C.K.: Applied Missing Data Analysis. Methodology in the Social Sciences. The Guilford Press, New York (2010)


  26. Evans, B., Fisher, D.: Overcoming process delays with decision tree induction. IEEE Expert 9(1), 60–66 (1994)


  27. Costa, C.F., Nascimento, M.A.: IDA 2016 industrial challenge: using machine learning for predicting failures. In: Boström, H., Knobbe, A., Soares, C., Papapetrou, P. (eds.) IDA 2016. LNCS, vol. 9897, pp. 381–386. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46349-0_33


  28. Fix, E., Hodges Jr., J.: Discriminatory analysis—nonparametric discrimination: consistency properties. Technical report, 21-49-004, USAF School of Aviation Medicine, Randolph Field, Texas (1951)


  29. Freund, Y., Schapire, R.E.: A desicion-theoretic generalization of on-line learning and an application to boosting. In: Vitányi, P. (ed.) EuroCOLT 1995. LNCS, vol. 904, pp. 23–37. Springer, Heidelberg (1995). https://doi.org/10.1007/3-540-59119-2_166


  30. Friedman, J.H.: Greedy function approximation: a gradient boosting machine. Ann. Stat. 29(5), 1189–1232 (2001)


  31. Fukushima, K.: Visual feature extraction by a multilayered network of analog threshold elements. IEEE Trans. Syst. Sci. Cybern. 5(4), 322–333 (1969)


  32. García, S., Luengo, J., Herrera, F.: Dealing with missing values. In: Data Preprocessing in Data Mining. Intelligent Systems Reference Library, vol. 72, chap. 4. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-10247-4

  33. Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Mach. Learn. 63(1), 3–42 (2006)


  34. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: AISTATS 2010: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research, vol. 9, pp. 249–256. JMLR Workshop and Conference Proceedings (2010)


  35. Golovenkin, S.E., et al.: Trajectories, bifurcations, and pseudo-time in large clinical datasets: applications to myocardial infarction and diabetes data. GigaScience 9(11), giaa128 (2020)


  36. Graham, J.W.: Missing data analysis: making it work in the real world. Annu. Rev. Psychol. 60, 549–576 (2009)


  37. Grzymala-Busse, J.W.: Knowledge acquisition under uncertainty-a rough set approach. J. Intell. Rob. Syst. 1(1), 3–16 (1988)


  38. Grzymala-Busse, J.W., Hu, M.: A comparison of several approaches to missing attribute values in data mining. In: Ziarko, W., Yao, Y. (eds.) RSCTC 2000. LNCS, vol. 2005, pp. 378–385. Springer, Heidelberg (2000). https://doi.org/10.1007/3-540-45554-X_46


  39. Güvenir, H.A., Acar, B., Demiröz, G., Çekin, A.: A supervised machine learning algorithm for arrhythmia analysis. In: Proceedings of the 24th Annual Meeting of Computers in Cardiology. Computers in Cardiology, vol. 24, pp. 433–436. IEEE (1997)


  40. Hand, D.J., Till, R.J.: A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach. Learn. 45(2), 171–186 (2001)


  41. van der Heijden, G.J.M.G., Donders, A.R.T., Stijnen, T., Moons, K.G.M.: Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: A clinical example. J. Clin. Epidemiol. 59(10), 1102–1109 (2006)


  42. Hutcheson, J.D., Jr., Prather, J.E.: Interpreting the effects of missing data in survey research. Southeastern Polit. Rev. 9(2), 129–143 (1981)


  43. Ipsen, N., Mattei, P.A., Frellsen, J.: How to deal with missing data in supervised deep learning? In: Artemiss 2020: First ICML Workshop on the Art of Learning with Missing Values (2020)


  44. Jones, M.P.: Indicator and stratification methods for missing explanatory variables in multiple linear regression. J. Am. Stat. Assoc. 91(433), 222–230 (1996)


  45. Josse, J., Chen, J.M., Prost, N., Varoquaux, G., Scornet, E.: On the consistency of supervised learning with missing values. Stat. Papers 65(9) (2024). https://doi.org/10.1007/s00362-024-01550-4

  46. Kim, J.O., Curry, J.: The treatment of missing data in multivariate analysis. Sociol. Methods Res. 6(2), 215–240 (1977)


  47. Kingma, D.P., Ba, J.L.: Adam: a method for stochastic optimization. In: ICLR 2015: 3rd International Conference on Learning Representations (2015)


  48. Kohavi, R.: Scaling up the accuracy of naive-bayes classifiers: a decision-tree hybrid. In: KDD-96: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 202–207. AAAI Press (1996)


  49. Le Morvan, M., Josse, J., Scornet, E., Varoquaux, G.: What’s a good imputation to predict with missing values? In: NeurIPS 2021: Proceedings of the Thirty-fifth Conference on Neural Information Processing Systems. Advances in neural information processing systems, vol. 34, pp. 11530–11540. NIPS Foundation (2021)


  50. Luengo, J., García, S., Herrera, F.: On the choice of the best imputation methods for missing values considering three groups of classification methods. Knowl. Inf. Syst. 32(1), 77–108 (2012)


  51. Luengo, J., Sáez, J.A., Herrera, F.: Missing data imputation for fuzzy rule-based classification systems. Soft. Comput. 16(5), 863–881 (2012)


  52. Marlin, B.M.: Missing data problems in machine learning. Ph.D. thesis, University of Toronto (2008)


  53. McCann, M., Li, Y., Maguire, L., Johnston, A.: Causality challenge: benchmarking relevant signal components for effective monitoring and process control. In: NIPS 2008: Proceedings of Workshop on Causality. Proceedings of Machine Learning Research, vol. 6, pp. 277–288. JMLR Workshop and Conference Proceedings (2008)


  54. McLeish, M., Cecile, M.: Enhancing medical expert systems with knowledge obtained from statistical data. Ann. Math. Artif. Intell. 2(1–4), 261–276 (1990)


  55. Michalski, R.S., Chilausky, R.L.: Learning by being told and learning from examples: an experimental comparison of the two methods of knowledge acquisition in the context of developing an expert system for soybean disease diagnosis. Int. J. Policy Anal. Inf. Syst. 4(2), 125–161 (1980)


  56. Ng, C.G., Yusoff, M.S.B.: Missing values in data analysis: ignore or impute? Educ. Med. J. 3(1) (2011)


  57. Orme, J.G., Reis, J.: Multiple regression with missing data. J. Soc. Serv. Res. 15(1–2), 61–91 (1991)


  58. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12(85), 2825–2830 (2011)


  59. Pereira Barata, A., Takes, F.W., van den Herik, H.J., Veenman, C.J.: Imputation methods outperform missing-indicator for data missing completely at random. In: ICDM 2019: Proceedings of the Workshops, pp. 407–414. IEEE (2019)


  60. Perez-Lebel, A., Varoquaux, G., Le Morvan, M., Josse, J., Poline, J.B.: Benchmarking missing-values approaches for predictive models on health databases. GigaScience 11(1), giac013 (2022)


  61. Pigott, T.D.: A review of methods for missing data. Educ. Res. Eval. 7(4), 353–383 (2001)


  62. Quinlan, J.R.: Simplifying decision trees. Int. J. Man Mach. Stud. 27(3), 221–234 (1987)


  63. Quinlan, J.R.: Unknown attribute values in induction. In: Proceedings of the Sixth International Workshop on Machine Learning, pp. 164–168. Morgan Kaufmann (1989)


  64. Quinlan, J.R., Compton, P.J., Horn, K.A., Lazarus, L.: Inductive knowledge acquisition: a case study. In: Proceedings of the Second Australian Conference on Applications of Expert Systems, pp. 157–173. Turing Institute Press (1986)


  65. Rosenblatt, F.: Principles of neurodynamics—perceptrons and the theory of brain mechanisms. Technical report VG-1196-G-8, Cornell Aeronautical Laboratory, Buffalo, New York (1961)


  66. Rubin, D.B.: Inference and missing data. Biometrika 63(3), 581–592 (1976)


  67. Rubini, L.J., Eswaran, P.: Generating comparative analysis of early stage prediction of chronic kidney disease. Int. J. Mod. Eng. Res. 5(7), 49–55 (2015)


  68. Santos, M.S., Abreu, P.H., García-Laencina, P.J., Simão, A., Carvalho, A.: A new cluster-based oversampling method for improving survival prediction of hepatocellular carcinoma patients. J. Biomed. Inform. 58, 49–59 (2015)


  69. Schafer, J.L.: Analysis of Incomplete Multivariate Data, Monographs on Statistics and Applied Probability, vol. 72. Chapman & Hall, London (1997)


  70. Schafer, J.L., Graham, J.W.: Missing data: our view of the state of the art. Psychol. Methods 7(2), 147–177 (2002)


  71. Schlimmer, J.C.: Concept acquisition through representational adjustment. Ph.D. thesis, University of California, Irvine (1987)


  72. Śmieja, M., Struski, Ł., Tabor, J., Marzec, M.: Generalized RBF kernel for incomplete data. Knowl.-Based Syst. 173, 150–162 (2019)


  73. Śmieja, M., Struski, Ł., Tabor, J., Zieliński, B., Spurek, P.: Processing of missing data by neural networks. In: NeurIPS 2018: Proceedings of the Thirty-second Annual Conference on Neural Information Processing Systems. Advances in Neural Information Processing Systems, vol. 31, pp. 689–696. NIPS Foundation (2018)


  74. Soltani Zarrin, P., Röckendorf, N., Wenger, C.: In-vitro classification of saliva samples of COPD patients and healthy controls using machine learning tools. IEEE Access 8, 168053–168060 (2020)


  75. Sperrin, M., Martin, G.P., Sisk, R., Peek, N.: Missing data should be handled differently for prediction than for description or causal explanation. J. Clin. Epidemiol. 125, 183–187 (2020)


  76. Stumpf, S.A.: A note on handling missing data. J. Manag. 4(1), 65–73 (1978)


  77. Tipping, M.E.: Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res. 1, 211–244 (2001)


  78. Tresp, V., Neuneier, R., Ahmad, S.: Efficient methods for dealing with missing data in supervised learning. In: NIPS-94: Proceedings of the Eighth Annual Conference on Neural Information Processing Systems. Advances in neural information processing systems, vol. 7, pp. 689–696. MIT Press (1994)


  79. Troyanskaya, O., et al.: Missing value estimation methods for DNA microarrays. Bioinformatics 17(6), 520–525 (2001)


  80. Twala, B.E., Jones, M., Hand, D.J.: Good methods for coping with missing data in decision trees. Pattern Recogn. Lett. 29(7), 950–956 (2008)


  81. Vamplew, P., Adams, A.: Missing values in a backpropagation neural net. In: ACNN 1992: Proceedings of the Third Australian Conference on Neural Networks, pp. 64–66. Sydney University Electrical Engineering (1992)


  82. Wilcoxon, F.: Individual comparisons by ranking methods. Biomet. Bull. 1(6), 80–83 (1945)


  83. Zhu, J., Zou, H., Rosset, S., Hastie, T.: Multi-class AdaBoost. Stat. Interface 2(3), 349–360 (2009)



Acknowledgement

The research reported in this paper was conducted with the financial support of the Odysseus programme of the Research Foundation – Flanders (FWO). This publication is part of the project Digital Twin with project number P18-03 of the research programme TTW Perspective, which is (partly) financed by the Dutch Research Council (NWO). We would like to express our thanks to Geert van der Heijden for answering a question about [41].

Author information

Correspondence to Oliver Urs Lenz.


Ethics declarations

Data and Code

Datasets and the code to reproduce our experiments are available at https://cwi.ugent.be/~oulenz/code/lenz-2024-no.tar.gz.

Disclosure of Interests

The authors have no competing interests to declare that are relevant to the content of this article.


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Lenz, O.U., Peralta, D., Cornelis, C. (2025). No Imputation Without Representation. In: Oliehoek, F.A., Kok, M., Verwer, S. (eds) Artificial Intelligence and Machine Learning. BNAIC/Benelearn 2023. Communications in Computer and Information Science, vol 2187. Springer, Cham. https://doi.org/10.1007/978-3-031-74650-5_1


  • DOI: https://doi.org/10.1007/978-3-031-74650-5_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-74649-9

  • Online ISBN: 978-3-031-74650-5

