Dropping Incomplete Records is (not so) Straightforward

Schouten, Rianne M.; Taşcău, Victoria; Ziegler, Gabriel G.; Casano, Davide; Ardizzone, Marco; Erotokritou, Michael-Angelos

doi:10.1007/978-3-031-30047-9_30

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13876)

Included in the following conference series: International Symposium on Intelligent Data Analysis

Abstract

A straightforward approach to handling missing values is dropping incomplete records from the dataset. However, for many forms of missingness, this method is known to affect the center and spread of the data distribution. In this paper, we perform an extensive empirical evaluation of the effect of the drop method on the data distribution. In particular, we analyze two scenarios that are likely to occur in practice but are not often considered in simulation studies: 1) when features are skewed rather than symmetrically distributed and 2) when multiple forms of missingness occur simultaneously in one feature. Furthermore, we investigate implications of the drop method for classification accuracy and demonstrate that dropping incomplete records is doubtful, even when test cases are dropped as well.
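The effect described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' experimental setup: the log-normal feature, the 30% MCAR rate, the logistic missingness function, and the names `simulate_drop` and `p_missing` are all assumptions chosen for illustration. It compares the mean and standard deviation of a right-skewed feature before and after dropping incomplete records under two missingness mechanisms.

```python
import math
import random
import statistics


def simulate_drop(n=20000, seed=0):
    """Compare mean and spread of a skewed feature before and after dropping.

    All distributions and parameters here are illustrative assumptions.
    """
    rng = random.Random(seed)

    # Right-skewed feature (log-normal), standing in for a skewed real feature.
    x = [rng.lognormvariate(0.0, 0.75) for _ in range(n)]
    median = statistics.median(x)

    # MCAR: every value has the same 30% chance of being missing.
    kept_mcar = [v for v in x if rng.random() >= 0.3]

    # Right-tailed, value-dependent missingness: the larger the value,
    # the more likely it is to be missing (probability 0.5 at the median).
    def p_missing(v):
        return 1.0 / (1.0 + math.exp(-3.0 * (math.log(v) - math.log(median))))

    kept_right = [v for v in x if rng.random() >= p_missing(v)]

    def summary(values):
        return statistics.fmean(values), statistics.stdev(values)

    return {
        "complete": summary(x),
        "mcar_drop": summary(kept_mcar),
        "right_tail_drop": summary(kept_right),
    }


if __name__ == "__main__":
    for name, (mean, sd) in simulate_drop().items():
        print(f"{name:>16}: mean={mean:.3f}  sd={sd:.3f}")
```

Under these assumptions, dropping records under MCAR should leave the mean and spread close to those of the complete data, whereas dropping under right-tailed missingness should lower both, because the largest values are removed preferentially.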

Notes

1. https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic).
2. N.B.: In the general case, this may affect the training and test distributions, but it is unclear how. Homogeneity might increase, but the data might also become more scattered, in which case variance might increase. Since the distribution can be affected in many different ways, we simply ignore this effect; note that, technically, this might affect the definition of accuracy.

Acknowledgments

Many thanks to Dr. Wouter Duivesteijn and Prof. Mykola Pechenizkiy for their continuous support in all possible ways. Thank you, Hilde Weerts, for being a sparring partner.

Author information

Authors and Affiliations

Eindhoven University of Technology, Eindhoven, The Netherlands
Rianne M. Schouten, Victoria Taşcău, Gabriel G. Ziegler, Davide Casano, Marco Ardizzone & Michael-Angelos Erotokritou

Corresponding author

Correspondence to Victoria Taşcău.

Editor information

Editors and Affiliations

Université de Caen Normandie, Caen, France
Bruno Crémilleux
Eindhoven University of Technology, Eindhoven, The Netherlands
Sibylle Hess
UCLouvain, Louvain-la-Neuve, Belgium
Siegfried Nijssen

About this paper

Cite this paper

Schouten, R.M., Taşcău, V., Ziegler, G.G., Casano, D., Ardizzone, M., Erotokritou, MA. (2023). Dropping Incomplete Records is (not so) Straightforward. In: Crémilleux, B., Hess, S., Nijssen, S. (eds) Advances in Intelligent Data Analysis XXI. IDA 2023. Lecture Notes in Computer Science, vol 13876. Springer, Cham. https://doi.org/10.1007/978-3-031-30047-9_30

DOI: https://doi.org/10.1007/978-3-031-30047-9_30
Published: 01 April 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-30046-2
Online ISBN: 978-3-031-30047-9
eBook Packages: Computer Science, Computer Science (R0)
