Abstract
Missing values are a common problem that leaves data incomplete across a wide range of research. They reduce the amount of usable data and degrade the statistical power of the analysis. Many studies have therefore focused on methods for missing value imputation. When the dataset also contains outliers, imputed values may be incorrect or deviate substantially from the true values. Handling missing values and outliers simultaneously is thus one of the challenges affecting data quality. Several studies removed outliers before imputing missing values, or deleted observations with missing values before detecting outliers. This removal approach discards information contained in the data. Other researchers integrate clustering methods into the imputation process to mitigate the impact of outliers and data variation, thereby improving the accuracy of the imputation model. This paper proposes a new clustering-based sequential multivariate outlier detection (SMOD) method to handle incomplete data that contain outliers. The method is applied to an official economic statistics dataset that contains outliers, under a scenario with a missing value rate of about 50 percent. Compared with model-based clustering (MBC), a well-known and widely used clustering technique, the proposed method performs well in missing value imputation.
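To illustrate why outliers matter for imputation, the following is a minimal sketch and not the SMOD method proposed in the paper, whose details the abstract does not give. It assumes a classical Mahalanobis distance with a chi-square cutoff as a stand-in outlier detector and plain mean imputation as a stand-in imputation model. The function names, the toy data, and the 0.975 quantile are all illustrative assumptions.

```python
import numpy as np

def mahalanobis_sq(X, center, cov):
    """Squared Mahalanobis distance of each row of X from center."""
    diff = X - center
    inv = np.linalg.inv(cov)
    return np.einsum("ij,jk,ik->i", diff, inv, diff)

def impute_ignoring_outliers(X, cutoff):
    """Mean-impute NaNs using only complete rows not flagged as outliers.

    Hypothetical helper for illustration; returns the imputed array and a
    boolean mask of flagged rows. No rows are deleted.
    """
    X = np.asarray(X, dtype=float)
    complete = ~np.isnan(X).any(axis=1)
    Xc = X[complete]
    d2 = mahalanobis_sq(Xc, Xc.mean(axis=0), np.cov(Xc, rowvar=False))

    # Column means computed from the non-flagged complete rows only.
    fill = Xc[d2 <= cutoff].mean(axis=0)

    out = X.copy()
    rows, cols = np.where(np.isnan(out))
    out[rows, cols] = fill[cols]

    flagged = np.zeros(len(X), dtype=bool)
    flagged[complete] = d2 > cutoff
    return out, flagged

# Toy data (assumed): 30 regular points, one gross outlier, two incomplete rows.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(size=(30, 2)),
    [100.0, 100.0],
    [0.5, np.nan],
    [np.nan, -0.2],
])

# For two variables the chi-square quantile has the closed form -2*ln(1 - p).
cutoff = -2.0 * np.log(1 - 0.975)
imputed, flagged = impute_ignoring_outliers(X, cutoff)
```

In this sketch the outlier is kept in the data and only excluded from the statistics used for imputation, which reflects the abstract's point that deleting observations discards information. A classical Mahalanobis distance can itself be distorted by outliers, so a robust estimator would be a natural replacement in practice.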
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Agustini, M., Fithriasari, K., Prastyo, D.D. (2024). The Impact of Clustering-Based Sequential Multivariate Outliers Detection in Handling Missing Values. In: Bee Wah, Y., Al-Jumeily OBE, D., Berry, M.W. (eds) Data Science and Emerging Technologies. DaSET 2023. Lecture Notes on Data Engineering and Communications Technologies, vol 191. Springer, Singapore. https://doi.org/10.1007/978-981-97-0293-0_17
DOI: https://doi.org/10.1007/978-981-97-0293-0_17
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-0292-3
Online ISBN: 978-981-97-0293-0
eBook Packages: Intelligent Technologies and Robotics (R0)