The Impact of Clustering-Based Sequential Multivariate Outliers Detection in Handling Missing Values

Agustini, Mety; Fithriasari, Kartika; Prastyo, Dedy Dwi

doi:10.1007/978-981-97-0293-0_17


Part of the book series: Lecture Notes on Data Engineering and Communications Technologies (LNDECT, volume 191)

Included in the following conference series:

The International Conference on Data Science and Emerging Technologies


Abstract

Missing values are a common problem that leaves data incomplete across a wide range of research. They reduce the amount of usable data and degrade the statistical power of the analysis. Many studies have therefore focused on methods for missing value imputation. When the dataset also contains outliers, the imputed values may be incorrect or deviate significantly from the actual values. Handling missing values and outliers simultaneously is thus one of the challenges affecting data quality. Several studies removed outliers before imputing missing values, or deleted observations with missing values before detecting outliers. This removal approach discards information contained in the data. Other researchers integrate clustering methods into the imputation process to mitigate the impact of outliers and data variation, thereby improving the accuracy of the imputation model. This paper proposes a new clustering-based sequential multivariate outlier detection (SMOD) method to handle incomplete data that contain outliers. The method is applied to an official economic statistics dataset that contains outliers, under a scenario with a missing value rate of about 50 percent. Compared with model-based clustering (MBC), a well-known and widely used clustering technique, the proposed method works well in missing value imputation.
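The abstract's point that outliers can distort imputed values can be illustrated with a minimal sketch. It is not the SMOD method proposed in the paper, whose details are not given here. It assumes a simplified stand-in: outliers are flagged one at a time by squared Mahalanobis distance against a chi-square cutoff, and a missing value is then filled with the mean of the non-flagged rows rather than the mean of all rows. The toy data, the function names, and the 0.975 cutoff probability are all illustrative assumptions.

```python
import math

def mean_and_cov(points):
    """Sample mean and 2x2 covariance (n - 1 denominator) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis_sq(p, mean, cov):
    """Squared Mahalanobis distance using the closed-form 2x2 inverse."""
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    return (dx * dx * syy - 2 * dx * dy * sxy + dy * dy * sxx) / det

def flag_outliers_sequentially(points, prob=0.975):
    """Remove the most distant point one at a time until none exceeds the cutoff.

    Illustrative stand-in only, not the paper's SMOD algorithm.
    """
    # Chi-square quantile; this closed form is exact for 2 degrees of freedom.
    cutoff = -2.0 * math.log(1.0 - prob)
    inliers = list(range(len(points)))
    outliers = []
    while len(inliers) > 3:
        mean, cov = mean_and_cov([points[i] for i in inliers])
        worst = max(inliers, key=lambda i: mahalanobis_sq(points[i], mean, cov))
        if mahalanobis_sq(points[worst], mean, cov) <= cutoff:
            break
        inliers.remove(worst)
        outliers.append(worst)
    return inliers, outliers

# Hypothetical complete rows (x, y); the last row is a deliberate gross outlier.
complete = [
    (1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.0), (5.0, 10.1),
    (6.0, 11.9), (7.0, 14.0), (8.0, 16.1), (9.0, 17.9), (10.0, 20.0),
    (5.0, 60.0),
]

inliers, outliers = flag_outliers_sequentially(complete)

# Fill value for a missing y: mean over all rows versus mean over inlier rows.
naive_fill = sum(y for _, y in complete) / len(complete)
robust_fill = sum(complete[i][1] for i in inliers) / len(inliers)

print("flagged rows:", outliers)
print("naive fill:", round(naive_fill, 2))
print("outlier-aware fill:", round(robust_fill, 2))
```

On this toy data, the single extreme row pulls the naive fill value well above the range in which most observations lie, while the outlier-aware fill stays near the bulk of the data. The flagged row is kept in the dataset and only excluded from the imputation statistic, which is in the spirit of the abstract's concern that deleting observations discards information.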



Author information

Authors and Affiliations

Department of Statistics, Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia
Mety Agustini, Kartika Fithriasari & Dedy Dwi Prastyo
Badan Pusat Statistik Provinsi Kepulauan Bangka Belitung, Pangkalpinang, Indonesia
Mety Agustini


Corresponding author

Correspondence to Kartika Fithriasari.

Editor information

Editors and Affiliations

UNITAR Graduate School, UNITAR International University, Petaling Jaya, Malaysia
Yap Bee Wah
Faculty of Engineering and Technology, Liverpool John Moores University, Liverpool, UK
Dhiya Al-Jumeily OBE
University of Tennessee, Knoxville, TN, USA
Michael W. Berry


About this paper

Cite this paper

Agustini, M., Fithriasari, K., Prastyo, D.D. (2024). The Impact of Clustering-Based Sequential Multivariate Outliers Detection in Handling Missing Values. In: Bee Wah, Y., Al-Jumeily OBE, D., Berry, M.W. (eds) Data Science and Emerging Technologies. DaSET 2023. Lecture Notes on Data Engineering and Communications Technologies, vol 191. Springer, Singapore. https://doi.org/10.1007/978-981-97-0293-0_17


DOI: https://doi.org/10.1007/978-981-97-0293-0_17
Published: 27 April 2024
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-0292-3
Online ISBN: 978-981-97-0293-0
eBook Packages: Intelligent Technologies and Robotics (R0)
