The Impact of Clustering-Based Sequential Multivariate Outliers Detection in Handling Missing Values

  • Conference paper
Data Science and Emerging Technologies (DaSET 2023)

Abstract

Missing values are a common problem that leaves data incomplete across a wide range of research. They reduce the amount of usable data and degrade the statistical power of the analysis. Numerous studies have focused on methods for missing value imputation. When the dataset also contains outliers, imputed values may be incorrect or deviate significantly from the actual values. Handling missing values and outliers simultaneously is one of the challenges that affect data quality. Several studies removed outliers before imputing missing values, or deleted observations with missing values before detecting outliers; either removal approach discards information contained in the data. Other researchers integrate clustering methods into missing value imputation to mitigate the impact of outliers and data variation, thereby improving the accuracy of the imputation model. This paper proposes a new clustering-based sequential multivariate outlier detection (SMOD) method to handle incomplete data that contain outliers. The method is applied to an official economic statistics dataset that contains outliers, under a scenario with a missing value rate of about 50 percent. Compared with model-based clustering (MBC), a well-known and widely used clustering technique, the proposed method performs well in missing value imputation.
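
The abstract's point that outliers can pull imputed values away from the actual ones can be illustrated with a small sketch. This is not the SMOD method proposed in the paper: a simple per-variable median/MAD rule stands in for multivariate outlier detection, plain mean imputation stands in for the imputation model, and no clustering is performed. The function names, the 3.5 threshold, and the toy data are all assumptions made for illustration only.

```python
import numpy as np


def flag_outliers_mad(data, threshold=3.5):
    """Flag rows in which any observed value has a robust z-score above
    `threshold`. Uses a per-column median/MAD rule (a univariate stand-in,
    NOT the paper's multivariate detection). NaNs are ignored."""
    data = np.asarray(data, dtype=float)
    median = np.nanmedian(data, axis=0)
    mad = np.nanmedian(np.abs(data - median), axis=0)
    scale = 1.4826 * mad          # makes MAD comparable to a standard deviation
    scale[scale == 0] = np.nan    # a column with zero spread cannot flag anything
    z = np.abs(data - median) / scale
    return np.nan_to_num(z, nan=0.0).max(axis=1) > threshold


def impute_column_means(data, exclude=None):
    """Fill NaNs with column means. Rows marked True in `exclude` are left
    out when the means are computed, but are still kept in the result."""
    data = np.asarray(data, dtype=float)
    if exclude is None:
        keep = np.ones(len(data), dtype=bool)
    else:
        keep = ~np.asarray(exclude, dtype=bool)
    means = np.nanmean(data[keep], axis=0)
    filled = data.copy()
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = means[cols]
    return filled


# Hypothetical toy data: two variables, one missing value, one extreme row.
data = np.array([
    [10.0, 100.0],
    [11.0, 110.0],
    [12.0, np.nan],
    [13.0, 130.0],
    [14.0, 140.0],
    [500.0, 5000.0],
])

flags = flag_outliers_mad(data)
naive = impute_column_means(data)                  # outlier included in the mean
robust = impute_column_means(data, exclude=flags)  # outlier excluded from the mean

print("flagged rows:", np.where(flags)[0])
print("naive fill:", naive[2, 1], "| fill with outlier excluded:", robust[2, 1])
```

In this toy case the naive fill is dragged far above every ordinary observation, while the fill computed without the flagged row stays within their range. The flagged row itself is not deleted, which mirrors the abstract's concern that removing observations discards information.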

Corresponding author

Correspondence to Kartika Fithriasari.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Agustini, M., Fithriasari, K., Prastyo, D.D. (2024). The Impact of Clustering-Based Sequential Multivariate Outliers Detection in Handling Missing Values. In: Bee Wah, Y., Al-Jumeily OBE, D., Berry, M.W. (eds) Data Science and Emerging Technologies. DaSET 2023. Lecture Notes on Data Engineering and Communications Technologies, vol 191. Springer, Singapore. https://doi.org/10.1007/978-981-97-0293-0_17
