Abstract
A new method of outlier detection and data cleaning for both normal and non-normal multivariate data sets is proposed. It is based on an iterated local fit without a priori metric assumptions. We propose a new approach, supported by finite mixture clustering, that provides good results with large data sets. A multi-step structure consisting of three phases is developed. The importance of outlier detection in industrial modeling for open-loop control prediction is also described. The proposed algorithm gives good results both in simulation runs with artificial data sets and with experimental data sets recorded in a rubber factory. Finally, the methodology is discussed.
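The core idea sketched in the abstract, fitting a finite mixture model to the data and treating points that the fitted model deems unlikely as outlier candidates, can be illustrated in miniature. The snippet below is a hedged toy sketch, not the PAELLA algorithm itself: it fits a one-dimensional two-component Gaussian mixture by plain EM and scores each point by its mixture density, so low-density points can be nominated as outliers. All function names (`gauss`, `fit_mixture`, `mixture_density`) are hypothetical, introduced only for this illustration.

```python
import math
import random

def gauss(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def fit_mixture(data, k=2, iters=100, seed=0):
    """Fit a k-component 1-D Gaussian mixture by plain EM.

    Returns (weights, means, sigmas). Toy illustration only:
    no covariance structure, no model selection, no robustness tricks.
    """
    rng = random.Random(seed)
    means = rng.sample(list(data), k)          # initialize at k random data points
    spread = (max(data) - min(data)) or 1.0
    sigmas = [spread / k] * k
    weights = [1.0 / k] * k
    n = len(data)
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            p = [w * gauss(x, m, s) for w, m, s in zip(weights, means, sigmas)]
            tot = sum(p) or 1e-300
            resp.append([pj / tot for pj in p])
        # M-step: re-estimate mixing weights, means, and spreads
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-12
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            sigmas[j] = math.sqrt(max(var, 1e-6))  # floor avoids collapse to zero width
            weights[j] = nj / n
    return weights, means, sigmas

def mixture_density(x, weights, means, sigmas):
    """Likelihood of x under the fitted mixture; low values suggest outliers."""
    return sum(w * gauss(x, m, s) for w, m, s in zip(weights, means, sigmas))
```

In a cleaning pass, points whose mixture density falls below a data-driven threshold (for instance, a low empirical quantile of the densities of the sample itself) would be flagged for inspection or removal; the actual algorithm additionally iterates local fits without a priori metric assumptions.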
Cite this article
Limas, M.C., Ordieres Meré, J.B., de Pisón Ascacibar, F.J.M. et al. Outlier Detection and Data Cleaning in Multivariate Non-Normal Samples: The PAELLA Algorithm. Data Mining and Knowledge Discovery 9, 171–187 (2004). https://doi.org/10.1023/B:DAMI.0000031630.50685.7c