Abstract
A new method of outlier detection and data cleaning for both normal and non-normal multivariate data sets is proposed. It is based on an iterated local fit without a priori metric assumptions. We propose a new approach, supported by finite mixture clustering, that provides good results with large data sets. A multi-step structure consisting of three phases is developed. The importance of outlier detection in industrial modeling for open-loop control prediction is also described. The proposed algorithm gives good results both in simulation runs with artificial data sets and with experimental data sets recorded in a rubber factory. Finally, the methodology is discussed.
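The core idea sketched in the abstract, fitting a finite mixture model to the data and treating points that the fitted model deems unlikely as outlier candidates, can be illustrated in miniature. The snippet below is a hedged toy sketch, not the PAELLA algorithm itself: it fits a one-dimensional two-component Gaussian mixture by plain EM and scores each point by its mixture density, so low-density points can be nominated as outliers. All function names (`gauss`, `fit_mixture`, `mixture_density`) are hypothetical, introduced only for this illustration.

```python
import math
import random

def gauss(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def fit_mixture(data, k=2, iters=100, seed=0):
    """Fit a k-component 1-D Gaussian mixture by plain EM.

    Returns (weights, means, sigmas). Toy illustration only:
    no covariance structure, no model selection, no robustness tricks.
    """
    rng = random.Random(seed)
    means = rng.sample(list(data), k)          # initialize at k random data points
    spread = (max(data) - min(data)) or 1.0
    sigmas = [spread / k] * k
    weights = [1.0 / k] * k
    n = len(data)
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            p = [w * gauss(x, m, s) for w, m, s in zip(weights, means, sigmas)]
            tot = sum(p) or 1e-300
            resp.append([pj / tot for pj in p])
        # M-step: re-estimate mixing weights, means, and spreads
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-12
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            sigmas[j] = math.sqrt(max(var, 1e-6))  # floor avoids collapse to zero width
            weights[j] = nj / n
    return weights, means, sigmas

def mixture_density(x, weights, means, sigmas):
    """Likelihood of x under the fitted mixture; low values suggest outliers."""
    return sum(w * gauss(x, m, s) for w, m, s in zip(weights, means, sigmas))
```

In a cleaning pass, points whose mixture density falls below a data-driven threshold (for instance, a low empirical quantile of the densities of the sample itself) would be flagged for inspection or removal; the actual algorithm additionally iterates local fits without a priori metric assumptions.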
Cite this article
Limas, M.C., Ordieres Meré, J.B., de Pisón Ascacibar, F.J.M. et al. Outlier Detection and Data Cleaning in Multivariate Non-Normal Samples: The PAELLA Algorithm. Data Mining and Knowledge Discovery 9, 171–187 (2004). https://doi.org/10.1023/B:DAMI.0000031630.50685.7c