Outlier Detection and Data Cleaning in Multivariate Non-Normal Samples: The PAELLA Algorithm

Data Mining and Knowledge Discovery

Abstract

A new method of outlier detection and data cleaning for both normal and non-normal multivariate data sets is proposed. It is based on an iterated local fit without a priori metric assumptions. The approach is supported by finite mixture clustering, which provides good results with large data sets. A multi-step structure consisting of three phases is developed. The importance of outlier detection in industrial modeling for open-loop control prediction is also described. The algorithm gives good results both in simulation runs with artificial data sets and with experimental data sets recorded in a rubber factory. Finally, the methodology is discussed.
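
The general idea of using cluster structure to detect outliers can be illustrated with a small sketch. This is not the PAELLA algorithm, whose three phases are not specified in this abstract. It is a simplified, single-pass stand-in that makes several assumptions: the data are two-dimensional, cluster labels are already available (for example from a finite mixture fit), each cluster is summarized by a sample mean and covariance, and a point is flagged when its squared Mahalanobis distance to every cluster exceeds a chi-square quantile. The function names `fit_component`, `mahalanobis_sq`, and `flag_outliers`, and the parameter `alpha`, are hypothetical.

```python
import math


def fit_component(points):
    """Return the sample mean and 2x2 covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_sq(point, mean, cov):
    """Squared Mahalanobis distance, using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


def flag_outliers(points, labels, alpha=0.99):
    """Flag points that are far from every cluster (assumed, simplified rule)."""
    # For 2 degrees of freedom the chi-square CDF is 1 - exp(-x / 2),
    # so the alpha-quantile has the closed form below.
    threshold = -2.0 * math.log(1.0 - alpha)
    components = [
        fit_component([p for p, lab in zip(points, labels) if lab == k])
        for k in sorted(set(labels))
    ]
    flagged = []
    for i, p in enumerate(points):
        # Distance to the closest cluster decides whether the point is flagged.
        d2 = min(mahalanobis_sq(p, mean, cov) for mean, cov in components)
        if d2 > threshold:
            flagged.append(i)
    return flagged


# Two compact artificial clusters plus one stray point assigned to cluster 0.
grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
cluster_a = [(x, y) for x in grid for y in grid]
cluster_b = [(x + 10.0, y + 10.0) for x in grid for y in grid]
points = cluster_a + cluster_b + [(5.0, 5.0)]
labels = [0] * 25 + [1] * 25 + [0]
print(flag_outliers(points, labels))
```

In this sketch the stray point itself inflates the covariance of the cluster it is assigned to, which is one reason a single pass can miss outliers and an iterated fit, as the abstract describes, may be preferable.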

Cite this article

Limas, M.C., Ordieres Meré, J.B., de Pisón Ascacibar, F.J.M. et al. Outlier Detection and Data Cleaning in Multivariate Non-Normal Samples: The PAELLA Algorithm. Data Mining and Knowledge Discovery 9, 171–187 (2004). https://doi.org/10.1023/B:DAMI.0000031630.50685.7c

  • DOI: https://doi.org/10.1023/B:DAMI.0000031630.50685.7c