Summary
The detection of multidimensional outliers is a fundamental problem in applied statistics. The unreliability of multivariate outlier detection techniques such as Mahalanobis distance and hat matrix leverage has led to the development of techniques which have been known in the statistical community for well over a decade. The literature on this subject is vast and growing. In this paper, we propose using the self-organizing map (SOM), an artificial intelligence technique, to detect multiple outliers in multidimensional datasets. SOM, which produces a topology-preserving mapping of the multidimensional data cloud onto a lower-dimensional plane that can be visualized, provides an easy way to detect multidimensional outliers in the data at their respective levels of leverage. The proposed SOM-based method not only identifies the multidimensional outliers but also provides information about the entire outlier neighbourhood. Being an artificial intelligence technique, SOM-based outlier detection is non-parametric and can be used to detect outliers in very large multidimensional datasets. The method is applied to detect outliers in varied types of simulated multivariate datasets, a benchmark dataset, and a real-life cheque processing dataset. The results show that SOM can be used effectively for multidimensional outlier detection.
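As a rough illustration of the idea in the summary, the sketch below trains a small SOM on toy two-dimensional data and scores each point by its distance to the best-matching map node. It is a minimal sketch under stated assumptions, not the procedure used in the paper: the summary does not specify a scoring rule, so the distance-to-node score, the 4 × 4 grid, the decay schedules, the toy data and all function names (`train_som`, `best_matching_unit`, `quantization_errors`, `make_demo_data`) are hypothetical choices made for illustration.

```python
import math
import random


def euclid(a, b):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def best_matching_unit(weights, x):
    """Return the grid position of the node whose weight vector is closest to x."""
    return min(weights, key=lambda node: euclid(weights[node], x))


def train_som(data, rows=4, cols=4, iters=2000, lr0=0.5, sigma0=2.0, seed=0):
    """Train a small rectangular SOM with the online (one sample per step) update."""
    rng = random.Random(seed)
    # Assumption: initialise every node from a randomly drawn data point.
    weights = {(r, c): list(rng.choice(data))
               for r in range(rows) for c in range(cols)}
    for t in range(iters):
        remaining = 1.0 - t / iters
        lr = lr0 * remaining                      # decays linearly from lr0 towards 0
        sigma = 0.5 + (sigma0 - 0.5) * remaining  # shrinks from sigma0 towards 0.5
        x = rng.choice(data)
        bmu = best_matching_unit(weights, x)
        for node, w in weights.items():
            grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
            h = math.exp(-grid_d2 / (2.0 * sigma ** 2))  # Gaussian neighbourhood
            for k in range(len(w)):
                w[k] += lr * h * (x[k] - w[k])
    return weights


def quantization_errors(weights, data):
    """Assumed outlier score: distance from each point to its best-matching node."""
    return [euclid(weights[best_matching_unit(weights, x)], x) for x in data]


def make_demo_data(seed=1):
    """Toy data (hypothetical): one Gaussian cluster plus two planted distant points."""
    rng = random.Random(seed)
    cluster = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(200)]
    planted = [(8.0, 8.0), (9.0, -7.0)]
    return cluster + planted


if __name__ == "__main__":
    data = make_demo_data()
    weights = train_som(data)
    scores = quantization_errors(weights, data)
    # Show the five points with the largest scores as outlier candidates.
    ranked = sorted(range(len(data)), key=lambda i: scores[i], reverse=True)
    for i in ranked[:5]:
        print(i, data[i], round(scores[i], 3))
```

In a setup like this one would expect the planted points to rank near the top, although the ranking depends on the seed, the grid size and the decay schedules. A visual inspection of the trained map, as the summary suggests, would complement a numeric score of this kind.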
Notes
1 Let X be an n × p matrix representing a sample of n points in \(\Re^{p}\), and let \(S=n^{-1}(X-\overline{X})^{T}(X-\overline{X})\) denote the sample covariance matrix. Then the shape of the sample X is given by \(S/|S|^{1/p}\).
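A small worked case may make the shape in the note concrete. The matrix below is a hypothetical example with p = 2, chosen only for easy arithmetic. Dividing by \(|S|^{1/p}\) rescales the covariance matrix to unit determinant, so the shape retains orientation and relative spread but not overall size:

\[
S=\begin{pmatrix}4 & 0\\ 0 & 1\end{pmatrix},\qquad
|S|=4,\qquad
|S|^{1/2}=2,\qquad
\frac{S}{|S|^{1/2}}=\begin{pmatrix}2 & 0\\ 0 & \tfrac{1}{2}\end{pmatrix},\qquad
\left|\frac{S}{|S|^{1/2}}\right|=2\cdot\tfrac{1}{2}=1 .
\]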
Cite this article
Nag, A.K., Mitra, A. & Mitra, S. Multiple outlier detection in multivariate data using self-organizing maps. Computational Statistics 20, 245–264 (2005). https://doi.org/10.1007/BF02789702
DOI: https://doi.org/10.1007/BF02789702