Abstract
This paper introduces a new outlier detection approach and discusses and extends a new concept: class separation through variance. We show that even for balanced, concentric classes differing only in variance, accumulating information about the outlierness of points across multiple subspaces yields a ranking in which the classes naturally tend to separate. Exploiting this leads to a highly effective and efficient unsupervised class separation approach. Unlike typical outlier detection algorithms, this method can be applied with great success beyond the ‘rare classes’ case. The new algorithm, FASTOUT, introduces a number of novel features: it employs sampling of subspace points and is highly efficient; it handles subspaces of arbitrary size and converges to an optimal subspace size through the use of an objective function. In addition, two approaches are presented for automatically deriving the class of the data points from the ranking. Experiments show that on high-dimensional data FASTOUT typically outperforms other state-of-the-art outlier detection methods such as Feature Bagging, SOE1, LOF, ORCA and Robust Mahalanobis Distance, and competes even with the leading supervised classification methods for separating classes.
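The core idea of the abstract — that summing per-subspace outlierness scores over many randomly sampled subspaces produces a ranking that separates a high-variance class from a low-variance one — can be illustrated with a minimal sketch. This is not the authors' FASTOUT implementation; it is a generic subspace-ensemble ranking using the k-th-nearest-neighbour distance as the per-subspace outlierness score, with all parameter names (`n_subspaces`, `subspace_dim`, `k`) chosen here for illustration only.

```python
import numpy as np

def subspace_outlier_ranking(X, n_subspaces=50, subspace_dim=3, k=5, seed=0):
    """Rank points by outlierness accumulated over random subspaces.

    Illustrative sketch of subspace-ensemble outlier ranking (not the
    FASTOUT algorithm itself): in each randomly sampled subspace, score
    every point by its distance to its k-th nearest neighbour, then sum
    the scores across subspaces and sort.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    scores = np.zeros(n)
    for _ in range(n_subspaces):
        # sample a random axis-parallel subspace
        dims = rng.choice(d, size=min(subspace_dim, d), replace=False)
        S = X[:, dims]
        # pairwise Euclidean distances within the subspace
        diff = S[:, None, :] - S[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))
        # k-th nearest neighbour distance (column 0 is the self-distance)
        knn = np.sort(dist, axis=1)[:, k]
        scores += knn
    return np.argsort(-scores)  # point indices, most outlying first
```

On two balanced, concentric Gaussian classes differing only in variance — the setting the abstract describes — the top of this ranking is dominated by the high-variance class, so thresholding the ranking separates the classes without labels.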
References
ACM (1999) ACM KDD cup ’99 results. http://www.sigkdd.org/kddcup/
Aggarwal CC, Yu PS (2001) Outlier detection for high dimensional data. In: Proceedings of ACM SIGMOD international conference on management of data, pp 37–46
Aggarwal CC, Yu PS (2005) An efficient and effective algorithm for high-dimensional outlier detection. VLDB J 14(2): 211–221
Ali ML, Rueda L, Herrera M (2006) On the performance of Chernoff-distance-based linear dimensionality reduction techniques. Adv Artif Intell 4013: 467–478
Bay SD, Schwabacher M (2003) Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In: Proceedings of SIGKDD
Bellman R (1961) Adaptive control processes: a guided tour. Princeton University Press, Princeton
Beyer K, Goldstein J, Ramakrishnan R, Shaft U (1999) When is nearest neighbor meaningful? In: International conference on database theory, pp 217–235
Blake CL, Merz CJ (1998) UCI repository of machine learning databases. http://archive.ics.uci.edu/ml/
Breunig MM, Kriegel H-P, Ng RT, Sander J (2000) LOF: Identifying density-based local outliers. In: Proceedings of SIGMOD conference, pp 93–104
Chandola V, Banerjee A, Kumar V (2009) Anomaly detection: a survey. ACM Comput Surv 41(3)
Donoho D (2000) High-dimensional data analysis: the curses and blessings of dimensionality. In: Proceedings of the American Mathematical Society conference on mathematical challenges of the 21st century
Dubhashi DP, Panconesi A (2009) Concentration of measure for the analysis of randomised algorithms. Cambridge University Press, Cambridge
Eggermont J, Kok JN, Kosters WA (2004) Genetic programming for data classification: partitioning the search space. In: Proceedings of the 2004 symposium on applied computing (ACM SAC’04). ACM, pp 1001–1005
Fan H, Zaiane OR, Foss A, Wu J (2009) Resolution-based outlier factor: detecting the top-n most outlying data points in engineering data. Knowl Inf Syst 19(1): 31–51
Foss A, Zaïane OR (2002) A parameterless method for efficiently discovering clusters of arbitrary shape in large datasets. In: Proceedings of the IEEE international conference on data mining (ICDM’02), pp 179–186
Foss A, Zaïane OR, Zilles S (2009) Unsupervised class separation of multivariate data through cumulative variance-based ranking. In: Proceedings of the IEEE international conference on data mining (ICDM’09), pp 139–148
Harmeling S, Dornhege G, Tax D, Meinecke F, Müller K-R (2006) From outliers to prototypes: ordering data. Neurocomputing 69: 1608–1618
He Z, Xu X, Deng S (2005) A unified subspace outlier ensemble framework for outlier detection in high dimensional spaces. In: Proceedings of the 6th international conference, WAIM 2005, pp 632–637
Hido S, Tsuboi Y, Kashima H, Sugiyama M, Kanamori T (2009) Statistical outlier detection using direct density ratio estimation. Knowl Inf Syst (online first)
Jiang X, Zhu X (2009) vEye: behavioral footprinting for self-propagating worm detection and profiling. Knowl Inf Syst 18(2): 231–262
Jiang Y, Zhou Z-H (2004) Editing training data for kNN classifiers with neural network ensemble. In: Lecture Notes in Computer Science, vol 3173
Kim H, Park SH (2004) Data reduction in support vector machines by a kernelized ionic interaction model. In: Proceedings of SDM
Knorr EM, Ng RT (1999) Finding intensional knowledge of distance-based outliers. In: Proceedings of VLDB conference, pp 211–222
Kurgan LA, Cios KJ, Tadeusiewicz R, Ogiela MR, Goodenday LS (2001) Knowledge discovery approach to automated cardiac SPECT diagnosis. Artif Intell Med 23: 149
Lazarevic A, Kumar V (2005) Feature bagging for outlier detection. In: Proceedings of ACM SIGKDD, pp 157–166
Ledoux M (2001) The concentration of measure phenomenon. Mathematical Surveys and Monographs, vol 89. AMS, Providence
Li X, Ye N (2005) A supervised clustering algorithm for computer intrusion detection. Knowl Inf Syst 8(4): 498–509
Milman V (1988) The heritage of P. Lévy in geometrical functional analysis. Astérisque 157: 273–301
Petrovskiy MI (2003) Outlier detection algorithms in data mining systems. Program Comput Softw 29(4): 228–237
Petrushin, VA, Khan, L (eds) (2007) Multimedia data mining and knowledge discovery. Springer, Berlin
Rousseeuw PJ, Van Driessen K (1999) A fast algorithm for the minimum covariance determinant estimator. Technometrics 41(3): 212–223
Wang H (2006) Nearest neighbors by neighborhood counting. IEEE Trans Pattern Anal Mach Intell 28(6): 942–953
Zadrozny B, Elkan C (2002) Transforming classifier scores into accurate multiclass probability estimates. In: Proceedings of KDD 02
Zafra A, Ventura S (2007) Multi-objective genetic programming for multiple instance learning. In: Lecture Notes in Computer Science, Machine Learning: ECML’07
Zhang J, Wang H (2006) Detecting outlying subspaces for high-dimensional data: the new task, algorithms, and performance. Knowl Inf Syst 10(3): 333–355
Zhou Z-H, Li M (2009) Semi-supervised learning by disagreement. Knowl Inf Syst (online first)
Cite this article
Foss, A., Zaïane, O.R. Class separation through variance: a new application of outlier detection. Knowl Inf Syst 29, 565–596 (2011). https://doi.org/10.1007/s10115-010-0347-3