Abstract
This paper introduces a new outlier detection approach and discusses and extends a new concept: class separation through variance. We show that even for balanced, concentric classes differing only in variance, accumulating information about the outlierness of points across multiple subspaces yields a ranking in which the classes naturally tend to separate. Exploiting this leads to a highly effective and efficient unsupervised class separation approach. Unlike typical outlier detection algorithms, this method can be applied with great success beyond the ‘rare classes’ case. The new algorithm, FASTOUT, introduces a number of novel features: it employs sampling of subspace points and is highly efficient; it handles subspaces of arbitrary size and converges to an optimal subspace size through the use of an objective function. In addition, two approaches are presented for automatically deriving the class of the data points from the ranking. Experiments show that on high-dimensional data FASTOUT typically outperforms other state-of-the-art outlier detection methods such as Feature Bagging, SOE1, LOF, ORCA and Robust Mahalanobis Distance, and competes even with the leading supervised classification methods for separating classes.
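The core idea of the abstract — that summing per-subspace outlierness scores over many randomly sampled subspaces produces a ranking that separates a high-variance class from a low-variance one — can be illustrated with a minimal sketch. This is not the authors' FASTOUT implementation; it is a generic subspace-ensemble ranking using the k-th-nearest-neighbour distance as the per-subspace outlierness score, with all parameter names (`n_subspaces`, `subspace_dim`, `k`) chosen here for illustration only.

```python
import numpy as np

def subspace_outlier_ranking(X, n_subspaces=50, subspace_dim=3, k=5, seed=0):
    """Rank points by outlierness accumulated over random subspaces.

    Illustrative sketch of subspace-ensemble outlier ranking (not the
    FASTOUT algorithm itself): in each randomly sampled subspace, score
    every point by its distance to its k-th nearest neighbour, then sum
    the scores across subspaces and sort.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    scores = np.zeros(n)
    for _ in range(n_subspaces):
        # sample a random axis-parallel subspace
        dims = rng.choice(d, size=min(subspace_dim, d), replace=False)
        S = X[:, dims]
        # pairwise Euclidean distances within the subspace
        diff = S[:, None, :] - S[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))
        # k-th nearest neighbour distance (column 0 is the self-distance)
        knn = np.sort(dist, axis=1)[:, k]
        scores += knn
    return np.argsort(-scores)  # point indices, most outlying first
```

On two balanced, concentric Gaussian classes differing only in variance — the setting the abstract describes — the top of this ranking is dominated by the high-variance class, so thresholding the ranking separates the classes without labels.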
References
ACM (1999) ACM KDD cup ’99 results. http://www.sigkdd.org/kddcup/
Aggarwal CC, Yu PS (2001) Outlier detection for high dimensional data. In: Proceedings of ACM SIGMOD international conference on management of data, pp 37–46
Aggarwal CC, Yu PS (2005) An efficient and effective algorithm for high-dimensional outlier detection. VLDB J 14(2): 211–221
Ali ML, Rueda L, Herrera M (2006) On the performance of Chernoff-distance-based linear dimensionality reduction techniques. Adv Artif Intell 4013: 467–478
Bay SD, Schwabacher M (2003) Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In: Proceedings of SIGKDD
Bellman R (1961) Adaptive control processes: a guided tour. Princeton University Press, Princeton
Beyer K, Goldstein J, Ramakrishnan R, Shaft U (1999) When is nearest neighbor meaningful? In: International conference on database theory, pp 217–235
Blake CL, Merz CJ (1998) UCI repository of machine learning databases. http://archive.ics.uci.edu/ml/
Breunig MM, Kriegel H-P, Ng RT, Sander J (2000) LOF: Identifying density-based local outliers. In: Proceedings of SIGMOD conference, pp 93–104
Chandola V, Banerjee A, Kumar V (2009) Anomaly detection: a survey. ACM Comput Surv 41(3)
Donoho D (2000) High-dimensional data analysis: the curses and blessings of dimensionality. In: Proceedings of the American Mathematical Society conference on mathematical challenges of the 21st century
Dubhashi DP, Panconesi A (2009) Concentration of measure for the analysis of randomised algorithms. Cambridge University Press, Cambridge
Eggermont J, Kok JN, Kosters WA (2004) Genetic programming for data classification: partitioning the search space. In: Proceedings of the 2004 symposium on applied computing (ACM SAC’04). ACM, pp 1001–1005
Fan H, Zaiane OR, Foss A, Wu J (2009) Resolution-based outlier factor: detecting the top-n most outlying data points in engineering data. Knowl Inf Syst 19(1): 31–51
Foss A, Zaïane OR (2002) A parameterless method for efficiently discovering clusters of arbitrary shape in large datasets. In: Proceedings of the IEEE international conference on data mining (ICDM’02), pp 179–186
Foss A, Zaïane OR, Zilles S (2009) Unsupervised class separation of multivariate data through cumulative variance-based ranking. In: Proceedings of the IEEE international conference on data mining (ICDM’09), pp 139–148
Harmeling S, Dornhege G, Tax D, Meinecke F, Müller K-R (2006) From outliers to prototypes: ordering data. Neurocomputing 69: 1608–1618
He Z, Xu X, Deng S (2005) A unified subspace outlier ensemble framework for outlier detection in high dimensional spaces. In: Proceedings of the 6th international conference, WAIM 2005, pp 632–637
Hido S, Tsuboi Y, Kashima H, Sugiyama M, Kanamori T (2009) Statistical outlier detection using direct density ratio estimation. Knowl Inf Syst (online first)
Jiang X, Zhu X (2009) vEye: behavioral footprinting for self-propagating worm detection and profiling. Knowl Inf Syst 18(2): 231–262
Jiang Y, Zhou Z-H (2004) Editing training data for kNN classifiers with neural network ensemble. In: Lecture Notes in Computer Science, vol 3173
Kim H, Park SH (2004) Data reduction in support vector machines by a kernelized ionic interaction model. In: Proceedings of SDM
Knorr EM, Ng RT (1999) Finding intensional knowledge of distance-based outliers. In: Proceedings of VLDB conference, pp 211–222
Kurgan LA, Cios KJ, Tadeusiewicz R, Ogiela MR, Goodenday LS (2001) Knowledge discovery approach to automated cardiac SPECT diagnosis. Artif Intell Med 23: 149
Lazarevic A, Kumar V (2005) Feature bagging for outlier detection. In: Proceedings of ACM SIGKDD, pp 157–166
Ledoux M (2001) The concentration of measure phenomenon. Mathematical Surveys and Monographs, vol 89. AMS, Providence
Li X, Ye N (2005) A supervised clustering algorithm for computer intrusion detection. Knowl Inf Syst 8(4): 498–509
Milman V (1988) The heritage of P. Lévy in geometrical functional analysis. Astérisque 157: 273–301
Petrovskiy MI (2003) Outlier detection algorithms in data mining systems. Program Comput Softw 29(4): 228–237
Petrushin, VA, Khan, L (eds) (2007) Multimedia data mining and knowledge discovery. Springer, Berlin
Rousseeuw PJ, Van Driessen K (1999) A fast algorithm for the minimum covariance determinant estimator. Technometrics 41(3): 212–223
Wang H (2006) Nearest neighbors by neighborhood counting. IEEE Trans Pattern Anal Mach Intell 28(6): 942–953
Zadrozny B, Elkan C (2002) Transforming classifier scores into accurate multiclass probability estimates. In: Proceedings of KDD 02
Zafra A, Ventura S (2007) Multi-objective genetic programming for multiple instance learning. In: Lecture Notes in Computer Science, Machine Learning: ECML’07
Zhang J, Wang H (2006) Detecting outlying subspaces for high-dimensional data: the new task, algorithms, and performance. Knowl Inf Syst 10(3): 333–355
Zhou Z-H, Li M (2009) Semi-supervised learning by disagreement. Knowl Inf Syst (online first)
Cite this article
Foss, A., Zaïane, O.R. Class separation through variance: a new application of outlier detection. Knowl Inf Syst 29, 565–596 (2011). https://doi.org/10.1007/s10115-010-0347-3