Class separation through variance: a new application of outlier detection

  • Regular Paper
  • Published in: Knowledge and Information Systems

Abstract

This paper introduces a new outlier detection approach and discusses and extends a new concept, class separation through variance. We show that even for balanced and concentric classes differing only in variance, accumulating information about the outlierness of points in multiple subspaces leads to a ranking in which the classes naturally tend to separate. Exploiting this leads to a highly effective and efficient unsupervised class separation approach. Unlike typical outlier detection algorithms, this method can be applied beyond the ‘rare classes’ case with great success. The new algorithm FASTOUT introduces a number of novel features. It employs sampling of subspace points and is highly efficient. It handles arbitrarily sized subspaces and converges to an optimal subspace size through the use of an objective function. In addition, two approaches are presented for automatically deriving the class of the data points from the ranking. Experiments show that FASTOUT typically outperforms other state-of-the-art outlier detection methods, such as Feature Bagging, SOE1, LOF, ORCA and Robust Mahalanobis Distance, on high-dimensional data, and competes even with the leading supervised classification methods for separating classes.
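The core idea in the abstract — that accumulating outlierness scores over many random subspaces separates balanced, concentric classes that differ only in variance — can be illustrated with a minimal sketch. This is not the authors' FASTOUT algorithm: the distance-to-subspace-centroid outlierness proxy, the dimensionality, and the subspace count below are all illustrative assumptions.

```python
# Minimal sketch (NOT the authors' FASTOUT): two balanced, zero-mean Gaussian
# classes differing only in variance become separable by ranking points on
# outlierness accumulated across many random subspaces.
import random
import math

random.seed(0)
DIM = 20            # full dimensionality (assumed for illustration)
N_PER_CLASS = 100
SUBSPACES = 200     # number of random subspaces sampled
SUBDIM = 3          # fixed subspace size (FASTOUT tunes this via an objective)

def gauss_point(sigma):
    return [random.gauss(0.0, sigma) for _ in range(DIM)]

# Class 0: low variance; class 1: high variance; both centred at the origin.
data = [(gauss_point(1.0), 0) for _ in range(N_PER_CLASS)] + \
       [(gauss_point(3.0), 1) for _ in range(N_PER_CLASS)]

scores = [0.0] * len(data)
for _ in range(SUBSPACES):
    dims = random.sample(range(DIM), SUBDIM)
    # Outlierness proxy in this subspace: distance from the subspace centroid.
    centroid = [sum(p[d] for p, _ in data) / len(data) for d in dims]
    for i, (p, _) in enumerate(data):
        scores[i] += math.sqrt(sum((p[d] - c) ** 2
                                   for d, c in zip(dims, centroid)))

# Rank by accumulated outlierness: the high-variance class should dominate
# the top of the ranking, even though the classes are balanced and concentric.
ranked = sorted(range(len(data)), key=lambda i: -scores[i])
top_half = ranked[: len(data) // 2]
frac_high_var = sum(data[i][1] for i in top_half) / len(top_half)
print(f"high-variance fraction in top half of ranking: {frac_high_var:.2f}")
```

With these settings the top half of the ranking is dominated by the high-variance class, which is the "rare classes" assumption being dropped: here neither class is rare, yet the cumulative ranking still separates them.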


References

  1. ACM (1999) ACM KDD cup ’99 results. http://www.sigkdd.org/kddcup/

  2. Aggarwal CC, Yu PS (2001) Outlier detection for high dimensional data. In: Proceedings of ACM SIGMOD international conference on management of data, pp 37–46

  3. Aggarwal CC, Yu PS (2005) An efficient and effective algorithm for high-dimensional outlier detection. VLDB J 14(2): 211–221

  4. Ali ML, Rueda L, Herrera M (2006) On the performance of Chernoff-distance-based linear dimensionality reduction techniques. Adv Artif Intell 4013: 467–478

  5. Bay SD, Schwabacher M (2003) Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In: Proceedings of SIGKDD

  6. Bellman R (1961) Adaptive control processes: a guided tour. Princeton University Press, Princeton

  7. Beyer K, Goldstein J, Ramakrishnan R, Shaft U (1999) When is nearest neighbor meaningful? In: International conference on database theory, pp 217–235

  8. Blake CL, Merz CJ (1998) UCI repository of machine learning databases. http://archive.ics.uci.edu/ml/

  9. Breunig MM, Kriegel H-P, Ng RT, Sander J (2000) LOF: Identifying density-based local outliers. In: Proceedings of SIGMOD conference, pp 93–104

  10. Chandola V, Banerjee A, Kumar V (2009) Anomaly detection: a survey. ACM Comput Surv 41(3)

  11. Donoho D (2000) High-dimensional data analysis: the curses and blessings of dimensionality. In: Proceedings of the American Mathematical Society conference on mathematical challenges of the 21st century

  12. Dubhashi DP, Panconesi A (2009) Concentration of measure for the analysis of randomised algorithms. Cambridge University Press, Cambridge

  13. Eggermont J, Kok JN, Kosters WA (2004) Genetic programming for data classification: partitioning the search space. In: Proceedings of the 2004 symposium on applied computing (ACM SAC’04). ACM, pp 1001–1005

  14. Fan H, Zaiane OR, Foss A, Wu J (2009) Resolution-based outlier factor: detecting the top-n most outlying data points in engineering data. Knowl Inf Syst 19(1): 31–51

  15. Foss A, Zaïane OR (2002) A parameterless method for efficiently discovering clusters of arbitrary shape in large datasets. In: Proceedings of the IEEE international conference on data mining (ICDM’02), pp 179–186

  16. Foss A, Zaïane OR, Zilles S (2009) Unsupervised class separation of multivariate data through cumulative variance-based ranking. In: Proceedings of the IEEE international conference on data mining (ICDM’09), pp 139–148

  17. Harmeling S, Dornhege G, Tax D, Meinecke F, Müller K-R (2006) From outliers to prototypes: ordering data. Neurocomputing 69: 1608–1618

  18. He Z, Xu X, Deng S (2005) A unified subspace outlier ensemble framework for outlier detection in high dimensional spaces. In: Proceedings of the 6th international conference, WAIM 2005, pp 632–637

  19. Hido S, Tsuboi Y, Kashima H, Sugiyama M, Kanamori T (2009) Statistical outlier detection using direct density ratio estimation. Knowl Inf Syst (online first)

  20. Jiang X, Zhu X (2009) vEye: behavioral footprinting for self-propagating worm detection and profiling. Knowl Inf Syst 18(2): 231–262

  21. Jiang Y, Zhou Z-H (2004) Editing training data for kNN classifiers with neural network ensemble. In: Lecture Notes in Computer Science, vol 3173

  22. Kim H, Park SH (2004) Data Reduction in support vector machines by a kernelized ionic interaction model. In: Proceedings of SDM

  23. Knorr EM, Ng RT (1999) Finding intensional knowledge of distance-based outliers. In: Proceedings of VLDB conference, pp 211–222

  24. Kurgan LA, Cios KJ, Tadeusiewicz R, Ogiela MR, Goodenday LS (2001) Knowledge discovery approach to automated cardiac SPECT diagnosis. Artif Intell Med 23: 149

  25. Lazarevic A, Kumar V (2005) Feature bagging for outlier detection. In: Proceedings of ACM SIGKDD, pp 157–166

  26. Ledoux M (2001) The concentration of measure phenomenon. Mathematical Surveys and Monographs, vol 89. AMS

  27. Li X, Ye N (2005) A supervised clustering algorithm for computer intrusion detection. Knowl Inf Syst 8(4): 498–509

  28. Milman V (1988) The heritage of P. Lévy in geometrical functional analysis. Asterisque 157: 273–301

  29. Petrovskiy MI (2003) Outlier detection algorithms in data mining systems. Program Comput Softw 29(4): 228–237

  30. Petrushin VA, Khan L (eds) (2007) Multimedia data mining and knowledge discovery. Springer, Berlin

  31. Rousseeuw PJ, Van Driessen K (1999) A fast algorithm for the minimum covariance determinant estimator. Technometrics 41(3): 212–223

  32. Wang H (2006) Nearest neighbors by neighborhood counting. IEEE Trans Pattern Anal Mach Intell 28(6): 942–953

  33. Zadrozny B, Elkan C (2002) Transforming classifier scores into accurate multiclass probability estimates. In: Proceedings of KDD 02

  34. Zafra A, Ventura S (2007) Multi-objective genetic programming for multiple instance learning. In: Lecture Notes in Computer Science, Machine Learning: ECML’07

  35. Zhang J, Wang H (2006) Detecting outlying subspaces for high-dimensional data: the new task, algorithms, and performance. Knowl Inf Syst 10(3): 333–355

  36. Zhou Z-H, Li M (2009) Semi-supervised learning by disagreement. Knowl Inf Syst (online first)

Author information

Correspondence to Osmar R. Zaïane.

About this article

Cite this article

Foss, A., Zaïane, O.R. Class separation through variance: a new application of outlier detection. Knowl Inf Syst 29, 565–596 (2011). https://doi.org/10.1007/s10115-010-0347-3
