
SVDD-based outlier detection on uncertain data

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Outlier detection is an important problem that has been studied across diverse research areas and application domains. Most existing methods assume that each example can be categorized exactly as either normal or an outlier. In many real-life applications, however, data are inherently uncertain because of various errors or partial completeness. This uncertainty makes outliers far more difficult to detect than in clearly separable data. The key challenge in handling uncertain data is to reduce its impact on the learned distinctive classifier. This paper proposes a new approach based on support vector data description (SVDD) for detecting outliers in uncertain data. The proposed approach operates in two steps. In the first step, a pseudo-training set is generated by assigning each input example a confidence score that indicates how likely the example is to belong to the normal class. In the second step, the confidence scores are incorporated into the SVDD training phase to construct a global distinctive classifier for outlier detection. In this phase, the examples with the lowest confidence scores contribute less to the construction of the decision boundary. Experiments show that the proposed approach outperforms state-of-the-art outlier detection techniques.
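The two steps described in the abstract can be illustrated with a small sketch. The abstract does not say how the confidence scores are computed or how they enter the optimisation, so everything below is an assumption made for illustration: the distance-to-centroid scoring rule, the linear kernel, the per-example upper bound `C * conf[i]` on each dual variable, and the Frank-Wolfe solver are not taken from the paper, and all function names are hypothetical.

```python
import math


def confidence_scores(points):
    """Step 1 (hypothetical heuristic, not the paper's rule).

    Returns a score in (0, 1] per point; higher means "more likely normal".
    Points far from the data centroid receive lower scores.
    """
    dim = len(points[0])
    centroid = [sum(p[k] for p in points) / len(points) for k in range(dim)]
    dists = [math.dist(p, centroid) for p in points]
    scale = sum(dists) / len(dists) or 1.0  # guard against identical points
    return [math.exp(-d / scale) for d in dists]


def sq_dists(points, center):
    """Squared Euclidean distance from every point to the center."""
    return [sum((x - c) ** 2 for x, c in zip(p, center)) for p in points]


def greedy_vertex(d2, caps):
    """Maximise sum_i s_i * d2_i subject to sum(s) = 1 and 0 <= s_i <= caps_i.

    Mass goes to the farthest points first. The last point that receives
    mass lies on the sphere, so its index is returned as `boundary`.
    """
    s, remaining, boundary = [0.0] * len(d2), 1.0, None
    for i in sorted(range(len(d2)), key=lambda j: -d2[j]):
        s[i] = min(caps[i], remaining)
        remaining -= s[i]
        boundary = i
        if remaining <= 1e-12:
            break
    return s, boundary


def fit_weighted_svdd(points, conf, C=5.0, iters=500):
    """Step 2 (assumed formulation): linear-kernel SVDD dual in which the
    dual variable of example i is capped at C * conf[i], so low-confidence
    examples can pull the center only a little. Solved with Frank-Wolfe.
    """
    caps = [C * m for m in conf]
    total = sum(caps)
    if total < 1.0:
        raise ValueError("C is too small: the caps C * conf must sum to at least 1")
    alpha = [c / total for c in caps]  # feasible starting point
    dim = len(points[0])

    def center_of(a):
        return [sum(ai * p[k] for ai, p in zip(a, points)) for k in range(dim)]

    for t in range(iters):
        s, _ = greedy_vertex(sq_dists(points, center_of(alpha)), caps)
        step = 2.0 / (t + 2.0)
        alpha = [(1 - step) * a + step * si for a, si in zip(alpha, s)]

    center = center_of(alpha)
    d2 = sq_dists(points, center)
    _, boundary = greedy_vertex(d2, caps)
    return center, d2[boundary]  # center and squared radius


def is_outlier(point, center, r2):
    """A point is flagged when it lies strictly outside the learned sphere."""
    return sum((x - c) ** 2 for x, c in zip(point, center)) > r2 + 1e-9


# Toy data: a 3x3 grid of normal points plus one distant point.
points = [(x, y) for x in (-1.0, 0.0, 1.0) for y in (-1.0, 0.0, 1.0)] + [(8.0, 8.0)]
conf = confidence_scores(points)
center, r2 = fit_weighted_svdd(points, conf)
flags = [is_outlier(p, center, r2) for p in points]
print(flags)
```

On this toy data the distant point (8, 8) receives the lowest confidence score, so its cap is small and it barely moves the center; it is the only point left outside the sphere. A kernelised version would replace the explicit center with kernel evaluations, which this sketch omits for brevity.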

Author information

Corresponding author

Correspondence to Zhifeng Hao.

About this article

Cite this article

Liu, B., Xiao, Y., Cao, L. et al. SVDD-based outlier detection on uncertain data. Knowl Inf Syst 34, 597–618 (2013). https://doi.org/10.1007/s10115-012-0484-y

  • DOI: https://doi.org/10.1007/s10115-012-0484-y
