Record-level peculiarity-based data analysis and classifications

  • Regular paper
  • Knowledge and Information Systems

Abstract

Peculiarity-oriented mining is a data mining method consisting of two steps: peculiar data identification and peculiar data analysis. The peculiarity factor (PF) and the local peculiarity factor (LPF) are the key concepts used to describe the peculiarity of a data point in the identification step, and both can be studied at the attribute level and at the record level. In this paper, a new record-level LPF called the distance-based record LPF (D-record LPF) is proposed, defined as the sum of distances between a point and its nearest neighbors. The authors prove that D-record LPF can accurately characterize the probability density of a continuous m-dimensional distribution. This provides a theoretical basis for some existing distance-based anomaly detection techniques. More importantly, it also provides an effective method for describing the class-conditional probabilities in a Bayesian classifier, which makes it possible to apply D-record LPF to classification problems. A novel algorithm, the LPF-Bayes classifier, and its kernelized implementation are proposed; both are related to the Bayesian classifier. Experimental results on several benchmark datasets demonstrate that the proposed classifiers are competitive with strong classifiers such as AdaBoost, support vector machines and kernel Fisher discriminant.
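
As a rough illustration of the definition in the abstract, the following Python sketch computes a D-record LPF as the sum of Euclidean distances from a point to its k nearest neighbors. The function names, the choice of Euclidean distance, the parameter k, and the toy data are assumptions made for illustration and are not taken from the paper. The second function is a simplified stand-in, not the authors' LPF-Bayes classifier: it assigns a point to the class whose training records give the smallest D-record LPF, on the premise that a small sum of neighbor distances indicates a high class-conditional density. It ignores class priors and the kernelized variant.

```python
import numpy as np


def d_record_lpf(x, data, k):
    """Sum of Euclidean distances from x to its k nearest records in data.

    Assumes x is a query point that is not itself a row of data; when
    scoring a training record, remove it from data first, otherwise its
    zero self-distance is counted as a neighbor.
    """
    x = np.asarray(x, dtype=float)
    data = np.asarray(data, dtype=float)
    if not 1 <= k <= len(data):
        raise ValueError("k must be between 1 and the number of records")
    dists = np.linalg.norm(data - x, axis=1)
    # np.partition moves the k smallest distances into the first k slots.
    return float(np.partition(dists, k - 1)[:k].sum())


def lpf_nearest_class(x, class_data, k):
    """Hypothetical decision rule: pick the class with the smallest LPF.

    class_data maps a class label to an array of that class's records.
    Priors and the exact LPF-to-density relation are deliberately ignored.
    """
    scores = {label: d_record_lpf(x, pts, k) for label, pts in class_data.items()}
    return min(scores, key=scores.get)


if __name__ == "__main__":
    # Toy data, invented for this example.
    print(d_record_lpf([0, 0], [[3, 4], [0, 1], [10, 10]], k=2))  # 6.0 (distances 1 and 5)

    class_data = {
        "a": np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]]),
        "b": np.array([[5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]),
    }
    print(lpf_nearest_class([0.2, 0.2], class_data, k=2))  # a
```

A larger D-record LPF marks a point lying in a sparser region, which is why the same quantity can serve both as an outlier score and, per class, as a proxy for class-conditional density.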

References

  1. Abe N, Zadrozny B (2006) Outlier detection by active learning. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining, pp 504–509

  2. Angiulli F, Pizzuti C (2002) Fast outlier detection in high dimensional spaces. In: Proceedings of the 6th European conference on principles of data mining and knowledge discovery, pp 15–26

  3. Aouad LM, Le-Khac N-A, Kechadi TM (2010) Performance study of distributed apriori-like frequent itemsets mining. Knowl Inf Syst 23(1): 55–72

  4. Bhamidipati NL, Pal SK (2006) Comparing rank-inducing scoring systems. In: Proceedings of the 18th international conference on pattern recognition, pp 300–303

  5. Bishop C (1995) Neural networks for pattern recognition. Oxford University Press, Oxford

  6. Blumenstock A, Schweiggert F, Müller M, Lanquillon M (2009) Rule cubes for causal investigations. Knowl Inf Syst 18(1): 109–132

  7. Boley M, Grosskreutz H (2009) Approximating the number of frequent sets in dense data. Knowl Inf Syst 21(1): 65–89

  8. Breunig MM, Kriegel HP, Ng RT, Sander J (2000) LOF: identifying density-based local outliers. In: Proceedings of the 6th ACM SIGMOD international conference on management of data, pp 93–104

  9. Chandola V, Banerjee A, Kumar V (2009) Anomaly detection: a survey. ACM Comput Surv 41(3): 15:1–15:58

  10. Duda R, Hart P, Stork D (2000) Pattern classification, 2nd edn. Wiley, New York

  11. Eskin E, Arnold A, Prerau M, Portnoy L, Stolfo S (2002) A geometric framework for unsupervised anomaly detection: detecting intrusions in unlabeled data. In: Applications of data mining in computer security

  12. Farago A, Linder T, Lugosi G (1993) Fast nearest-neighbor search in dissimilarity spaces. IEEE Trans Pattern Anal Mach Intell 15: 957–962

  13. He QP, Wang J (2007) Fault detection using the k-nearest neighbor rule for semiconductor manufacturing processes. IEEE Trans Semicond Manuf 24: 345–354

  14. He ZY, Xu XF, Huang ZX, Deng SC (2004) A frequent pattern discovery method for outlier detection. In: Proceedings of the 5th international conference on web-age information management, LNCS 3129, pp 726–732

  15. Freund Y, Schapire R (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 55(1): 119–139

  16. Hald A (1999) On the history of maximum likelihood in relation to inverse probability and least squares. Stat Sci 14(2): 214–222

  17. Karmarkar N (1984) A new polynomial-time algorithm for linear programming. Combinatorica 4: 373–395

  18. Lazarevic A, Kumar V (2005) Feature bagging for outlier detection. In: Proceedings of the 11th ACM SIGKDD international conference on knowledge discovery and data mining, pp 157–166

  19. McCallum A, Nigam K, Ungar LH (2000) Efficient clustering of high-dimensional data sets with application to reference matching. In: Proceedings of the 6th ACM SIGKDD international conference on knowledge discovery and data mining, pp 169–178

  20. McGarry K (2005) A survey of interestingness measures for knowledge discovery. Knowl Eng Rev 20: 39–61

  21. Mika S, Rätsch G, Weston J, Schölkopf B, Müller KR (1999) Fisher discriminant analysis with kernels. In: Neural networks for signal processing IX, pp 41–48

  22. Ohshima M, Zhong N, Yao YY, Liu C (2007) Relational peculiarity oriented mining. Data Min Knowl Discov 15: 249–273

  23. Ohshima M, Zhong N, Yao YY, Murata S (2004) Peculiarity oriented analysis in multi-people tracking images. In: Advances in knowledge discovery and data mining, pp 508–518

  24. Ramaswamy S, Rastogi R, Shim K (2000) Efficient algorithms for mining outliers from large data sets. In: Proceedings of the 6th ACM SIGMOD international conference on management of data, pp 427–438

  25. Rätsch G (2001) Robust boosting via convex optimization. PhD thesis, University of Potsdam

  26. Rätsch G, Onoda T, Müller KR (2001) Soft margins for AdaBoost. Mach Learn 42: 283–320

  27. Schölkopf B, Smola A, Müller KR (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput 10: 1299–1319

  28. Shen B, Yao M, Wu ZH, Gao YJ (2010) Mining dynamic association rules with comments. Knowl Inf Syst 23(1): 73–98

  29. Silberschatz A, Tuzhilin A (1996) What makes patterns interesting in knowledge discovery systems. IEEE Trans Knowl Data Eng 8(6): 970–974

  30. Vapnik V (1998) Statistical learning theory. Wiley, New York

  31. Yao YY, Zhong N (2002) An analysis of peculiarity oriented data mining. In: Proceedings of the 2002 IEEE international conference on data mining workshop on the foundation of data mining and discovery, pp 185–188

  32. Yang J, Zhong N, Yao YY, Wang J (2008) Local peculiarity factor and its application in outlier detection. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, pp 776–784

  33. Yang J, Zhong N, Yao YY, Wang J (2009) Peculiarity analysis for classifications. In: Proceedings of the 2009 IEEE international conference on data mining, pp 607–616

  34. Zhang Y (1998) Solving large-scale linear programs by interior-point methods under the MATLAB environment. Optim Methods Softw 10: 1–31

  35. Zhang B, Srihari SN (2004) Fast k-nearest neighbor classification using cluster-based trees. IEEE Trans Pattern Anal Mach Intell 26(4): 525–528

  36. Zhang J, Wang H (2006) Detecting outlying subspaces for high-dimensional data: the new task, algorithms, and performance. Knowl Inf Syst 10: 333–355

  37. Zhong N, Liu C, Yao YY, Ohshima M, Huang MX, Huang JJ (2004) Relational peculiarity oriented data mining. In: Proceedings of the 2004 IEEE international conference on data mining, pp 575–578

  38. Zhong N, Yao YY, Ohshima M (2003) Peculiarity oriented multi-database mining. IEEE Trans Knowl Data Eng 15: 952–960

  39. Zhong N, Ohshima M, Ohsuga S (2001a) Peculiarity oriented mining and its application for knowledge discovery in amino-acid data. In: Advances in knowledge discovery and data mining, pp 260–269

  40. Zhong N, Yao YY, Ohshima M, Ohsuga S (2001b) Interestingness, peculiarity, and multi-database mining. In: Proceedings of the 2001 IEEE international conference on data mining, pp 566–573

Author information

Corresponding author

Correspondence to Jian Yang.

Additional information

This paper extends and improves our previous work published in the Proceedings of the 9th IEEE International Conference on Data Mining [33].

About this article

Cite this article

Yang, J., Zhong, N., Yao, Y. et al. Record-level peculiarity-based data analysis and classifications. Knowl Inf Syst 28, 149–173 (2011). https://doi.org/10.1007/s10115-010-0315-y

  • DOI: https://doi.org/10.1007/s10115-010-0315-y
