An optimization approach with weighted SCiForest and weighted Hausdorff distance for noise data and redundant data

Abstract

With the development of intelligent technology, data collected in practical applications often contain noise, in the form of outlier samples or redundant samples. Such noise typically degrades the performance and robustness of classifiers. To address this problem, this paper proposes an optimization method for Outlier samples and Redundant samples Detection (ORD). First, we leverage maximum information compression to eliminate irrelevant feature information. Second, we propose an outlier-filtering component, WSCiForest, which uses a fusion strategy based on entropy weighting and group optimization theory to compute a distribution-estimation score for each sample. Finally, ORD applies an improved Hausdorff distance to identify redundant samples. Experimental results show that the proposed method effectively optimizes the data space.
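
As an illustration of the first step, the following is a minimal sketch of feature pruning with the maximum information compression index (the feature-similarity measure of Mitra, Murthy and Pal, which the abstract's feature-filtering step presumably builds on). The greedy loop and the cutoff `threshold` are illustrative assumptions, not the paper's exact procedure.

```python
# Hypothetical sketch: prune near-duplicate features with the maximum
# information compression index (MICI). The cutoff is an assumed
# parameter, not a value from the paper.
import numpy as np

def mici(x, y):
    """Smaller eigenvalue of the 2x2 covariance matrix of (x, y).

    MICI is zero exactly when the two features are linearly dependent,
    so near-zero values signal that one feature is redundant."""
    return float(np.linalg.eigvalsh(np.cov(x, y))[0])

def prune_features(X, threshold=0.05):
    """Greedily keep one feature from each pair whose MICI falls below
    `threshold`; returns the indices of the retained columns."""
    kept = list(range(X.shape[1]))
    i = 0
    while i < len(kept):
        j = i + 1
        while j < len(kept):
            if mici(X[:, kept[i]], X[:, kept[j]]) < threshold:
                kept.pop(j)          # column j adds little new information
            else:
                j += 1
        i += 1
    return kept
```

Because MICI depends on feature scale, standardizing the columns of X before calling `prune_features` is usually advisable.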

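For the outlier-scoring step, the sketch below fuses the scores of several isolation-based detectors with entropy weights. SCiForest has no widely available implementation, so scikit-learn's IsolationForest stands in for the paper's weighted SCiForest trees, and the classical entropy-weight method stands in for the authors' entropy-weighted fusion strategy; all names and parameters here are assumptions.

```python
# A minimal sketch of entropy-weighted outlier-score fusion, using
# IsolationForest as a stand-in for SCiForest sub-detectors.
import numpy as np
from sklearn.ensemble import IsolationForest

def entropy_weighted_scores(X, n_detectors=10, seed=0):
    rng = np.random.default_rng(seed)
    scores = np.empty((X.shape[0], n_detectors))
    for j in range(n_detectors):
        det = IsolationForest(n_estimators=50,
                              random_state=int(rng.integers(1 << 30)))
        scores[:, j] = -det.fit(X).score_samples(X)    # higher = more anomalous
    p = scores - scores.min(axis=0) + 1e-12            # shift each column positive
    p /= p.sum(axis=0)                                 # per-detector score distribution
    e = -(p * np.log(p)).sum(axis=0) / np.log(len(X))  # normalised entropy per detector
    w = (1.0 - e) + 1e-12
    w /= w.sum()                                       # low-entropy detectors weigh more
    return scores @ w                                  # fused anomaly score per sample
```

The rationale is the usual one for entropy weighting in multi-attribute decision making: a detector whose scores are nearly uniform across samples discriminates poorly and therefore receives a small weight.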
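
For the redundancy step, a hedged sketch of screening samples with a weighted, averaged variant of the directed Hausdorff distance. The averaging form, the per-point weights, and the tolerance `eps` are illustrative assumptions; the paper's improved Hausdorff distance may be constructed differently.

```python
# Hypothetical redundancy screen: a sample is flagged when the rest of
# the set already represents it under a weighted Hausdorff-style distance.
import numpy as np
from scipy.spatial.distance import cdist

def weighted_hausdorff(A, B, w=None):
    """Weighted average over points of A of the distance to the nearest
    point of B (a directed, weighted Hausdorff variant)."""
    w = np.ones(len(A)) if w is None else np.asarray(w, float)
    nearest = cdist(A, B).min(axis=1)
    return float((w * nearest).sum() / w.sum())

def redundant_mask(X, eps=1e-3):
    """Flag sample i as redundant when its weighted Hausdorff distance
    to X without i falls below the assumed tolerance `eps`."""
    mask = np.zeros(len(X), dtype=bool)
    for i in range(len(X)):
        rest = np.delete(X, i, axis=0)
        mask[i] = weighted_hausdorff(X[i:i + 1], rest) < eps
    return mask
```

On labelled data the test would be applied within each class, so that a sample is flagged only when same-class neighbours already cover it.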


Acknowledgements

We are greatly indebted to colleagues at the Data and Knowledge Engineering Center, School of Information Technology and Electrical Engineering, the University of Queensland, Australia. We thank Prof. Xiaofang Zhou, Prof. Xue Li, Dr. Shuo Shang and Dr. Kai Zheng for their helpful suggestions and many interesting discussions.

Funding

This work is partly supported by the National Natural Science Foundation of China (Nos. 60473125 and 61701213), the Science Foundation of China University of Petroleum-Beijing at Karamay (No. RCYJ2016B-03-001), the Karamay Science & Technology Research Project (No. 2020CGZH0009), the Natural Science Foundation of Fujian Province (Nos. 2018J01546 and 2019J01748), and the Research Fund for the Educational Department of Fujian Province (No. JAT190392).

Author information

Corresponding author

Correspondence to Yifeng Zheng.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zheng, Y., Li, G., Li, Y. et al. An optimization approach with weighted SCiForest and weighted Hausdorff distance for noise data and redundant data. Appl Intell 52, 4909–4926 (2022). https://doi.org/10.1007/s10489-021-02685-9
