Abstract
With the development of intelligent technology, data collected from practical applications are often contaminated with noise (outlier samples or redundant samples). Such noise typically degrades the performance and robustness of classifiers. To address this problem, we propose an optimization method for Outlier samples and Redundant samples Detection (ORD). First, ORD applies maximum information compression to eliminate irrelevant feature information. Second, we propose an outlier-filtering method, WSCiForest, which uses a fusion strategy based on entropy weighting and group optimization theory to compute a distribution-based estimation score for each sample. Finally, ORD applies an improved Hausdorff distance to identify redundant samples effectively. Experimental results show that the proposed method effectively optimizes the data space.
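To make the three-stage pipeline concrete, the sketch below mirrors its structure with off-the-shelf stand-ins. It is a minimal illustration under stated assumptions, not the authors' implementation: the maximal information compression index of Mitra et al. stands in for the feature-filtering step, scikit-learn's plain IsolationForest stands in for the entropy-weighted WSCiForest, SciPy's directed Hausdorff distance stands in for the improved weighted Hausdorff distance, and all thresholds are illustrative.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff
from sklearn.ensemble import IsolationForest


def mici(x, y):
    """Maximal information compression index of a feature pair: the
    smallest eigenvalue of their 2x2 covariance matrix (near 0 when the
    two features are linearly dependent, i.e., one is redundant)."""
    return np.linalg.eigvalsh(np.cov(x, y))[0]


def drop_redundant_features(X, threshold=0.05):
    """Stage 1 stand-in: greedily drop any feature whose MICI with an
    already-kept feature falls below the threshold."""
    keep = list(range(X.shape[1]))
    i = 0
    while i < len(keep):
        j = i + 1
        while j < len(keep):
            if mici(X[:, keep[i]], X[:, keep[j]]) < threshold:
                keep.pop(j)  # carries almost the same information as keep[i]
            else:
                j += 1
        i += 1
    return X[:, keep], keep


def detect_outliers(X, contamination=0.05, seed=0):
    """Stage 2 stand-in: score samples with a plain isolation forest
    (in place of WSCiForest) and flag the lowest-scoring fraction."""
    labels = IsolationForest(contamination=contamination,
                             random_state=seed).fit_predict(X)
    return labels == -1  # fit_predict returns -1 for outliers, 1 for inliers


def prune_redundant_samples(X, threshold=0.5):
    """Stage 3 stand-in: keep a sample only if its directed Hausdorff
    distance to the already-kept set exceeds the threshold, so each
    retained sample represents its close neighbours."""
    kept = [0]
    for i in range(1, len(X)):
        d, _, _ = directed_hausdorff(X[i:i + 1], X[kept])
        if d > threshold:
            kept.append(i)
    return X[kept]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 6))
    X = np.hstack([X, 2.0 * X[:, :1]])  # append a linearly dependent copy
    X, kept = drop_redundant_features(X)
    X = X[~detect_outliers(X)]
    X = prune_redundant_samples(X)
    print(f"features kept: {kept}; samples remaining: {len(X)}")
```

Note the ordering of the stages: features are pruned first so that the distance computations in the later outlier and redundancy stages are not distorted by near-duplicate dimensions.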
Acknowledgements
We are greatly indebted to our colleagues at the Data and Knowledge Engineering Center, School of Information Technology and Electrical Engineering, The University of Queensland, Australia. We thank Prof. Xiaofang Zhou, Prof. Xue Li, Dr. Shuo Shang, and Dr. Kai Zheng for their valuable suggestions and many interesting discussions.
Funding
This work is partly supported by the National Natural Science Foundation of China (Grants 60473125 and 61701213), the Science Foundation of China University of Petroleum-Beijing at Karamay (Grant RCYJ2016B-03-001), the Karamay Science & Technology Research Project (No. 2020CGZH0009), the Natural Science Foundation of Fujian Province (Grants 2018J01546 and 2019J01748), and the Research Fund for the Educational Department of Fujian Province (Grant JAT190392).
Cite this article
Zheng, Y., Li, G., Li, Y. et al. An optimization approach with weighted SCiForest and weighted Hausdorff distance for noise data and redundant data. Appl Intell 52, 4909–4926 (2022). https://doi.org/10.1007/s10489-021-02685-9