Abstract
With the development of intelligent technology, data collected from practical applications are often contaminated with noise (outlier samples or redundant samples). Such noise typically degrades the performance and robustness of classifiers. To address this problem, we propose an optimization method for Outlier samples and Redundant samples Detection (ORD). First, ORD applies maximum information compression to eliminate irrelevant feature information. Second, we propose an outlier-filtering method, WSCiForest, which uses a fusion strategy based on entropy weighting and group optimization theory to compute a distribution-based estimation score for each sample. Finally, ORD applies an improved Hausdorff distance to identify redundant samples effectively. Experimental results show that the proposed method effectively optimizes the data space.
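To make the three-stage pipeline concrete, the sketch below mirrors its structure with off-the-shelf stand-ins. It is a minimal illustration under stated assumptions, not the authors' implementation: the maximal information compression index of Mitra et al. stands in for the feature-filtering step, scikit-learn's plain IsolationForest stands in for the entropy-weighted WSCiForest, SciPy's directed Hausdorff distance stands in for the improved weighted Hausdorff distance, and all thresholds are illustrative.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff
from sklearn.ensemble import IsolationForest


def mici(x, y):
    """Maximal information compression index of a feature pair: the
    smallest eigenvalue of their 2x2 covariance matrix (near 0 when the
    two features are linearly dependent, i.e., one is redundant)."""
    return np.linalg.eigvalsh(np.cov(x, y))[0]


def drop_redundant_features(X, threshold=0.05):
    """Stage 1 stand-in: greedily drop any feature whose MICI with an
    already-kept feature falls below the threshold."""
    keep = list(range(X.shape[1]))
    i = 0
    while i < len(keep):
        j = i + 1
        while j < len(keep):
            if mici(X[:, keep[i]], X[:, keep[j]]) < threshold:
                keep.pop(j)  # carries almost the same information as keep[i]
            else:
                j += 1
        i += 1
    return X[:, keep], keep


def detect_outliers(X, contamination=0.05, seed=0):
    """Stage 2 stand-in: score samples with a plain isolation forest
    (in place of WSCiForest) and flag the lowest-scoring fraction."""
    labels = IsolationForest(contamination=contamination,
                             random_state=seed).fit_predict(X)
    return labels == -1  # fit_predict returns -1 for outliers, 1 for inliers


def prune_redundant_samples(X, threshold=0.5):
    """Stage 3 stand-in: keep a sample only if its directed Hausdorff
    distance to the already-kept set exceeds the threshold, so each
    retained sample represents its close neighbours."""
    kept = [0]
    for i in range(1, len(X)):
        d, _, _ = directed_hausdorff(X[i:i + 1], X[kept])
        if d > threshold:
            kept.append(i)
    return X[kept]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 6))
    X = np.hstack([X, 2.0 * X[:, :1]])  # append a linearly dependent copy
    X, kept = drop_redundant_features(X)
    X = X[~detect_outliers(X)]
    X = prune_redundant_samples(X)
    print(f"features kept: {kept}; samples remaining: {len(X)}")
```

Note the ordering of the stages: features are pruned first so that the distance computations in the later outlier and redundancy stages are not distorted by near-duplicate dimensions.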
Acknowledgements
We are greatly indebted to our colleagues at the Data and Knowledge Engineering Center, School of Information Technology and Electrical Engineering, The University of Queensland, Australia. We thank Prof. Xiaofang Zhou, Prof. Xue Li, Dr. Shuo Shang, and Dr. Kai Zheng for their valuable suggestions and many interesting discussions.
Funding
This work is partly supported by the National Natural Science Foundation of China (Grants 60473125 and 61701213), the Science Foundation of China University of Petroleum-Beijing at Karamay (Grant RCYJ2016B-03-001), the Karamay Science & Technology Research Project (No. 2020CGZH0009), the Natural Science Foundation of Fujian Province (Grants 2018J01546 and 2019J01748), and the Research Fund for the Educational Department of Fujian Province (Grant JAT190392).
Cite this article
Zheng, Y., Li, G., Li, Y. et al. An optimization approach with weighted SCiForest and weighted Hausdorff distance for noise data and redundant data. Appl Intell 52, 4909–4926 (2022). https://doi.org/10.1007/s10489-021-02685-9