ANCES: A novel method to repair attribute noise in classification problems
Introduction
Data gathering, preparation and storage procedures in information systems are usually linked to imperfections in real-world applications [1]. Because of this, datasets usually contain errors or noise [2], [3], particularly medical image data [4], [5], [6], which play a major role in automated diagnosis and treatment planning based on imaging techniques [7], [8], [9]. In classification [10], [11], a model is built from labeled samples in order to predict the class of new, previously unobserved samples. Building classifiers from noisy data has several well-known disadvantages that have been studied in the specialized literature [2], [12]. First, the learning phase usually requires more time and samples to create the model. Furthermore, the resulting classifier will probably be less accurate and more complex, since errors in the data may be modeled.
Two different types of noise are found in classification datasets [2]: class and attribute noise. Class noise [13], [14] is produced when samples are incorrectly labeled. Many research works focus on its identification and treatment [3], [15]. They usually establish an association between samples with class noise and misclassifications performed by some classification system. In this way, the detection of samples with class noise is relatively simple. Once these samples are identified, the treatment of the errors affecting them is also uncomplicated, since their removal usually leads to improvements in classification performance [16]. The techniques responsible for performing such tasks are known as noise filters [17], [18] and constitute one of the most common preprocessing approaches when data are affected by class noise.
On the other hand, attribute noise [19], [20] is related to the presence of errors in the attribute values of the samples in a dataset. Even though there are works showing that attribute noise negatively impacts classifier performance [12], [21], its identification and treatment have traditionally been overlooked in the literature due to their higher complexity. There have been some attempts to translate the principles of class noise treatment via noise filtering to the attribute noise problem [2], [22]. However, they have shown that removing samples containing attribute noise is counterproductive in some cases, as these samples still contain valuable information in other attributes that can help to build a more accurate classifier. For this reason, it is interesting to investigate other alternatives to improve classification performance when data are affected by attribute noise.
This research focuses on the detection and treatment of attribute values containing noise. Since the nature of attribute noise differs from that of class noise, its detection must move away from the classic association between noise and mislabeled data. Thus, the proposal of this paper is based on analyzing the neighborhood of each sample to assign an error score to each one of its attribute values. The more an attribute value of a sample differs from those of its nearest neighboring samples, the more likely it is to contain noise. On the other hand, the treatment of such noisy attribute values is not directly related to the removal of the samples. These values are passed through an optimization process in which they are progressively corrected using optimization meta-heuristics [23] with the goal of improving classification performance. Additionally, the approach presented is also inspired by important postulates in noise preprocessing, such as the iterative detection and treatment of noise in datasets [24], in such a way that the errors corrected in one iteration do not negatively influence later iterations. The proposed method is called Attribute Noise Corrector based on Error Scores (ANCES).
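The neighborhood-based scoring idea can be sketched in a few lines of Python. The exact score definition used by ANCES is not reproduced in this excerpt; the deviation-from-neighborhood measure below is an illustrative stand-in, not the paper's formulation:

```python
import numpy as np

def attribute_error_scores(X, n_neighbors=5):
    """Score each attribute value by how much it deviates from the values
    of the same attribute among the sample's nearest neighbors.  Higher
    scores suggest a noisier value (illustrative definition only)."""
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    # Pairwise Euclidean distances between samples.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor
    scores = np.zeros((n, m))
    for i in range(n):
        neighbors = np.argsort(dist[i])[:n_neighbors]
        # Deviation of each attribute value from the neighborhood mean,
        # normalized by the neighborhood spread.
        mu = X[neighbors].mean(axis=0)
        sd = X[neighbors].std(axis=0) + 1e-12
        scores[i] = np.abs(X[i] - mu) / sd
    return scores
```

With this kind of score, an attribute value that is far from its neighborhood (e.g., a single corrupted cell in an otherwise tight cluster) receives a score orders of magnitude above the rest, making it an obvious candidate for treatment.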
A thorough experimental study has been developed comparing both the absence of preprocessing and 15 representative noise preprocessing methods against ANCES. All of them have been used to preprocess 30 real-world datasets of a different nature, in which several attribute noise levels (from 5% to 40%, in increments of 5%) are introduced in a supervised manner [2]. The preprocessed datasets have then been employed to create classifiers with a method of well-known behavior against noise, the Nearest Neighbor (NN) rule [25], which is considered very noise sensitive and therefore makes it easier to observe the effect of each preprocessing technique on the data. Its test performances in the datasets preprocessed with ANCES and with the rest of the approaches have been compared using appropriate statistical tests [26] in order to check the significance of the differences found. A webpage with complementary material to this paper is available at https://joseasaezm.github.io/ances/, including the datasets used and the results obtained for each preprocessing method.
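The experimental setup described above can be approximated as follows. The injection scheme (replacing a fraction of attribute values with random values drawn from each attribute's observed range) and the plain 1-NN classifier are common choices in this literature, though the paper's exact procedure may differ:

```python
import numpy as np

def add_attribute_noise(X, noise_level, rng):
    """Corrupt a fraction `noise_level` of all attribute values by replacing
    them with random values drawn from the observed range of each attribute
    (one common supervised attribute-noise injection scheme)."""
    X = np.array(X, dtype=float)
    n, m = X.shape
    n_noisy = int(round(noise_level * n * m))
    cells = rng.choice(n * m, size=n_noisy, replace=False)
    lo, hi = X.min(axis=0), X.max(axis=0)
    for c in cells:
        i, j = divmod(int(c), m)
        X[i, j] = rng.uniform(lo[j], hi[j])
    return X

def nn_accuracy(X_train, y_train, X_test, y_test):
    """Test accuracy of the 1-NN rule, a classifier known to be noise
    sensitive, which makes preprocessing effects easy to observe."""
    correct = 0
    for x, y in zip(X_test, y_test):
        d = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        correct += int(y_train[int(np.argmin(d))] == y)
    return correct / len(y_test)
```

Running `nn_accuracy` on the same test set after preprocessing the noisy training data with each method then gives the per-dataset performances that the statistical comparison operates on.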
Thus, the main contributions of this work are:
- The proposal of a new approach to correct noisy attribute values in classification problems, which is lacking in the specialized literature.
- The study of the efficacy of attribute noise correction in different scenarios, comparing it against not preprocessing, noise filters with different characteristics and other methods for the treatment of attribute noise.
- A comprehensive analysis of the strengths and weaknesses of correcting attribute noise with respect to the elimination of noisy samples, considering 30 real-world datasets with different noise levels.
The rest of this research is organized as follows. Section 2 presents an introduction to classification with noisy data and a brief review of noise preprocessing methods. Section 3 details the proposed attribute noise corrector. Section 4 describes the experimental framework, whereas Section 5 analyzes the results obtained compared with other noise preprocessing methods. Then, Section 6 focuses on the sensitivity analysis of parameters in ANCES. Finally, Section 7 concludes this work and offers ideas about future research.
Noisy data in classification problems
This section focuses on the problem of noisy data in classification and the main alternatives to address it. First, Section 2.1 presents the difficulties derived from the presence of errors in the data, along with the principal types of noise found in classification. Then, Section 2.2 briefly reviews previous works on noise preprocessing, placing special emphasis on those approaches employed in the experiments of this research.
ANCES: Attribute noise corrector based on error scores
This section presents the main foundations of the technique proposed in this research: ANCES. As any other preprocessing technique, it receives an input dataset D, which is considered for its treatment. D is composed of a set of samples, each described by a set of input attributes and one output class taking one of several different labels; the value of an attribute A_j in a sample x_i is denoted as x_ij. The preprocessing of D gives as a result the output dataset D'. ANCES is based on an iterative scheme that assigns an error score to each attribute value according to the neighborhood of its sample and progressively corrects the most suspicious values by means of an optimization meta-heuristic.
Experimental framework
This section presents the details of the experimental study carried out to check the validity of the proposed attribute noise corrector. First, Section 4.1 describes the datasets used. Then, Section 4.2 focuses on the methodology followed to analyze the results.
On the advantages of correcting versus removing attribute noise
This section presents the analysis of the results obtained. First, Section 5.1 analyzes the performance of ANCES and its differences versus not preprocessing. Then, Section 5.2 focuses on the comparison of ANCES with similarity filters, whereas Section 5.3 is devoted to the comparison with ensemble methods. Finally, Section 5.4 analyzes the behavior of ANCES with respect to other attribute noise preprocessing techniques.
Sensitivity analysis of parameters in ANCES
In addition to the comparison with other noise preprocessing techniques (Section 5), this section analyzes the impact on classification performance of changes in two of the main parameters of ANCES: the number of nearest neighbors in the fitness function and the percentage of attribute values corrected in each iteration. In order to study whether the choice of specific values for these parameters significantly affects the results obtained by ANCES, a new experiment is performed based on varying these parameters across a range of values.
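A sensitivity experiment of this kind can be organized as a simple grid evaluation over the two parameters. The grid values and the `evaluate` callback below are illustrative assumptions, not the settings used in the paper:

```python
import itertools

def sensitivity_grid(evaluate, ks=(3, 5, 7), pcts=(0.05, 0.10, 0.20)):
    """Evaluate a preprocessing pipeline over a grid of the two ANCES
    parameters under study: the number of neighbors k in the fitness
    function and the percentage of values corrected per iteration.
    `evaluate(k, pct)` is assumed to return a test performance."""
    results = {}
    for k, pct in itertools.product(ks, pcts):
        results[(k, pct)] = evaluate(k, pct)
    best = max(results, key=results.get)  # best-performing combination
    return results, best
```

Comparing the spread of `results` across the grid against the differences with other preprocessing methods indicates whether ANCES is robust to these parameter choices.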
Conclusions and future work
In this research, a new attribute noise correction method (ANCES) has been proposed. Its main goal is correcting attribute noise instead of removing noisy samples, as most of the solutions in the literature do. The design of ANCES follows an iterative scheme, which involves an attribute noise score to detect errors in attribute values and the usage of an optimization meta-heuristic to correct these values. Moreover, different versions of the original dataset are considered in the noise treatment process.
References (54)
- et al., Random forest classification based acoustic event detection utilizing contextual-information and bottleneck features, Pattern Recognit (2018)
- et al., One-vs-one classification for deep neural networks, Pattern Recognit (2020)
- A generalised label noise model for classification in the presence of annotation errors, Neurocomputing (2016)
- et al., Classification algorithm sensitivity to training data with non representative attribute noise, Decis Support Syst (2009)
- et al., Tackling the problem of classification with noisy data using multiple classifier systems: analysis of the performance and robustness, Inf Sci (2013)
- et al., Predicting noise filtering efficacy with data complexity measures for nearest neighbor classification, Pattern Recognit (2013)
- et al., A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms, Swarm Evol Comput (2011)
- et al., Classification with class noises through probabilistic sampling, Information Fusion (2018)
- et al., Radial-based oversampling for noisy imbalanced data classification, Neurocomputing (2019)
- et al., The impact of class imbalance in classification performance metrics based on the binary confusion matrix, Pattern Recognit (2019)
- Radial-based undersampling for imbalanced data classification, Pattern Recognit
- SMOTE-IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering, Inf Sci
- A label noise tolerant random forest for the classification of remote sensing data based on outdated maps for training, Comput Vision Image Understanding
- Ensemble selection based on classifier prediction confidence, Pattern Recognit
- Fault recognition using an ensemble classifier based on Dempster-Shafer theory, Pattern Recognit
- On the relation of performance to editing in nearest neighbor rules, Pattern Recognit
- A trace lasso regularized robust nonparallel proximal support vector machine for noisy classification, IEEE Access
- Class noise vs. attribute noise: a quantitative study, Artif Intell Rev
- Classification in the presence of label noise: a survey, IEEE Trans Neural Netw Learn Syst
- A comparative evaluation for liver segmentation from SPIR images and a novel level set method using Signed Pressure Force function
- Automated fluorescent microscopic image analysis of PTBP1 expression in glioma, PLoS ONE
- Automatic kidney segmentation using Gaussian mixture model on MRI sequences
- Fully automated and adaptive intensity normalization using statistical features for brain MR images, Celal Bayar University Journal of Science
- A method for liver segmentation in perfusion MR images using probabilistic atlases and viscous reconstruction, Pattern Analysis and Applications
- Automatic labeling of portal and hepatic veins from MR images prior to liver transplantation, Int J Comput Assist Radiol Surg
- Analyzing the presence of noise in multi-class problems: alleviating its influence with the one-vs-one decomposition, Knowl Inf Syst
- Classification with noisy labels by importance reweighting, IEEE Trans Pattern Anal Mach Intell
Cited by (14)
- Tackling the problem of noisy IoT sensor data in smart agriculture: Regression noise filters for enhanced evapotranspiration prediction. 2024, Expert Systems with Applications
- An optimization for adaptive multi-filter estimation in medical images and EEG based signal denoising. 2023, Biomedical Signal Processing and Control. Citation excerpt: "Weak signal detection is challenging; hence, other computational improvements are required in signal noise filters. Sáez and Corchado [16] have proposed a denoising method to improve the classification performance, since noise tends to degrade the prediction. Regression and classification are some of the most critical tasks in machine learning-based prediction."
- A noise-aware fuzzy rough set approach for feature selection. 2022, Knowledge-Based Systems. Citation excerpt: "Generally speaking, there are two types of noise samples in the data [19]. One is that the conditional attribute of the sample is anomalous (i.e., attribute noise) [22], and the other is that the decision attribute of the sample is anomalous (i.e., class noise) [23]. Two types of noise have different impacts on the dependence degree and downstream learning task [19,24]."
- Learning to rectify for robust learning with noisy labels. 2022, Pattern Recognition. Citation excerpt: "There are essentially three ways to implement the loss function correction method. (1) The basic idea is to correct the noisy label to the true one via a correction function [34–36] or extra inference steps [37–40]. (2) In contrast to leveraging hard labels in the learning stage, label smoothing regularizers, such as the confusion matrix [15,41] or soft labels [42], are introduced to avoid the bias of learning on noisy data."
- The rank of contextuality. 2023, New Journal of Physics
José A. Sáez is a PhD in Computer Science and Computing Technology. He is currently working at the University of Granada (Spain). His main research interests are related to data mining, data preprocessing and data transformation tasks in Knowledge Discovery in Databases, including noisy data in classification, discretization methods, imbalanced learning, performance evaluation methods, nonparametric statistical tests, unsupervised learning and others. His main research line is that of noisy data in classification tasks, with more than 20 publications on this topic.
Emilio Corchado is a full Professor at the University of Salamanca (Spain) in Computer and Automatic Science. He received his PhD in Computer Science from the University of Salamanca. His research interests include neural networks, with a particular focus on exploratory projection pursuit, maximum likelihood Hebbian learning, self-organising maps, multiple classifier systems and hybrid systems. He has published over 100 peer-reviewed articles on a range of topics from knowledge management and risk analysis, intrusion detection systems, food industry, artificial vision, and modelling of industrial processes. He has been organizing chair, program committee chair, session chair and general chair for a number of conferences, such as the International Conference on Hybrid Artificial Intelligence Systems (HAIS), the International Conference on Intelligent Data Engineering and Automated Learning (IDEAL), and the International Conference on Knowledge-Based Intelligent Information and Engineering Systems (KES). He is an RTD expert hired by international organizations such as the European Commission, the Grant Agency of the Czech Republic and the Spanish National Agency for Assessment and Forecasting, and has collaborated with SMEs and new companies in the innovation field in about 40 projects. He has patented software models and he owns the IP of more than 10 ICT tools and models. He was the chair of the IEEE Spanish Section during the years 2014-15, and has actively contributed to several current projects in the EU, including SOFTCOMP, IT4Innovation, ICT Action COST IC1303 and IntelliCIS NISIS.