ANCES: A novel method to repair attribute noise in classification problems
Introduction
Data gathering, preparation and storage procedures in information systems are usually linked to imperfections in real-world applications [1]. Because of this, datasets usually contain errors or noise [2], [3], particularly medical image data [4], [5], [6], which play a major role in automated diagnosis and treatment planning based on imaging techniques [7], [8], [9]. In classification [10], [11], a model is built from labeled samples in order to predict the class of new, previously unobserved samples. Building classifiers from noisy data has several well-known disadvantages that have been studied in the specialized literature [2], [12]. First, the learning phase usually requires more time and samples to create the model. Furthermore, the resulting classifier will probably be less accurate and more complex, since errors in the data may be modeled.
Two different types of noise are found in classification datasets [2]: class and attribute noise. Class noise [13], [14] is produced when samples are incorrectly labeled. Many research works focus on its identification and treatment [3], [15]. They usually establish an association between samples with class noise and misclassifications performed by some classification system. In this way, the detection of samples with class noise is relatively simple. Once these samples are identified, the treatment of the errors affecting them is also uncomplicated, since their removal usually leads to improvements in classification performance [16]. The techniques responsible for performing such tasks are known as noise filters [17], [18] and constitute one of the most common preprocessing approaches when data are affected by class noise.
On the other hand, attribute noise [19], [20] is related to the presence of errors in the attribute values of the samples in a dataset. Even though there are works showing that attribute noise negatively impacts classifier performance [12], [21], its identification and treatment have traditionally been overlooked in the literature due to their higher complexity. There have been some attempts to translate the principles of class noise treatment via noise filtering to the attribute noise problem [2], [22]. However, they have shown that removing samples containing attribute noise is counterproductive in some cases, as these samples still contain valuable information in other attributes that can help to build a more accurate classifier. For this reason, it is interesting to investigate other alternatives to improve classification performance when data are affected by attribute noise.
This research focuses on the detection and treatment of attribute values containing noise. Since the nature of attribute noise differs from that of class noise, its detection must move away from the classic association between noise and mislabeled data. Thus, the proposal of this paper is based on analyzing the neighborhood of each sample to assign an error score to each one of its attribute values. The more an attribute value of a sample differs from those of its nearest neighboring samples, the more likely it is to contain noise. On the other hand, the treatment of such noisy attribute values is not directly related to the removal of the samples. These values are passed through an optimization process in which they are progressively corrected using optimization meta-heuristics [23] with the goal of improving classification performance. Additionally, the approach presented is also inspired by important postulates in noise preprocessing, such as the iterative detection and treatment of noise in datasets [24], in such a way that the errors corrected in one iteration do not negatively influence later iterations. The proposed method is called Attribute Noise Corrector based on Error Scores (ANCES).
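The neighborhood-based scoring idea can be sketched in a few lines of Python. The exact score definition used by ANCES is not reproduced in this excerpt; the deviation-from-neighborhood measure below is an illustrative stand-in, not the paper's formulation:

```python
import numpy as np

def attribute_error_scores(X, n_neighbors=5):
    """Score each attribute value by how much it deviates from the values
    of the same attribute among the sample's nearest neighbors.  Higher
    scores suggest a noisier value (illustrative definition only)."""
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    # Pairwise Euclidean distances between samples.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor
    scores = np.zeros((n, m))
    for i in range(n):
        neighbors = np.argsort(dist[i])[:n_neighbors]
        # Deviation of each attribute value from the neighborhood mean,
        # normalized by the neighborhood spread.
        mu = X[neighbors].mean(axis=0)
        sd = X[neighbors].std(axis=0) + 1e-12
        scores[i] = np.abs(X[i] - mu) / sd
    return scores
```

With this kind of score, an attribute value that is far from its neighborhood (e.g., a single corrupted cell in an otherwise tight cluster) receives a score orders of magnitude above the rest, making it an obvious candidate for treatment.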
A thorough experimental study has been developed comparing both the absence of preprocessing and 15 representative noise preprocessing methods against ANCES. All of them have been used to preprocess 30 real-world datasets of a different nature, in which several attribute noise levels (from 5% to 40%, in increments of 5%) are introduced in a supervised manner [2]. The preprocessed datasets have then been employed to create classifiers with a method of well-known behavior against noise, the Nearest Neighbor (NN) rule [25], which is considered very noise sensitive and therefore makes it easier to observe the effect of each preprocessing technique on the data. Its test performances in the datasets preprocessed with ANCES and with the rest of the approaches have been compared using appropriate statistical tests [26] in order to check the significance of the differences found. A webpage with complementary material to this paper is available at https://joseasaezm.github.io/ances/, including the datasets used and the results obtained for each preprocessing method.
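The experimental setup described above can be approximated as follows. The injection scheme (replacing a fraction of attribute values with random values drawn from each attribute's observed range) and the plain 1-NN classifier are common choices in this literature, though the paper's exact procedure may differ:

```python
import numpy as np

def add_attribute_noise(X, noise_level, rng):
    """Corrupt a fraction `noise_level` of all attribute values by replacing
    them with random values drawn from the observed range of each attribute
    (one common supervised attribute-noise injection scheme)."""
    X = np.array(X, dtype=float)
    n, m = X.shape
    n_noisy = int(round(noise_level * n * m))
    cells = rng.choice(n * m, size=n_noisy, replace=False)
    lo, hi = X.min(axis=0), X.max(axis=0)
    for c in cells:
        i, j = divmod(int(c), m)
        X[i, j] = rng.uniform(lo[j], hi[j])
    return X

def nn_accuracy(X_train, y_train, X_test, y_test):
    """Test accuracy of the 1-NN rule, a classifier known to be noise
    sensitive, which makes preprocessing effects easy to observe."""
    correct = 0
    for x, y in zip(X_test, y_test):
        d = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        correct += int(y_train[int(np.argmin(d))] == y)
    return correct / len(y_test)
```

Running `nn_accuracy` on the same test set after preprocessing the noisy training data with each method then gives the per-dataset performances that the statistical comparison operates on.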
Thus, the main contributions of this work are:
- The proposal of a new approach to correct noisy attribute values in classification problems, which is lacking in the specialized literature.
- The study of the efficacy of attribute noise correction in different scenarios, comparing it against not preprocessing, noise filters with different characteristics and other methods for the treatment of attribute noise.
- A comprehensive analysis of the strengths and weaknesses of correcting attribute noise with respect to the elimination of noisy samples, considering 30 real-world datasets with different noise levels.
The rest of this research is organized as follows. Section 2 presents an introduction to classification with noisy data and a brief review of noise preprocessing methods. Section 3 details the proposed attribute noise corrector. Section 4 describes the experimental framework, whereas Section 5 analyzes the results obtained compared with other noise preprocessing methods. Then, Section 6 focuses on the sensitivity analysis of parameters in ANCES. Finally, Section 7 concludes this work and offers ideas about future research.
Noisy data in classification problems
This section focuses on the problem of noisy data in classification and the main alternatives to address it. First, Section 2.1 presents the difficulties derived from the presence of errors in the data, along with the principal types of noise found in classification. Then, Section 2.2 briefly reviews previous works on noise preprocessing, placing special emphasis on those approaches employed in the experiments of this research.
ANCES: Attribute noise corrector based on error scores
This section presents the main foundations of the technique proposed in this research: ANCES. As any other preprocessing technique, it receives an input dataset D, which is considered for its treatment. D is composed of a set of samples, each described by a set of input attributes and one output class taking one of several different labels; the value of an attribute A_j in a sample x_i is denoted as x_ij. The preprocessing of D gives as a result the output dataset D'. ANCES is based on an iterative scheme that assigns an error score to each attribute value according to the neighborhood of its sample and progressively corrects the most suspicious values by means of an optimization meta-heuristic.
Experimental framework
This section presents the details of the experimental study carried out to check the validity of the proposed attribute noise corrector. First, Section 4.1 describes the datasets used. Then, Section 4.2 focuses on the methodology followed to analyze the results.
On the advantages of correcting versus removing attribute noise
This section presents the analysis of the results obtained. First, Section 5.1 analyzes the performance of ANCES and its differences versus not preprocessing. Then, Section 5.2 focuses on the comparison of ANCES with similarity filters, whereas Section 5.3 is devoted to the comparison with ensemble methods. Finally, Section 5.4 analyzes the behavior of ANCES with respect to other attribute noise preprocessing techniques.
Sensitivity analysis of parameters in ANCES
In addition to the comparison with other noise preprocessing techniques (Section 5), this section analyzes the impact on classification performance of changes in two of the main parameters of ANCES: the number of nearest neighbors in the fitness function and the percentage of attribute values corrected in each iteration. In order to study whether the choice of specific values for these parameters significantly affects the results obtained by ANCES, a new experiment is performed based on varying these parameters across a range of values.
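A sensitivity experiment of this kind can be organized as a simple grid evaluation over the two parameters. The grid values and the `evaluate` callback below are illustrative assumptions, not the settings used in the paper:

```python
import itertools

def sensitivity_grid(evaluate, ks=(3, 5, 7), pcts=(0.05, 0.10, 0.20)):
    """Evaluate a preprocessing pipeline over a grid of the two ANCES
    parameters under study: the number of neighbors k in the fitness
    function and the percentage of values corrected per iteration.
    `evaluate(k, pct)` is assumed to return a test performance."""
    results = {}
    for k, pct in itertools.product(ks, pcts):
        results[(k, pct)] = evaluate(k, pct)
    best = max(results, key=results.get)  # best-performing combination
    return results, best
```

Comparing the spread of `results` across the grid against the differences with other preprocessing methods indicates whether ANCES is robust to these parameter choices.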
Conclusions and future work
In this research, a new attribute noise correction method (ANCES) has been proposed. Its main goal is correcting attribute noise instead of removing noisy samples, as most of the solutions in the literature do. The design of ANCES follows an iterative scheme, which involves an attribute noise score to detect errors in attribute values and the usage of an optimization meta-heuristic to correct these values. Moreover, different versions of the original dataset are considered in the noise treatment process.
References (54)
- et al., Random forest classification based acoustic event detection utilizing contextual-information and bottleneck features, Pattern Recognit (2018)
- et al., One-vs-one classification for deep neural networks, Pattern Recognit (2020)
- A generalised label noise model for classification in the presence of annotation errors, Neurocomputing (2016)
- et al., Classification algorithm sensitivity to training data with non representative attribute noise, Decis Support Syst (2009)
- et al., Tackling the problem of classification with noisy data using multiple classifier systems: analysis of the performance and robustness, Inf Sci (2013)
- et al., Predicting noise filtering efficacy with data complexity measures for nearest neighbor classification, Pattern Recognit (2013)
- et al., A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms, Swarm Evol Comput (2011)
- et al., Classification with class noises through probabilistic sampling, Information Fusion (2018)
- et al., Radial-based oversampling for noisy imbalanced data classification, Neurocomputing (2019)
- et al., The impact of class imbalance in classification performance metrics based on the binary confusion matrix, Pattern Recognit (2019)
- Radial-based undersampling for imbalanced data classification, Pattern Recognit
- SMOTE-IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering, Inf Sci
- A label noise tolerant random forest for the classification of remote sensing data based on outdated maps for training, Comput Vision Image Understanding
- Ensemble selection based on classifier prediction confidence, Pattern Recognit
- Fault recognition using an ensemble classifier based on Dempster-Shafer theory, Pattern Recognit
- On the relation of performance to editing in nearest neighbor rules, Pattern Recognit
- A trace lasso regularized robust nonparallel proximal support vector machine for noisy classification, IEEE Access
- Class noise vs. attribute noise: a quantitative study, Artif Intell Rev
- Classification in the presence of label noise: a survey, IEEE Trans Neural Netw Learn Syst
- A comparative evaluation for liver segmentation from SPIR images and a novel level set method using Signed Pressure Force function
- Automated fluorescent microscopic image analysis of PTBP1 expression in glioma, PLoS ONE
- Automatic kidney segmentation using Gaussian mixture model on MRI sequences
- Fully automated and adaptive intensity normalization using statistical features for brain MR images, Celal Bayar University Journal of Science
- A method for liver segmentation in perfusion MR images using probabilistic atlases and viscous reconstruction, Pattern Analysis and Applications
- Automatic labeling of portal and hepatic veins from MR images prior to liver transplantation, Int J Comput Assist Radiol Surg
- Analyzing the presence of noise in multi-class problems: alleviating its influence with the one-vs-one decomposition, Knowl Inf Syst
- Classification with noisy labels by importance reweighting, IEEE Trans Pattern Anal Mach Intell
Cited by (14)
- Tackling the problem of noisy IoT sensor data in smart agriculture: Regression noise filters for enhanced evapotranspiration prediction. 2024, Expert Systems with Applications
- An optimization for adaptive multi-filter estimation in medical images and EEG based signal denoising. 2023, Biomedical Signal Processing and Control. Citation excerpt: "Weak signal detection is challenging; hence, other computational improvements are required in signal noise filters. Sáez and Corchado [16] have proposed a denoising method to improve the classification performance, since noise tends to degrade the prediction. Regression and classification are some of the most critical tasks in machine learning-based prediction."
- A noise-aware fuzzy rough set approach for feature selection. 2022, Knowledge-Based Systems. Citation excerpt: "Generally speaking, there are two types of noise samples in the data [19]. One is that the conditional attribute of the sample is anomalous (i.e., attribute noise) [22], and the other is that the decision attribute of the sample is anomalous (i.e., class noise) [23]. Two types of noise have different impacts on the dependence degree and downstream learning task [19,24]."
- Learning to rectify for robust learning with noisy labels. 2022, Pattern Recognition. Citation excerpt: "There are essentially three ways to implement the loss function correction method. (1) The basic idea is to correct the noisy label to the true one via a correction function [34–36] or extra inference steps [37–40]. (2) In contrast to leveraging hard labels in the learning stage, label smoothing regularizers, such as the confusion matrix [15,41] or soft labels [42], are introduced to avoid the bias of learning on noisy data."
- The rank of contextuality. 2023, New Journal of Physics
José A. Sáez is a PhD in Computer Science and Computing Technology. He is currently working at the University of Granada (Spain). His main research interests are related to data mining, data preprocessing and data transformation tasks in Knowledge Discovery in Databases, including noisy data in classification, discretization methods, imbalanced learning, performance evaluation methods, nonparametric statistical tests, unsupervised learning and others. His main research line is that of noisy data in classification tasks, with more than 20 publications on this topic.
Emilio Corchado is a full Professor at the University of Salamanca (Spain) in Computer and Automatic Science. He received his PhD in Computer Science from the University of Salamanca. His research interests include neural networks, with a particular focus on exploratory projection pursuit, maximum likelihood Hebbian learning, self-organising maps, multiple classifier systems and hybrid systems. He has published over 100 peer-reviewed articles on a range of topics from knowledge management and risk analysis, intrusion detection systems, food industry, artificial vision, and modelling of industrial processes. He has been organizing chair, program committee chair, session chair and general chair for a number of conferences, such as the International Conference on Hybrid Artificial Intelligence Systems (HAIS), the International Conference on Intelligent Data Engineering and Automated Learning (IDEAL), and the International Conference on Knowledge-Based Intelligent Information and Engineering Systems (KES). He is an RTD expert hired by international organizations such as the European Commission, the Grant Agency of the Czech Republic and the Spanish National Agency for Assessment and Forecasting, and has collaborated with SMEs and new companies in the innovation field in about 40 projects. He has patented software models and he owns the IP of more than 10 ICT tools and models. He was the chair of the IEEE Spanish Section during the years 2014-15, and has actively contributed to several current projects in the EU, including SOFTCOMP, IT4Innovation, ICT Action COST IC1303 and IntelliCIS NISIS.