Neurocomputing

Volume 159, 2 July 2015, Pages 242-250

MD-ELM: Originally Mislabeled Samples Detection using OP-ELM Model

https://doi.org/10.1016/j.neucom.2015.01.055

Abstract

This paper proposes a methodology for identifying data samples that are likely to be mislabeled in a c-class classification problem (dataset). The methodology relies on the assumption that the generalization error of a model learned from the data decreases if the label of a mislabeled sample is changed to its correct class. The general classification model used in the paper is OP-ELM; it also provides a fast way to estimate the generalization error via the PRESS Leave-One-Out statistic. The methodology is tested on two toy datasets, as well as on real-life datasets, for one of which expert knowledge about the identified potential mislabels has been sought.

Introduction

This work focuses on finding data samples with incorrect labels in a given dataset. Such samples create “label noise”, which is generally considered more harmful than feature noise [1]. The work is motivated by studies on a financial dataset [2] where each sample corresponds to a company labeled as either healthy or bankrupt. In this dataset, mislabeled samples are important both in themselves (e.g. companies eligible for a loan but mislabeled as “bankrupt”) and for the dataset as a whole, since correcting them allows more precise machine learning models to be built from a limited amount of data (each sample being expensive and slow to obtain). There are other areas where the detection of particular mislabeled samples is important, such as medical applications [3].

There exist multiple sources of label noise. First, noise can be generated by simple mistakes in data gathering and processing, such as typing errors or sensor malfunctions [4], [1]. For real datasets, such noise is estimated at roughly 5%, not including other factors [5]. Second, the experts who label the data can make mistakes. This happens especially when labeling quality is traded for a lower labeling price, for instance with crowdsourcing [6] such as the Amazon Mechanical Turk framework [7]. Third, the labeling criterion may be vague, in which case different experts will produce different labels. For example, in EEG segmentation the exact beginnings and ends of signals are not formally defined, and different doctors give slightly different signal boundaries [8]. Finally, the available information may be insufficient for reliable labeling of the data [4].

Recent methods of machine learning in the presence of mislabeled data can be aggregated into three categories [9]. Data cleansing (or filtering) methods [4] pre-process the dataset, fixing incorrect labels or removing the affected samples [10]; the resulting clean dataset is then used with general machine learning methods. Noise-robust methods [11], [12], such as k-nearest neighbors [13], are tuned to perform well despite the presence of label noise. It is even possible to achieve the same theoretical performance with label noise as without it, although only in simple cases [14]. Noise-tolerant methods include label noise in their model; an extensive overview of such methods is presented in [9], Section 7. A good survey is given by Frénay in his PhD thesis [15].

The idea of detecting mislabeled samples is to exploit their effect of increasing the model complexity [4], [16]. For a Single Layer Feed-forward Neural network (SLFN) model, a more complex problem requires more hidden neurons to learn [24]; equivalently, a more complex problem results in lower accuracy for the same number of hidden neurons. Correcting an incorrect sample label will therefore decrease the model error of a fixed SLFN. Note the difference between mislabeled samples and outliers: an outlier is not a typical sample of any class, so changing its label will not lead to a decrease in model error, although outlier detection methods are applicable for dealing with mislabeled data [17].
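To make this criterion concrete, the sketch below illustrates it with a basic ELM (a fixed random hidden layer with least-squares output weights) standing in for the paper's OP-ELM; the toy data, network size and all names are illustrative assumptions, not the authors' implementation.

    # A minimal sketch of the fixed-SLFN criterion, assuming a basic ELM
    # in place of the paper's OP-ELM; toy data and sizes are illustrative.
    import numpy as np

    rng = np.random.default_rng(0)

    # Two Gaussian classes in 2-D, targets encoded as -1/+1.
    X = np.vstack([rng.normal(-1.0, 1.0, (100, 2)),
                   rng.normal(+1.0, 1.0, (100, 2))])
    t = np.hstack([-np.ones(100), np.ones(100)])

    # Simulate one mislabeled sample by flipping its label.
    t_noisy = t.copy()
    t_noisy[0] = -t_noisy[0]

    # Fixed SLFN: random hidden layer, least-squares output weights.
    W = rng.normal(size=(2, 20))   # input weights (kept fixed)
    b = rng.normal(size=20)        # hidden biases (kept fixed)
    H = np.tanh(X @ W + b)         # hidden layer output matrix

    def train_mse(H, t):
        """Training MSE of the least-squares output layer for targets t."""
        beta = np.linalg.lstsq(H, t, rcond=None)[0]
        return np.mean((H @ beta - t) ** 2)

    print("error with the mislabel:", train_mse(H, t_noisy))
    print("error after correction: ", train_mse(H, t))  # typically lower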

One existing work on the topic of mislabel detection was written by Gamberger and Lavrac [3] before the millennium, for a medical domain. They performed a medical evaluation of particular detected mislabeled samples (corresponding to patients), and found them to be genuinely mislabeled. No more recent works focused on detecting mislabeled samples were found in the literature review.

In [18], the MD-ELM methodology was proposed for the problem of binary classification. This paper extends that method: it introduces efficient performance and stopping criteria, reduces the chance of missing a mislabeled sample, improves overall performance by using multiple SLFN models, extends the methodology to multiple classes, and optimizes computational time. The impact of the method's parameters is also analyzed, and their values are validated on toy datasets.

Section 2 describes the proposed methodology, including the ELM model, fast LOO error estimation, and the final MD-ELM algorithm. The experimental results in Section 3 present the method's performance with artificially added mislabeled samples in the binary and multiclass cases. Several real-life datasets are then used, in which the method attempts to identify potential original mislabels.


Methodology

This section provides an overview of the MD-ELM idea and the developed methodology. The first two subsections give a high-level overview and explanation of the proposed methodology. Section 2.3 introduces the OP-ELM model, and Section 2.4 explains how the LOO error can be calculated quickly within the ELM framework using the PRESS statistic. Section 2.5 provides a summary algorithm of the MD-ELM method.
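As a reference point for Section 2.4, the function below sketches the closed-form PRESS Leave-One-Out error for the linear output layer of an ELM, using the standard PRESS identity e_i^LOO = e_i / (1 - hat_ii); it is a sketch under that assumption, not the paper's exact implementation.

    # A sketch of the PRESS Leave-One-Out MSE for an ELM output layer,
    # assuming the standard PRESS identity for linear least squares.
    # H is the hidden layer output matrix, t the target vector.
    import numpy as np

    def press_loo_mse(H, t):
        HtH_inv = np.linalg.pinv(H.T @ H)
        # Diagonal of the hat matrix H (H^T H)^+ H^T, one entry per sample.
        hat_diag = np.einsum('ij,jk,ik->i', H, HtH_inv, H)
        beta = HtH_inv @ H.T @ t          # least-squares output weights
        residuals = t - H @ beta
        loo_residuals = residuals / (1.0 - hat_diag)
        return np.mean(loo_residuals ** 2)

Because the hidden layer output matrix H stays fixed while labels change, testing a candidate mislabel only requires re-solving the cheap output layer and re-evaluating this quantity; a decrease after flipping a label is taken as evidence that the original label was wrong.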

Datasets

The methodology uses three real-world datasets for experiments. Two of them are the Nursery and Breast Cancer UCI datasets [37], which are used for classification performance analysis after application of the MD-ELM method. For these two datasets, only the average classification performance can be evaluated, because the exact originally mislabeled samples are unavailable. They are available for the last dataset, of 500 companies from the field of Corporate Finance [2], of which 50% are healthy and 50% bankrupt.
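For the UCI datasets, where the true mislabels are unknown, evaluation relies on artificially injected label noise; the helper below is a hypothetical sketch of such injection, with the 5% default rate and the uniform choice of replacement class as illustrative assumptions.

    # A hypothetical sketch of injecting artificial mislabels so that a
    # detection method can be scored against known ground truth; the rate
    # and the uniform replacement-class choice are illustrative.
    import numpy as np

    def inject_mislabels(y, rate=0.05, rng=None):
        """Return noisy labels plus the indices of the flipped samples."""
        rng = rng or np.random.default_rng()
        y_noisy = y.copy()
        classes = np.unique(y)
        flipped = rng.choice(len(y), size=int(rate * len(y)), replace=False)
        for i in flipped:
            # Replace the true class with a different class, chosen uniformly.
            y_noisy[i] = rng.choice(classes[classes != y[i]])
        return y_noisy, flipped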

Conclusions

The methodology presented in this paper aims at identifying the samples in a dataset that are most likely to be mislabeled.

In a classification problem, some samples can be mislabeled for various reasons. The goal of identifying the mislabels is twofold: by finding which samples are mislabeled, one can gain insight into the data (and into how it was gathered) and analyze the possible reasons for these mislabels; also, if the mislabels are correctly identified and relabeled (or removed), the classifier trained...


References (38)

  • M.-C. Yuen, I. King, K.-S. Leung, A survey of crowdsourcing systems, in: Third International Conference on Social...
  • R. Snow, B. O'Connor, D. Jurafsky, A.Y. Ng, Cheap and fast—but is it good?: evaluating non-expert annotations for...
  • N.P. Hughes, S.J. Roberts, L. Tarassenko, Semi-supervised learning of probabilistic models for ECG segmentation, in:...
  • B. Frénay et al., Classification in the presence of label noise: a survey, IEEE Trans. Neural Netw. Learn. Syst. (2013)
  • X. Zhu, X. Wu, Q. Chen, Eliminating class noise in large datasets, in: ICML, vol. 3, 2003, pp....
  • P. Jeatrakul et al., Data cleaning for classification using misclassification analysis, J. Adv. Comput. Intell. Intell. Informat. (2010)
  • S. Okamoto et al., An average-case analysis of the k-nearest neighbor classifier for noisy domains, Int. Joint Conf. Artif. Intell. (1997)
  • P.A. Lachenbruch, Discriminant analysis when the initial samples are misclassified II: non-random misclassification models, Technometrics (1974)
  • B. Frénay, Uncertainty and label noise in machine learning, Ph.D. thesis,...

    Anton Akusok was born in 1988 in Ukraine. He received a B.Sc. degree in Information Technology from Moscow State Mining University in Russia, and continued his education at Aalto University in Finland, where he graduated with an M.Sc. in Technology with a major in Machine Learning, Neural Networks and Image Processing. He is currently pursuing a Ph.D. degree at the University of Iowa in the US. His research includes High-Performance Computing and its application to Machine Learning, particularly in Extreme Learning Machines. Anton has published several conference and journal papers on his research topic.

    David Veganzones received the M.Sc. degree in economics from the University of the Basque Country, Spain, in 2013. He is currently pursuing the Ph.D. degree in management and finance with the Department of Lille Economie & Management, University of Lille, Lille, France. He is interested in various domains of bankruptcy forecasting and the application of machine learning to corporate finance.

    Yoan Miche was born in 1983 in France. He received an Engineer's Degree from the Institut National Polytechnique de Grenoble (INPG, France), and more specifically from TELECOM, INPG, in September 2006. He also graduated with a Master's Degree in Signal, Image and Telecom from ENSERG, INPG, at the same time. He recently received his Ph.D. degree in Computer Science and Signal and Image Processing from both the Aalto University School of Science and Technology (Finland) and the INPG (France). His main research interests are anomaly detection and machine learning for classification/regression.

    Kaj-Mikael Björk received his Master's degree in Chemical Engineering in 1999 from Åbo Akademi University. He also obtained his Ph.D. in Chemical Engineering in 2002 (Åbo Akademi) and his Ph.D. in Business Administration (Information Systems, Åbo Akademi) in 2006. He has been a visiting researcher at Carnegie Mellon University (Pittsburgh, USA, 2000), the University of Linköping (Sweden, 2001) and UC Berkeley (California, USA, 2005–2006). Before working as Head of Department, he worked as a Principal Lecturer in Logistics (Arcada) and Assistant Professor in Information Systems (Åbo Akademi). He has taught approx. 15 different courses in the fields of Logistics and Management Science and Engineering. Within his research projects he has contributed to approx. 60 scientific peer-reviewed articles, and he has an H-index of 10 (according to Google Scholar). His research interests are in Information Systems, analytics, supply chain management, machine learning, fuzzy logic, and optimization.

    Philippe du Jardin is currently a professor at Edhec Business School, in charge of the information technology department, and conducts his research within the Edhec Financial Analysis and Accounting Research Centre. He holds Master's degrees in Business Administration, Research, and Information Technology, as well as a Ph.D. in Business Administration from Nice Sophia-Antipolis University. Prior to coming to Edhec, he held academic appointments at Ceram Business School (Skema), Aix-Marseille University and Nice University, and has served as a consultant for various companies on questions related to IT and data mining. His research interests focus on credit risk and company financial failure, and he is interested in neural networks and nonlinear models.

    Eric Séverin is a professor of Finance at USTL (University of Lille) and a specialist in corporate finance. His research interests are the following: bankruptcy and financial structure, relationships between economics and finance, and financial applications of machine learning in the field of bankruptcy prediction.

    Amaury Lendasse was born in 1972 in Belgium. He received an M.S. degree in Mechanical Engineering from the Université Catholique de Louvain (Belgium) in 1996, an M.S. in Control in 1997, and a Ph.D. in 2003 from the same university. In 2003, he was a post-doctoral researcher in the Computational Neurodynamics Lab at the University of Memphis. Since 2004, he has been a senior researcher and docent at the Adaptive Informatics Research Centre of the Aalto University School of Science and Technology (previously Helsinki University of Technology) in Finland. He created and leads the Environmental and Industrial Machine Learning (previously Time Series Prediction and Chemoinformatics) Group. He is the chairman of the annual ESTSP conference (European Symposium on Time Series Prediction) and a member of the editorial boards and program committees of several journals and conferences on machine learning. He is the author or coauthor of around 100 scientific papers in international journals, books or communications to conferences with reviewing committees. His research includes time series prediction, chemometrics, variable selection, noise variance estimation, determination of missing values in temporal databases, non-linear approximation in financial problems, functional neural networks and classification.
