Novel mislabeled training data detection algorithm

Yuan, Weiwei; Guan, Donghai; Zhu, Qi; Ma, Tinghuai

doi:10.1007/s00521-016-2589-9

Original Article
Published: 16 September 2016

Volume 29, pages 673–683, (2018)

Neural Computing and Applications

Weiwei Yuan^1,2,
Donghai Guan^1,2,
Qi Zhu^1,2 &
…
Tinghuai Ma^3

Abstract

Mislabeled training data are a form of noise that arises in many applications. Because they degrade learning, many filtering techniques have been proposed to identify and eliminate them. The ensemble learning-based filter (EnFilter), which employs an ensemble of classifiers, is the most widely used. EnFilter first divides the noisy training dataset into several subsets. Each noisy subset is then checked by multiple classifiers trained on the other noisy subsets. Because the data used to train these classifiers are themselves noisy, the quality of the classifiers cannot be guaranteed, which can lead to poor noise identification. The problem becomes more serious as the noise ratio of the training dataset increases. To address it, this work proposes a straightforward but effective approach: instead of noisy data, nearly noise-free (NNF) data are used to train the classifiers, since such data are expected to yield more reliable classifiers. To this end, a novel NNF data extraction approach is also proposed. Experimental results on a set of benchmark datasets illustrate the utility of the proposed approach.
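
The partition-and-vote scheme that the abstract attributes to EnFilter can be illustrated with a minimal sketch. This is not the authors' implementation: the base learners (1-NN, 3-NN, nearest centroid), the majority-vote rule, the function names, and the toy grid data are all assumptions chosen to keep the example self-contained in plain Python.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_learner(k):
    """Return a fit function that builds a k-nearest-neighbour predictor."""
    def fit(X, y):
        def predict(x):
            nearest = sorted(range(len(X)), key=lambda i: sq_dist(X[i], x))[:k]
            return Counter(y[i] for i in nearest).most_common(1)[0][0]
        return predict
    return fit


def centroid_learner(X, y):
    """Fit a nearest-centroid predictor (one mean vector per class)."""
    sums, counts = {}, Counter(y)
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            acc[j] += v
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}
    return lambda x: min(centroids, key=lambda c: sq_dist(centroids[c], x))


def ensemble_filter(X, y, learners, n_subsets=5, seed=0):
    """Flag samples that a majority of the classifiers misclassify.

    Each subset is checked by classifiers trained on the remaining
    (still noisy) subsets, as in the scheme the abstract describes.
    """
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    subsets = [idx[i::n_subsets] for i in range(n_subsets)]
    flagged = set()
    for held_out in subsets:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        X_train = [X[i] for i in train]
        y_train = [y[i] for i in train]
        models = [fit(X_train, y_train) for fit in learners]
        for i in held_out:
            wrong = sum(model(X[i]) != y[i] for model in models)
            if wrong > len(models) / 2:  # majority vote (an assumption)
                flagged.add(i)
    return flagged


# Hypothetical toy data: two well-separated 5x4 grids of points.
X = [[i, j] for i in range(5) for j in range(4)]
X += [[i + 20, j + 20] for i in range(5) for j in range(4)]
y_clean = [0] * 20 + [1] * 20
flipped = {7, 31}  # one deliberately mislabeled sample per class
y_noisy = [1 - c if i in flipped else c for i, c in enumerate(y_clean)]

learners = [knn_learner(1), knn_learner(3), centroid_learner]
suspects = ensemble_filter(X, y_noisy, learners)
print(sorted(suspects))  # → [7, 31]
```

On this easy dataset the vote recovers both flipped labels. With a higher noise ratio the classifiers would themselves be trained on many wrong labels, which is the weakness the abstract sets out to address. The NNF extraction step is not sketched here because the abstract does not describe how it works.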

Acknowledgments

This research was supported by the Fundamental Research Funds for the Central Universities (No. NS2016089).

Author information

Authors and Affiliations

College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, 211-106, Jiangsu, China
Weiwei Yuan, Donghai Guan & Qi Zhu
Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing, Jiangsu, 210-023, China
Weiwei Yuan, Donghai Guan & Qi Zhu
Jiangsu Engineering Centre of Network Monitoring, Nanjing University of Information Science and Technology, Nanjing, 210-044, Jiangsu, China
Tinghuai Ma

Corresponding author

Correspondence to Donghai Guan.

About this article

Cite this article

Yuan, W., Guan, D., Zhu, Q. et al. Novel mislabeled training data detection algorithm. Neural Comput & Applic 29, 673–683 (2018). https://doi.org/10.1007/s00521-016-2589-9

Received: 24 February 2016
Accepted: 06 September 2016
Published: 16 September 2016
Issue Date: May 2018
DOI: https://doi.org/10.1007/s00521-016-2589-9
