Abstract
This paper presents a new approach for identifying and eliminating mislabeled training instances for supervised learning algorithms. The novelty of this approach lies in using unlabeled instances to aid the detection of mislabeled training instances, in contrast with existing methods, which rely on the labeled training instances alone. Our approach is straightforward and can be applied to many existing noise detection methods with only marginal modifications. To assess its benefit, we choose two popular noise detection methods: majority filtering (MF) and consensus filtering (CF). MFAUD/CFAUD is the proposed variant of MF/CF based on our approach and denotes majority/consensus filtering with the aid of unlabeled data. An empirical study validates the superiority of our approach and shows that MFAUD and CFAUD significantly improve the performance of MF and CF under different noise ratios and labeled ratios, with the improvement being more remarkable at higher noise ratios.
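Although the abstract gives only a high-level description, the filtering schemes it names are concrete algorithms. The sketch below implements classical majority/consensus filtering in the spirit of Brodley and Friedl and adds an optional unlabeled-data step: because the abstract does not specify the MFAUD/CFAUD mechanism, the unlabeled instances are simply pseudo-labeled by a fold-local model and appended to each training fold, an assumption made purely for illustration. The names used here (noise_filter, X_unlabeled, consensus) are hypothetical.

```python
# A minimal sketch, assuming scikit-learn; it illustrates majority/consensus
# filtering and is not the authors' exact MFAUD/CFAUD implementation.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def noise_filter(X, y, X_unlabeled=None, n_folds=5, consensus=False):
    """Return a boolean mask that is True for suspected mislabeled instances."""
    learners = [DecisionTreeClassifier, GaussianNB, KNeighborsClassifier]
    votes = np.zeros(len(y), dtype=int)
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    for train_idx, test_idx in kf.split(X):
        X_tr, y_tr = X[train_idx], y[train_idx]
        if X_unlabeled is not None:
            # Hypothetical unlabeled-data step: pseudo-label the unlabeled
            # pool with a model trained on this fold's labeled part, then
            # enlarge the training set (assumed, not the paper's mechanism).
            y_u = DecisionTreeClassifier().fit(X_tr, y_tr).predict(X_unlabeled)
            X_tr = np.vstack([X_tr, X_unlabeled])
            y_tr = np.concatenate([y_tr, y_u])
        # Each base learner votes against held-out instances it misclassifies.
        for Learner in learners:
            clf = Learner().fit(X_tr, y_tr)
            votes[test_idx] += clf.predict(X[test_idx]) != y[test_idx]
    # MF flags an instance misclassified by a majority of the base learners;
    # CF is more conservative and requires unanimous misclassification.
    needed = len(learners) if consensus else len(learners) // 2 + 1
    return votes >= needed
```

Under this reading, MF removes an instance when a majority of the base learners misclassify it, while CF removes it only on unanimous misclassification, which is why CF tends to retain more (possibly noisy) data than MF.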
Additional information
This research was done while Dr. Donghai Guan was working in the Ubiquitous Computing Lab at Kyung Hee University.
Cite this article
Guan, D., Yuan, W., Lee, YK. et al. Identifying mislabeled training data with the aid of unlabeled data. Appl Intell 35, 345–358 (2011). https://doi.org/10.1007/s10489-010-0225-4