
Ensembles of label noise filters: a ranking approach

Published in: Data Mining and Knowledge Discovery

Abstract

Label noise can be a major problem in classification tasks, since most machine learning algorithms rely on data labels in their inductive process. Consequently, various techniques for label noise identification have been investigated in the literature. The bias of each technique determines how suitable it is for each dataset. Moreover, while some techniques identify a large number of examples as noisy, at the cost of a high false positive rate, others are very restrictive and therefore unable to identify all noisy examples. This paper investigates how label noise detection can be improved by using an ensemble of noise filtering techniques. These filters, individual and ensemble, are experimentally compared. Another concern in this paper is the computational cost of ensembles, since, for a particular dataset, an individual technique can reach the same predictive performance as an ensemble; in this case, the individual technique should be preferred. To deal with this situation, this study also proposes the use of meta-learning to recommend the best filter for a new dataset. An extensive experimental evaluation of individual filters, ensemble filters and meta-learning was performed using public datasets with artificially injected label noise. The results show that ensembles of noise filters can improve noise filtering performance and that a recommendation system based on meta-learning can successfully recommend the best filtering technique for new datasets. A case study using a real dataset from the ecological niche modeling domain is also presented and evaluated, with the results validated by an expert.
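The core idea of ensemble noise filtering can be illustrated by a consensus-style classification filter: several base learners are evaluated on each example via cross-validation, and an example is flagged as potentially noisy when the filters agree that its label disagrees with their predictions. The sketch below is a minimal illustration of this general scheme, not the authors' exact method; the base learners, voting thresholds and `ensemble_noise_filter` helper are assumptions for illustration.

```python
# Minimal sketch of a consensus-style ensemble label noise filter.
# An example is flagged when enough base filters, trained on the rest of
# the data via cross-validation, disagree with its current label.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

def ensemble_noise_filter(X, y, consensus=True, cv=5):
    """Return indices of examples whose labels look noisy.

    consensus=True flags an example only if ALL base filters disagree
    with its label (restrictive, fewer false positives); consensus=False
    flags on majority disagreement (more permissive).
    """
    base_learners = [DecisionTreeClassifier(random_state=0),
                     KNeighborsClassifier(n_neighbors=3),
                     GaussianNB()]
    votes = np.zeros(len(y), dtype=int)
    for clf in base_learners:
        # Out-of-fold predictions: each example is predicted by a model
        # that never saw it during training.
        pred = cross_val_predict(clf, X, y, cv=cv)
        votes += (pred != y).astype(int)
    threshold = len(base_learners) if consensus else len(base_learners) // 2 + 1
    return np.where(votes >= threshold)[0]
```

The `consensus`/`majority` switch mirrors the trade-off discussed in the abstract: restrictive filters miss noisy examples, while permissive ones flag too many clean examples.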


Notes

  1. https://archive.ics.uci.edu/ml/datasets.html


Acknowledgments

The authors would like to thank FAPESP (processes 2011/14602-7 and 2012/22608-8), CNPq and CAPES for their financial support. The third author's research was supported by the Natural Sciences and Engineering Research Council of Canada, by the CALDO Programme, and by the National Science Centre of Poland (NCN) Grant DEC-2013/09/B/ST6/01549. We are also very grateful to Dr. Augusto Hashimoto de Mendonça, who works at the Center for Water Resources & Applied Ecology, Environmental Engineering Sciences, School of Engineering of São Carlos, University of São Paulo, and to Professor Dr. Giselda Durigan, from the Forestry Institute of the State of São Paulo, for their evaluation of the list of potentially noisy examples in the non-native species H. coronarium dataset.

Author information


Corresponding author

Correspondence to Luís P. F. Garcia.

Additional information

Responsible editor: Thomas Gärtner, Mirco Nanni, Andrea Passerini and Celine Robardet.

Appendix 1: Characterization and complexity measures


Table 4 summarizes the meta-features used to describe the noisy datasets: characterization and complexity measures.

Table 4 Summary of the meta-features employed in the paper
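Characterization meta-features of the kind summarized in Table 4 typically include simple statistics of the dataset such as the numbers of examples, attributes and classes, and the class entropy. The sketch below shows how such basic measures could be computed; it is an illustration only and does not cover the data complexity measures also used in the paper, and the `simple_meta_features` helper is an assumed name.

```python
# Illustrative computation of a few simple characterization meta-features
# (number of examples, number of attributes, number of classes, class
# entropy in bits). Complexity measures from the paper are not shown.
import numpy as np

def simple_meta_features(X, y):
    X, y = np.asarray(X), np.asarray(y)
    n_examples, n_attributes = X.shape
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()                      # class proportions
    class_entropy = float(-np.sum(p * np.log2(p))) # Shannon entropy, bits
    return {"n_examples": n_examples,
            "n_attributes": n_attributes,
            "n_classes": len(counts),
            "class_entropy": class_entropy}
```

In a meta-learning setting, vectors of such measures describe each dataset and become the input features of the recommender that selects a noise filter.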


About this article


Cite this article

Garcia, L.P.F., Lorena, A.C., Matwin, S. et al. Ensembles of label noise filters: a ranking approach. Data Min Knowl Disc 30, 1192–1216 (2016). https://doi.org/10.1007/s10618-016-0475-9
