Abstract
Machine learning algorithms perform differently in settings with varying levels of training set mislabeling noise. Therefore, the choice of the right algorithm for a particular learning problem is crucial. The contribution of this paper addresses two dual problems: first, comparing algorithm behavior; and second, choosing learning algorithms for noisy settings. We present the “sigmoid rule” framework, which can be used to choose the most appropriate learning algorithm depending on the properties of noise in a classification problem. The framework uses an existing model of the expected performance of learning algorithms as a sigmoid function of the signal-to-noise ratio in the training instances. We study the characteristics of the sigmoid function using five representative non-sequential classifiers, namely Naïve Bayes, kNN, SVM, a decision tree classifier, and a rule-based classifier, and three widely used sequential classifiers based on hidden Markov models, conditional random fields, and recurrent neural networks. Based on the sigmoid parameters, we define a set of intuitive criteria that are useful for comparing the behavior of learning algorithms in the presence of noise. Furthermore, we show that there is a connection between these parameters and the characteristics of the underlying dataset, which indicates that the expected performance on a dataset can be estimated regardless of the underlying algorithm. The framework is applicable to concept drift scenarios, including modeling user behavior over time and mining noisy time series of an evolving nature.
















Notes
When we refer to the performance of an algorithm, we mean its classification accuracy.
Throughout the rest of this work we will use the term noise to refer to the label noise, unless otherwise indicated.
A preliminary version of this work has appeared in Mirylenka et al. (2012).
Instead of 0.05, one can use any value close to 0, describing a normalized measure of distance from the optimal performance. In the case of \(p = 0.05\), the distance from the optimal performance is \(5\%\).
We note that high levels of noise, such as 95%, are often observed in the presence of concept drift, e.g., when learning computer-user browsing habits in a network environment where a single IP address is shared by several different users.
Repeated random sub-sampling validation is also known as Monte Carlo cross-validation (Kuhn and Johnson 2013).
The settings of the genetic algorithm can be found in the appendix.
We would like to thank Christos Faloutsos for kindly providing the code for the fractal dimensionality estimation.
RMAE is also known as mean absolute percentage error.
References
Abdulrahman SM, Brazdil P, van Rijn JN, Vanschoren J (2015) Algorithm selection via meta-learning and sample-based active testing. In: Proceedings of the 2015 international workshop on meta-learning and algorithm selection (MetaSel) co-located with European conference on machine learning and principles and practice of knowledge discovery in databases (ECMLPKDD), pp 55–66
Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19(6):716–723. doi:10.1109/TAC.1974.1100705
Ali S, Smith K (2006) On learning algorithm selection for classification. Appl Soft Comput 6(2):119–138
Bodén M (2002) A guide to recurrent neural networks and backpropagation. The Dallas project, SICS technical report (2), pp 1–10. http://130.102.79.1/~mikael/papers/rn_dallas
Box GE, Jenkins GM, Reinsel GC, Ljung GM (2015) Time series analysis: forecasting and control. Wiley, New York
Bradley A (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit 30(7):1145–1159
Brazdil P, Giraud-Carrier C, Soares C, Vilalta R (2009) Development of metalearning systems for algorithm recommendation. In: Metalearning: applications to data mining. Springer, Berlin, pp 31–59
Brazdil PB, Soares C, Pinto Da Costa J (2003) Ranking learning algorithms: using IBL and meta-learning on accuracy and time results. Mach Learn 50(3):251–277
Camastra F, Vinciarelli A (2002) Estimating the intrinsic dimension of data with a fractal-based method. IEEE Trans Pattern Anal Mach Intell 24(10):1404–1407
Chevaleyre Y, Zucker JD (2000) Noise-tolerant rule induction from multi-instance data. In: Proceedings of the workshop on attribute-value and relational learning: crossing the boundaries, co-located with international conference on machine learning (ICML), pp 47–52
Cohen WW (1995) Fast effective rule induction. In: Proceedings of the twelfth international conference on machine learning, pp 115–123
Corder GW, Foreman DI (2009) Nonparametric statistics for non-statisticians: a step-by-step approach. Wiley, Hoboken
Cruz RM, Sabourin R, Cavalcanti GD, Ren TI (2015) Meta-des: a dynamic ensemble selection framework using meta-learning. Pattern Recognit 48(5):1925–1935
de Sousa E, Traina A, Traina Jr. C, Faloutsos C (2006) Evaluating the intrinsic dimension of evolving data streams. In: Proceedings of the 2006 ACM symposium on applied computing, pp 643–648
Dupont P (2006) Noisy sequence classification with smoothed Markov chains. In: Proceedings of the 8th French conference on machine learning (CAP 2006), pp 187–201
Elman J (1990) Finding structure in time. Cognit Sci 14(2):179–211
Eom SB, Ketcherside MA, Lee HH, Rodgers ML, Starrett D (2004) The determinants of web-based instructional systems’ outcome and satisfaction: an empirical investigation. Cognitive aspects of online programs. Instr Technol, pp 96–139
François JM (2013) Jahmm: hidden Markov model (HMM), an implementation in Java. https://code.google.com/p/jahmm/
Lichman M (2013) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. http://archive.ics.uci.edu/ml
Garcia LPF, de Carvalho ACPLF, Lorena AC (2016) Noise detection in the meta-learning level. Neurocomputing 176:14–25
Giannakopoulos G, Palpanas T (2010) The effect of history on modeling systems’ performance: the problem of the demanding lord. In: IEEE 10th international conference on data mining (ICDM). doi:10.1109/ICDM.2010.90
Giannakopoulos G, Palpanas T (2013) Revisiting the effect of history on learning performance: the problem of the demanding lord. Knowl Inf Syst 36(3):653–691. doi:10.1007/s10115-012-0568-8
Giraud-Carrier C, Vilalta R, Brazdil P (2004) Introduction to the special issue on meta-learning. Mach Learn 54(3):187–193
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten I (2009) The weka data mining software: an update. ACM SIGKDD Explor Newsl 11(1):10–18
Han J, Kamber M (2006) Data mining: concepts and techniques. Morgan Kaufmann, San Francisco
Haussler D (1990) Probably approximately correct learning. University of California, Santa Cruz, Computer Research Laboratory
Heywood MI (2015) Evolutionary model building under streaming data for classification tasks: opportunities and challenges. Genet Program Evolvable Mach 16(3):283–326
Kalapanidas E, Avouris N, Craciun M, Neagu D (2003) Machine learning algorithms: a study on noise sensitivity. In: Proceedings of the 1st Balkan conference in informatics, pp 356–365
Keerthi S, Shevade S, Bhattacharyya C, Murthy K (2001) Improvements to Platt’s SMO algorithm for SVM classifier design. Neural Comput 13(3):637–649
Klinkenberg R (2005) Meta-learning, model selection, and example selection in machine learning domains with concept drift. In: Lernen, Wissensentdeckung und Adaptivität (LWA) 2005, GI Workshops, Saarbrücken, October 10th–12th, pp 164–171
Kuh A, Petsche T, Rivest RL (1990) Learning time-varying concepts. In: Conference on neural information processing systems (NIPS), pp 183–189
Kuhn M, Johnson K (2013) Applied predictive modeling. Springer, New York. doi:10.1007/978-1-4614-6849-3
Lafferty J, McCallum A, Pereira F (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the eighteenth international conference on machine learning (ICML), pp 282–289
Li Q, Li T, Zhu S, Kambhamettu C (2002) Improving medical/biological data classification performance by wavelet preprocessing. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 657–660
Marsaglia G, Tsang WW, Wang J (2003) Evaluating Kolmogorov’s distribution. J Stat Softw 8(1):1–4. doi:10.18637/jss.v008.i18
Massey FJ Jr (1951) The Kolmogorov-Smirnov test for goodness of fit. J Am Stat Assoc 46(253):68–78
McCallum AK (2002) Mallet: a machine learning for language toolkit. http://mallet.cs.umass.edu
Mirylenka K, Cormode G, Palpanas T, Srivastava D (2015) Conditional heavy hitters: detecting interesting correlations in data streams. Int J Very Large Data Bases (VLDB) 24(3):395–414
Mirylenka K, Giannakopoulos G, Palpanas T (2012) SRF: a framework for the study of classifier behavior under training set mislabeling noise. In: Advances in knowledge discovery and data mining, lecture notes in computer science, vol 7301, pp 109–121
Mirylenka K, Palpanas T, Cormode G, Srivastava D (2013) Finding interesting correlations with conditional heavy hitters. In: IEEE 29th international conference on data engineering (ICDE), pp 1069–1080
Mantovani RG, Rossi ALD, Vanschoren J, Carvalho ACPLF (2015) Meta-learning recommendation of default hyper-parameter values for SVMs in classification tasks. In: Proceedings of the 2015 international workshop on meta-learning and algorithm selection (MetaSel), European conference on machine learning and principles and practice of knowledge discovery in databases (ECMLPKDD), pp 80–92
Nettleton DF, Orriols-Puig A, Fornells A (2010) A study of the effect of different types of noise on the precision of supervised learning techniques. Artif Intell Rev 33(4):275–306. doi:10.1007/s10462-010-9156-z
Pechenizkiy M (2015) Predictive analytics on evolving data streams anticipating and adapting to changes in known and unknown contexts. In: IEEE international conference on high performance computing & simulation (HPCS), pp 658–659
Pendrith M, Sammut C (1994) On reinforcement learning of control actions in noisy and non-markovian domains. Technical report, School of Computer Science and Engineering, The University of New South Wales, Sydney
Rabiner L, Juang B (1986) An introduction to hidden Markov models. IEEE ASSP Mag 3(1):4–16
Rossi ALD, de Carvalho ACPLF, Soares C, de Souza BF (2014) MetaStream: a meta-learning based method for periodic algorithm selection in time-changing data. Neurocomputing 127:52–64
Smith MR, Mitchell L, Giraud-Carrier C, Martinez T (2014) Recommending learning algorithms and their associated hyperparameters. arXiv:1407.1890
Sutton C, McCallum A (2010) An introduction to conditional random fields. arXiv:1011.4088
Taylor R (1990) Interpretation of the correlation coefficient: a basic review. J Diagn Med Sonogr 6(1):35–39
Teytaud O (2001) Learning with noise. Extension to regression. In: Proceedings of the IEEE international joint conference on neural networks (IJCNN’01) vol 3, pp 1787–1792
Theodoridis S, Koutroumbas K (2003) Pattern recognition. Academic Press, San Diego
Valiant L (1984) A theory of the learnable. Commun ACM 27(11):1134–1142
Vapnik VN (1998) Statistical learning theory, vol 1. Wiley, New York
Waluyan L, Sasipan S, Noguera S, Asai T (2009) Analysis of potential problems in people management concerning information security in a cross-cultural environment: the case of Malaysia. In: Proceedings of the third international symposium on human aspects of information security & assurance (HAISA), pp 13–24
Widmer G (1997) Tracking context changes through meta-learning. Mach Learn 27(3):259–286
Wolpert D (1996) The existence of a priori distinctions between learning algorithms. Neural Comput 8:1391–1421
Wolpert D (2001) The supervised learning no-free-lunch theorems. In: Proceedings of the 6th online world conference on soft computing in industrial applications. Springer, London, pp 25–42
Wolpert DH (1996) The lack of a priori distinctions between learning algorithms. Neural Comput 8:1341–1390
Lee JW, Giraud-Carrier C (2008) New insights into learning algorithms and datasets. In: IEEE seventh international conference on machine learning and applications (ICMLA’08), pp 135–140
Xing Z, Pei J, Keogh E (2010) A brief survey on sequence classification. ACM SIGKDD Explor Newsl 12(1):40–48. doi:10.1145/1882471.1882478
Acknowledgements
We thank the editor and the reviewers of this paper for their valuable comments and advice. K. Mirylenka also thanks Daniil Mirylenka for proofreading the paper.
Additional information
Responsible editor: Johannes Fuernkranz.
Katsiaryna Mirylenka: Work mainly done while at the University of Trento, Italy.
Appendices
1.1 Appendix 1: The genetic algorithm settings
We used the JGAP genetic algorithms package for the search in the sigmoid parameter space, and the Java Statistical Classes library for the Kolmogorov–Smirnov (KS) test. Default operators for double numbers were used. The chromosome consisted of five alleles, one per sigmoid parameter:
- m with allowed values in [0.0, 0.5].
- M with allowed values in [0.5, 1.0].
- b with allowed values in [0.0, 50.0].
- c with allowed values in [0.0, 50.0].
- d with allowed values in [-5.0, 5.0].
Essentially, employing the genetic algorithm, we try to maximize the following quantity:

\(\mathrm{fitness}(i) = \frac{1}{1 + D},\)
where i is a candidate individual in the genetic algorithm, corresponding to a given set of parameter values, and fitness(i) is the value of the fitness function for that individual. The quantity D is the D statistic of the KS test (Marsaglia et al. 2003), which is higher when the fit is worse. The population per iteration is 10,000 individuals. The search ends when there is no significant improvement (i.e., greater than \(10^{-5}\)) after 20 consecutive iterations of the genetic algorithm, or when 1000 iterations have been completed.
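To make the procedure concrete, the following is a minimal sketch of the fitness computation, assuming a five-parameter sigmoid of the form \(f(x) = m + (M - m)/(1 + b\,e^{-c(x - d)})\) and using SciPy's two-sample KS statistic in place of the Java Statistical Classes implementation; the sigmoid parameterization, the fitness form, and all function names are illustrative assumptions, not the verbatim code used in the experiments.

```python
import numpy as np
from scipy.stats import ks_2samp

def sigmoid(x, m, M, b, c, d):
    # Assumed five-parameter sigmoid: m and M bound the performance range,
    # while b, c and d control the slope and position of the transition.
    return m + (M - m) / (1.0 + b * np.exp(-c * (x - d)))

def fitness(individual, snr, observed_performance):
    """Fitness of a candidate individual (a vector of the five sigmoid
    parameters): higher when the KS D statistic between the observed and
    predicted performance values is lower."""
    predicted = sigmoid(snr, *individual)
    D = ks_2samp(observed_performance, predicted).statistic
    # Assumed form: any function decreasing in D would serve the same purpose.
    return 1.0 / (1.0 + D)
```

A genetic algorithm then searches within the box constraints listed above (e.g., \(m \in [0.0, 0.5]\), \(b \in [0.0, 50.0]\)) for the parameter vector that maximizes this fitness.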
1.2 Appendix 2: Correlation results
Non-sequential case
All the correlation coefficients with the corresponding p values are shown in Table 11. Green (dark) cells mark the pairs that have medium correlation. Significant p values are emphasized as follows: p value < 0.05 in underlined bold, p value < 0.1 in bold italics.
Sequential case
All the correlation coefficients with the corresponding p values are shown in Table 12, where yellow (light) cells mark the pairs with strong statistically significant correlation and green (dark) cells mark the pairs with medium statistically significant correlation. Cells with p value < 0.05 are emphasized in underlined bold, and cells with p value < 0.1 in bold italics.
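For completeness, a correlation table of this kind can be reproduced with a short script. The sketch below is a minimal illustration assuming pandas and SciPy are available; it uses Spearman correlation as one example measure and hypothetical column names, rather than the exact procedure behind Tables 11 and 12.

```python
import pandas as pd
from scipy.stats import spearmanr

def correlation_table(df, strong=0.05, medium=0.1):
    """Pairwise correlation coefficients with p values and significance flags."""
    rows = []
    cols = list(df.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            rho, p = spearmanr(df[a], df[b])
            flag = "p<0.05" if p < strong else ("p<0.1" if p < medium else "")
            rows.append({"pair": f"{a}-{b}", "rho": rho, "p": p, "sig": flag})
    return pd.DataFrame(rows)

# Hypothetical usage: columns hold sigmoid parameters and dataset features.
# table = correlation_table(data[["m", "M", "b", "c", "d", "fractal_dim"]])
```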
Cite this article
Mirylenka, K., Giannakopoulos, G., Do, L.M. et al. On classifier behavior in the presence of mislabeling noise. Data Min Knowl Disc 31, 661–701 (2017). https://doi.org/10.1007/s10618-016-0484-8