Abstract
Noisy data is inherent in many real-life and industrial modelling situations. If prior knowledge of such noise were available, it would be straightforward to remove or account for it and so improve model robustness. Unfortunately, in most learning situations the presence of underlying noise is suspected but difficult to detect.
Ensemble classification techniques such as bagging (Breiman, 1996a), boosting (Freund & Schapire, 1997) and arcing algorithms (Breiman, 1997) have received much attention in the recent literature. Such techniques have been shown to reduce classification error on unseen cases, and this paper demonstrates that they may also be employed as noise detectors. Recently defined diagnostics such as edge and margin (Breiman, 1997; Freund & Schapire, 1997; Schapire et al., 1998) have been used to explain the improvements in generalisation error achieved when ensemble classifiers are built. The distributions of these measures are key to the noise detection process introduced in this study.
This paper presents empirical results on edge distributions which confirm existing theories on boosting's tendency to 'balance' error rates. The results are then extended into a methodology whereby boosting may be used to identify noise in training data by examining the changes in the edge and margin distributions as boosting proceeds.
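The margin of a training case under a voted ensemble is the normalised weighted vote for its true class minus that for the best wrong class; Breiman's edge is the weighted fraction of incorrect votes, so for a binary problem margin = 1 − 2·edge. The paper's own procedure is not reproduced here, but the underlying idea — cases that still carry low or negative margins after boosting are suspect — can be sketched with a minimal AdaBoost over decision stumps on a toy dataset with one deliberately flipped label (the dataset and all function names are illustrative assumptions, not the author's implementation):

```python
import numpy as np

def stump_predict(x, thresh, sign):
    """Decision stump: predict `sign` where x > thresh, `-sign` otherwise."""
    return np.where(x > thresh, sign, -sign)

def fit_stump(x, y, w):
    """Exhaustive search for the stump with lowest weighted error."""
    best_err, best_thresh, best_sign = np.inf, None, None
    for thresh in np.unique(x):
        for sign in (1, -1):
            err = w[stump_predict(x, thresh, sign) != y].sum()
            if err < best_err:
                best_err, best_thresh, best_sign = err, thresh, sign
    return best_err, best_thresh, best_sign

def adaboost(x, y, rounds=20):
    """Plain binary AdaBoost; returns a list of (alpha, thresh, sign)."""
    w = np.full(len(x), 1.0 / len(x))
    ensemble = []
    for _ in range(rounds):
        err, thresh, sign = fit_stump(x, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        # Upweight misclassified cases, then renormalise.
        w = w * np.exp(-alpha * y * stump_predict(x, thresh, sign))
        w = w / w.sum()
        ensemble.append((alpha, thresh, sign))
    return ensemble

def margins(ensemble, x, y):
    """Normalised margin y*f(x) in [-1, 1]; the edge is (1 - margin) / 2."""
    total = sum(a for a, _, _ in ensemble)
    f = sum(a * stump_predict(x, t, s) for a, t, s in ensemble) / total
    return y * f

# Ten 1-D points, positive class for x >= 5, with one flipped (noisy) label.
x = np.arange(10.0)
y = np.where(x >= 5, 1, -1)
y[2] = 1                        # injected label noise at index 2
m = margins(adaboost(x, y), x, y)
```

On this toy data the cleanly classified points end up with margins near 1, while the noisy case (and its squeezed neighbours) sit at the bottom of the margin distribution — the kind of separation the paper's methodology exploits as boosting proceeds.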
References
Breiman, L. (1996a). Bagging predictors. Machine Learning, 24(2), 123–140.
Breiman, L. (1996b). Bias, Variance and Arcing Classifiers (Technical Report 460). Statistics Department, University of California, Berkeley.
Breiman, L. (1997). Arcing the edge (Technical Report 486). Statistics Department, University of California, Berkeley.
Breiman, L. (1999). Random Forests-Random Features (Technical Report 567). Statistics Department, University of California, Berkeley.
Dietterich, T.G. (1997). Machine learning research: Four current directions. AI Magazine, 18(4), 99–137.
Freund, Y., & Schapire, R.E. (1996). Experiments with a new boosting algorithm. Proceedings of the Thirteenth International Conference on Machine Learning (pp. 148–156). Morgan Kaufmann.
Freund, Y., & Schapire, R.E. (1997). A decision-theoretic generalisation of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119–139.
Friedman, J.H. (1997). On bias, variance, 0/1-loss and the curse of dimensionality. Data Mining and Knowledge Discovery, 1(1), 55–77.
Friedman, J.H., Hastie, T., & Tibshirani, R. (1998). Additive logistic regression: A statistical view of boosting (Technical Report 199). Department of Statistics, Stanford University.
Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
Quinlan, J.R. (1996). Bagging, boosting and C4.5. Proceedings of the Thirteenth National Conference on Artificial Intelligence (pp. 725–730). Menlo Park, CA: American Association for Artificial Intelligence.
Schapire, R.E., Freund, Y., Bartlett, P., & Lee, W.S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 26(5), 1651–1686.
Schapire, R.E., & Singer, Y. (1998). Improved boosting algorithms using confidence-rated predictions. Proceedings of the Eleventh Annual Conference on Computational Learning Theory (pp. 80–91).
© 2001 Springer-Verlag Berlin Heidelberg
Cite this paper
Wheway, V. (2001). Using Boosting to Detect Noisy Data. In: Kowalczyk, R., Loke, S.W., Reed, N.E., Williams, G.J. (eds) Advances in Artificial Intelligence. PRICAI 2000 Workshop Reader. PRICAI 2000. Lecture Notes in Computer Science(), vol 2112. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45408-X_13
Print ISBN: 978-3-540-42597-7
Online ISBN: 978-3-540-45408-3