Abstract
Real data are often corrupted by noise, which can originate from errors in data collection, storage and processing. Noise hampers the induction of Machine Learning models from data: it can impair their predictive or descriptive performance and lengthen training time. Moreover, the induced models can become overly complex in order to accommodate such errors. Thus, identifying and reducing noise in a data set may benefit the learning process. In this paper, we therefore investigate the use of data complexity measures to identify the presence of noise in a data set. This identification can support the decision on whether noise reduction techniques need to be applied.
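As a rough illustration of the idea, the sketch below computes one data complexity measure from the literature, the leave-one-out error rate of a 1-nearest-neighbour classifier, on a synthetic data set before and after label noise is injected. The abstract does not say which measures the paper uses, so the choice of measure, the synthetic data, the 20% noise rate and the helper names `loo_1nn_error` and `flip_labels` are all assumptions made for this example.

```python
import random


def loo_1nn_error(points, labels):
    """Leave-one-out error rate of a 1-nearest-neighbour classifier.

    Higher values suggest a more complex (or noisier) class boundary.
    """
    errors = 0
    for i, p in enumerate(points):
        best_j, best_d = None, float("inf")
        for j, q in enumerate(points):
            if i == j:
                continue  # leave the query point itself out
            d = sum((a - b) ** 2 for a, b in zip(p, q))  # squared Euclidean distance
            if d < best_d:
                best_j, best_d = j, d
        if labels[best_j] != labels[i]:
            errors += 1
    return errors / len(points)


def flip_labels(labels, rate, rng):
    """Return a copy of binary labels with a fraction `rate` of them flipped."""
    noisy = list(labels)
    n_flip = int(round(rate * len(labels)))
    for i in rng.sample(range(len(labels)), n_flip):
        noisy[i] = 1 - noisy[i]
    return noisy


rng = random.Random(0)

# Hypothetical data: two well-separated Gaussian blobs, 100 points per class.
points = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(100)]
points += [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(100)]
labels = [0] * 100 + [1] * 100

clean_score = loo_1nn_error(points, labels)
noisy_score = loo_1nn_error(points, flip_labels(labels, 0.2, rng))

print(f"clean: {clean_score:.3f}  noisy (20% labels flipped): {noisy_score:.3f}")
```

On data like this, the measure rises once labels are flipped, which is the kind of signal that could indicate a noisy data set. Real data sets are rarely this cleanly separated, so a single measure on its own is unlikely to be conclusive.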
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Garćia, L.P.F., de Carvalho, A.C.P.L.F., Lorena, A.C. (2013). Noisy Data Set Identification. In: Pan, JS., Polycarpou, M.M., Woźniak, M., de Carvalho, A.C.P.L.F., Quintián, H., Corchado, E. (eds) Hybrid Artificial Intelligent Systems. HAIS 2013. Lecture Notes in Computer Science, vol 8073. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40846-5_63
DOI: https://doi.org/10.1007/978-3-642-40846-5_63
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40845-8
Online ISBN: 978-3-642-40846-5
eBook Packages: Computer Science, Computer Science (R0)