Abstract
Data mining and knowledge discovery aim to produce useful and reliable models from data. Unfortunately, some databases contain noisy data that hinder the generalization of the induced models. An important source of noise is mislabelled training instances. We propose a new approach that improves classification accuracy through a preliminary filtering procedure. An example is suspect when, in its neighbourhood defined by a geometrical graph, the proportion of examples of the same class is not significantly greater than in the database as a whole. Such suspect examples in the training data can be removed or relabelled. The filtered training set is then given as input to the learning algorithms. Our experiments on ten benchmarks from the UCI Machine Learning Repository, using 1-NN as the final classifier, show that removal gives better results than relabelling. Removal maintains the generalization error rate when 0 to 20% class noise is introduced, especially when the classes are well separable. The proposed filtering method is finally compared with the relaxation relabelling schema.
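The filtering test described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a k-nearest-neighbour neighbourhood in place of the paper's geometrical graph, and a one-sided normal approximation to the binomial to decide whether the same-class proportion in the neighbourhood significantly exceeds the class's overall prevalence. The function names (`neighbours`, `filter_suspect`) and the toy data are hypothetical.

```python
import math

def neighbours(X, i, k):
    """Indices of the k points nearest to X[i] (squared Euclidean distance).
    Stand-in for the paper's geometrical-graph neighbourhood (assumption)."""
    d = sorted((sum((a - b) ** 2 for a, b in zip(X[i], X[j])), j)
               for j in range(len(X)) if j != i)
    return [j for _, j in d[:k]]

def filter_suspect(X, y, k=5, z_crit=1.645):
    """Keep an instance only when the share of same-class neighbours is
    significantly greater than that class's overall prevalence
    (one-sided normal approximation, default 5% level)."""
    n, kept = len(y), []
    for i in range(n):
        c = y[i]
        p0 = sum(1 for lbl in y if lbl == c) / n           # base rate of class c
        same = sum(1 for j in neighbours(X, i, k) if y[j] == c)
        se = math.sqrt(p0 * (1 - p0) / k)
        if se == 0 or (same / k - p0) / se > z_crit:       # neighbourhood "purer" than base rate
            kept.append(i)                                 # otherwise: suspect, removed
    return kept

# Toy data: two well-separated clusters, plus one point near cluster 0
# that carries class 1's label (a mislabelled instance).
X = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5), (0.2, 0.8),   # class 0
     (10, 10), (11, 10), (10, 11), (11, 11), (10.5, 10.5),     # class 1
     (3, 3)]                                                    # mislabelled
y = [0] * 6 + [1] * 5 + [1]

print(filter_suspect(X, y, k=5))   # the mislabelled index 11 is filtered out
```

With well-separated classes, genuine instances have neighbourhoods far purer than the base rate and pass the test, while the mislabelled point, surrounded by the other class, fails it; this matches the abstract's observation that removal works best when classes are well separable.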
References
Aha, D.W., Kibler, D., and Albert, M.K. (1991). Instance-Based Learning Algorithms. Machine Learn., 6, 37-66.
Barnett, V. and Lewis, T. (1984). Outliers in Statistical Data, 2nd edition. Chichester: Wiley.
Beckman, R.J. and Cook, R.D. (1983). Outlier...s. Technometrics, 25, 119-149.
Blake, C.L. and Merz, C.J. (1998). UCI Repository of Machine Learning Databases. Irvine, CA: University of California, Department of Information and Computer Science [http://www.ics.uci.edu/~mlearn/MLRepository.html].
Brodley, C.E. and Friedl, M.A. (1996). Identifying and Eliminating Mislabeled Training Instances. In Proc. of the 13th National Conference on Artificial Intelligence (pp. 799-805). Portland, OR: AAAI Press.
Brodley, C.E. and Friedl, M.A. (1999). Identifying Mislabeled Training Data. JAIR, 11, 131-167.
Cliff, A.D. and Ord, J.K. (1981). Spatial Processes, Models and Applications. London: Pion Limited.
Cover, T.M. and Hart, P.E. (1967). Nearest Neighbor Pattern Classification. IEEE Transactions on Information Theory, 13, 21-27.
Elfving, T. and Eklundh, J.O. (1982). Some Properties of Stochastic Labeling Procedures. Computer Graphics and Image Processing, 20, 158-170.
Hummel, R. and Zucker, S. (1983). On the Foundations of Relaxation Labelling Processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5(3), 267-287.
Jain, A.K. and Dubes, R.C. (1988). Algorithms for Clustering Data. Prentice Hall.
John, G.H. (1995). Robust Decision Trees: Removing Outliers from Data. In Proc. of the First International Conference on Knowledge Discovery and Data Mining (pp. 174-179). Montréal: AAAI Press.
Kittler, J. and Illingworth, J. (1985). Relaxation Labelling Algorithms-A Review. Image and Vision Computing, 3(4), 206-216.
Knorr, E.M., Ng, R.T., and Tucakov, V. (2000). Distance-Based Outliers: Algorithms and Applications. The VLDB Journal, 8(3), 237-253.
Lallich, S., Muhlenbach, F., and Zighed, D.A. (2002). Improving Classification by Removing or Relabeling Mislabeled Instances. In Foundations of Intelligent Systems, Proc. of the 13th International Symposium on Methodologies for Intelligent Systems (ISMIS 2002) (pp. 5-15). Lyon, France, LNAI 2366, Springer-Verlag.
Lallich, S., Muhlenbach, F., and Zighed, D.A. (2003). Traitement des exemples atypiques en apprentissage par la régression. RSTI, série RIA-ECA, 17(1-3), 399-410.
Largeron, C. (1991). Reconnaissance des formes par relaxation: un modèle d'aide à la décision. Ph.D. Thesis, Université Lyon 1.
Milligan, G.W. and Cooper, M.C. (1988). A Study of Standardization of Variables in Cluster Analysis. Journal of Classification, 5, 181-204.
Mood, A. (1940). The Distribution Theory of Runs. Ann. of Math. Statist., 11, 367-392.
Moran, P.A.P. (1948). The Interpretation of Statistical Maps. Journal of the Royal Statistical Society, Series B, 246-251.
Muhlenbach, F., Lallich, S., and Zighed, D.A. (2002). Amélioration d'une classification par filtrage des exemples malétiquetés. ECA, 1(4), 155-166.
Quinlan, J.R. (1986). Induction of Decision Trees. Machine Learning, 1, 81-106.
Rosenfeld, A., Hummel, R.A., and Zucker, S.W. (1976). Scene Labeling by Relaxation Operations. IEEE Transactions on Systems Man and Cybernetics, 6(6), 420-433.
Tomek, I. (1976). An Experiment with the Edited Nearest Neighbor Rule. IEEE Transactions on Systems, Man and Cybernetics, 6(6), 448-452.
Toussaint, G.T. (1980). The Relative Neighbourhood Graph of a Finite Planar Set. Pattern Recog., 12, 261-268.
Wald, A. and Wolfowitz, J. (1940). On a Test Whether Two Samples are from the Same Population. Ann. of Math. Statist., 11, 147-162.
Wilson, D.L. (1972). Asymptotic Properties of Nearest Neighbor Rules Using Edited Data. IEEE Transactions on Systems, Man and Cybernetics, 2, 408-421.
Wilson, D.R. and Martinez, T.R. (2000). Reduction Techniques for Exemplar-Based Learning Algorithms. Machine Learning, 38, 257-268.
Zighed, D.A., Lallich, S., and Muhlenbach, F. (2001). Séparabilité des classes dans R^p. In Actes du VIIIème Congrès de la Société Francophone de Classification-SFC'01 (pp. 356-363). Pointe-à-Pitre, France.
Zighed, D.A., Lallich, S., and Muhlenbach, F. (2002). Separability Index in Supervised Learning. In Principles of Data Mining and Knowledge Discovery, Proc. of the 6th European Conference PKDD 2002 (pp. 475-487). Helsinki, Finland, LNAI 2431, Springer-Verlag.
Zighed, D.A. and Sebban, M. (1999). Sélection et validation statistique de variables et de prototypes. In M. Sebban and G. Venturini (Eds.), Apprentissage Automatique (pp. 85-107). Paris, Hermes.
Zighed, D.A., Tounissoux, D., Auray, J.P., and Largeron, C. (1990). Discrimination basée sur un critère d'homogénéité locale. Traitement du Signal, 2, 213-220.
Cite this article
Muhlenbach, F., Lallich, S. & Zighed, D.A. Identifying and Handling Mislabelled Instances. Journal of Intelligent Information Systems 22, 89–109 (2004). https://doi.org/10.1023/A:1025832930864