
Identifying and Handling Mislabelled Instances

Published in: Journal of Intelligent Information Systems

Abstract

Data mining and knowledge discovery aim to produce useful and reliable models from data. Unfortunately, some databases contain noisy data that perturbs the generalization of the models. An important source of noise is mislabelled training instances. We offer a new approach that improves classification accuracy by means of a preliminary filtering procedure. An example is suspect when, in its neighbourhood defined by a geometrical graph, the proportion of examples of the same class is not significantly greater than in the database as a whole. Such suspect examples in the training data can be removed or relabelled. The filtered training set is then provided as input to learning algorithms. Our experiments on ten benchmarks from the UCI Machine Learning Repository, using 1-NN as the final algorithm, show that removal gives better results than relabelling: it maintains the generalization error rate when from 0 to 20% class noise is introduced, especially when the classes are well separable. The proposed filtering method is finally compared to the relaxation relabelling scheme.
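The filtering rule described in the abstract can be sketched in plain Python. The snippet below is a minimal illustration, not the paper's method: it substitutes a k-nearest-neighbour graph for the geometrical neighbourhood graph, and a fixed margin over the global class proportion for the paper's significance test. The parameters `k` and `margin` and the toy data are hypothetical choices for the example.

```python
import math
from collections import Counter

def neighbours(points, i, k=3):
    """Indices of the k nearest neighbours of points[i] (Euclidean).
    A simple stand-in for the paper's geometrical neighbourhood graph."""
    dists = sorted((math.dist(points[i], points[j]), j)
                   for j in range(len(points)) if j != i)
    return [j for _, j in dists[:k]]

def filter_suspects(points, labels, k=3, margin=0.2):
    """Keep an example only if the proportion of same-class neighbours
    exceeds the global proportion of its class by `margin`
    (a crude surrogate for the paper's significance test)."""
    n = len(points)
    prior = Counter(labels)
    kept = []
    for i in range(n):
        same = sum(labels[j] == labels[i]
                   for j in neighbours(points, i, k)) / k
        if same > prior[labels[i]] / n + margin:
            kept.append(i)
    return kept

def predict_1nn(train_pts, train_lbls, x):
    """Classify x with the 1-NN rule on the (filtered) training set."""
    _, j = min((math.dist(p, x), j) for j, p in enumerate(train_pts))
    return train_lbls[j]

# Toy data: two well-separated classes, plus one mislabelled point (index 4)
# sitting inside the class-0 cluster but labelled 1.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (5, 5), (5, 6), (6, 5), (6, 6)]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1]

kept = filter_suspects(points, labels)   # drops the mislabelled index 4
clean_pts = [points[i] for i in kept]
clean_lbls = [labels[i] for i in kept]
```

On this toy set, a query point near the mislabelled example (e.g. `(0.6, 0.6)`) is classified 1 by 1-NN on the raw data but 0 on the filtered data, mirroring the abstract's claim that removal preserves generalization under class noise.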




Cite this article

Muhlenbach, F., Lallich, S. & Zighed, D.A. Identifying and Handling Mislabelled Instances. Journal of Intelligent Information Systems 22, 89–109 (2004). https://doi.org/10.1023/A:1025832930864
