Abstract
Data mining and knowledge discovery aim to produce useful and reliable models from data. Unfortunately, some databases contain noisy data that hinder the generalization of the induced models. An important source of noise is mislabelled training instances. We propose a new approach that improves classification accuracy through a preliminary filtering procedure. An example is suspect when, in its neighbourhood defined by a geometrical graph, the proportion of examples of the same class is not significantly greater than in the database as a whole. Such suspect examples in the training data can be removed or relabelled. The filtered training set is then given as input to the learning algorithms. Our experiments on ten benchmarks from the UCI Machine Learning Repository, using 1-NN as the final classifier, show that removal gives better results than relabelling. Removal maintains the generalization error rate when 0 to 20% class noise is introduced, especially when the classes are well separable. The proposed filtering method is finally compared with the relaxation relabelling schema.
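The filtering test described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a k-nearest-neighbour neighbourhood in place of the paper's geometrical graph, and a one-sided normal approximation to the binomial to decide whether the same-class proportion in the neighbourhood significantly exceeds the class's overall prevalence. The function names (`neighbours`, `filter_suspect`) and the toy data are hypothetical.

```python
import math

def neighbours(X, i, k):
    """Indices of the k points nearest to X[i] (squared Euclidean distance).
    Stand-in for the paper's geometrical-graph neighbourhood (assumption)."""
    d = sorted((sum((a - b) ** 2 for a, b in zip(X[i], X[j])), j)
               for j in range(len(X)) if j != i)
    return [j for _, j in d[:k]]

def filter_suspect(X, y, k=5, z_crit=1.645):
    """Keep an instance only when the share of same-class neighbours is
    significantly greater than that class's overall prevalence
    (one-sided normal approximation, default 5% level)."""
    n, kept = len(y), []
    for i in range(n):
        c = y[i]
        p0 = sum(1 for lbl in y if lbl == c) / n           # base rate of class c
        same = sum(1 for j in neighbours(X, i, k) if y[j] == c)
        se = math.sqrt(p0 * (1 - p0) / k)
        if se == 0 or (same / k - p0) / se > z_crit:       # neighbourhood "purer" than base rate
            kept.append(i)                                 # otherwise: suspect, removed
    return kept

# Toy data: two well-separated clusters, plus one point near cluster 0
# that carries class 1's label (a mislabelled instance).
X = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5), (0.2, 0.8),   # class 0
     (10, 10), (11, 10), (10, 11), (11, 11), (10.5, 10.5),     # class 1
     (3, 3)]                                                    # mislabelled
y = [0] * 6 + [1] * 5 + [1]

print(filter_suspect(X, y, k=5))   # the mislabelled index 11 is filtered out
```

With well-separated classes, genuine instances have neighbourhoods far purer than the base rate and pass the test, while the mislabelled point, surrounded by the other class, fails it; this matches the abstract's observation that removal works best when classes are well separable.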
References
Aha, D.W., Kibler, D., and Albert, M.K. (1991). Instance-Based Learning Algorithms. Machine Learn., 6, 37-66.
Barnett, V. and Lewis, T. (1984). Outliers in Statistical Data, 2nd edition. Chichester: Wiley.
Beckman, R.J. and Cook, R.D. (1983). Outlier...s. Technometrics, 25, 119-149.
Blake, C.L. and Merz, C.J. (1998). UCI Repository of Machine Learning Databases. Irvine, CA: University of California, Department of Information and Computer Science [http://www.ics.uci.edu/~mlearn/MLRepository.html].
Brodley, C.E. and Friedl, M.A. (1996). Identifying and Eliminating Mislabeled Training Instances. In Proc. of the 13th National Conference on Artificial Intelligence (pp. 799-805). Portland, OR: AAAI Press.
Brodley, C.E. and Friedl, M.A. (1999). Identifying Mislabeled Training Data. JAIR, 11, 131-167.
Cliff, A.D. and Ord, J.K. (1981). Spatial Processes, Models and Applications. London: Pion Limited.
Cover, T.M. and Hart, P.E. (1967). Nearest Neighbor Pattern Classification. IEEE Transactions on Information Theory, 13, 21-27.
Elfving, T. and Eklundh, J.O. (1982). Some Properties of Stochastic Labeling Procedures. Computer Graphics and Image Processing, 20, 158-170.
Hummel, R. and Zucker, S. (1983). On the Foundations of Relaxation Labelling Processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5(3), 267-287.
Jain, A.K. and Dubes, R.C. (1988). Algorithms for Clustering Data. Prentice Hall.
John, G.H. (1995). Robust Decision Trees: Removing Outliers from Data. In Proc. of the First International Conference on Knowledge Discovery and Data Mining (pp. 174-179). Montréal: AAAI Press.
Kittler, J. and Illingworth, J. (1985). Relaxation Labelling Algorithms-A Review. Image and Vision Computing, 3(4), 206-216.
Knorr, E.M., Ng, R.T., and Tucakov, V. (2000). Distance-Based Outliers: Algorithms and Applications. The VLDB Journal, 8(3), 237-253.
Lallich, S., Muhlenbach, F., and Zighed, D.A. (2002). Improving Classification by Removing or Relabeling Mislabeled Instances. In Foundations of Intelligent Systems, Proc. of the 13th International Symposium on Methodologies for Intelligent Systems (ISMIS 2002) (pp. 5-15). Lyon, France, LNAI 2366, Springer-Verlag.
Lallich, S., Muhlenbach, F., and Zighed, D.A. (2003). Traitement des exemples atypiques en apprentissage par la régression. RSTI, série RIA-ECA, 17(1-3), 399-410.
Largeron, C. (1991). Reconnaissance des formes par relaxation: un modèle d'aide à la décision. Ph.D. Thesis, Université Lyon 1.
Milligan, G.W. and Cooper, M.C. (1988). A Study of Standardization of Variables in Cluster Analysis. Journal of Classification, 5, 181-204.
Mood, A. (1940). The Distribution Theory of Runs. Ann. of Math. Statist., 11, 367-392.
Moran, P.A.P. (1948). The Interpretation of Statistical Maps. Journal of the Royal Statistical Society, Series B, 246-251.
Muhlenbach, F., Lallich, S., and Zighed, D.A. (2002). Amélioration d'une classification par filtrage des exemples malétiquetés. ECA, 1(4), 155-166.
Quinlan, J.R. (1986). Induction of Decision Trees. Machine Learning, 1, 81-106.
Rosenfeld, A., Hummel, R.A., and Zucker, S.W. (1976). Scene Labeling by Relaxation Operations. IEEE Transactions on Systems Man and Cybernetics, 6(6), 420-433.
Tomek, I. (1976). An Experiment with the Edited Nearest Neighbor Rule. IEEE Transactions on Systems, Man and Cybernetics, 6(6), 448-452.
Toussaint, G.T. (1980). The Relative Neighbourhood Graph of a Finite Planar Set. Pattern Recog., 12, 261-268.
Wald, A. and Wolfowitz, J. (1940). On a Test Whether Two Samples are from the Same Population. Ann. of Math. Statist., 11, 147-162.
Wilson, D.L. (1972). Asymptotic Properties of Nearest Neighbor Rules Using Edited Data. IEEE Transactions on Systems, Man and Cybernetics, 2, 408-421.
Wilson, D.R. and Martinez, T.R. (2000). Reduction Techniques for Exemplar-Based Learning Algorithms. Machine Learning, 38, 257-268.
Zighed, D.A., Lallich, S., and Muhlenbach, F. (2001). Séparabilité des classes dans R^p. In Actes du VIIIème Congrès de la Société Francophone de Classification-SFC'01 (pp. 356-363). Pointe-à-Pitre, France.
Zighed, D.A., Lallich, S., and Muhlenbach, F. (2002). Separability Index in Supervised Learning. In Principles of Data Mining and Knowledge Discovery, Proc. of the 6th European Conference PKDD 2002 (pp. 475-487). Helsinki, Finland, LNAI 2431, Springer-Verlag.
Zighed, D.A. and Sebban, M. (1999). Sélection et validation statistique de variables et de prototypes. In M. Sebban and G. Venturini (Eds.), Apprentissage Automatique (pp. 85-107). Paris, Hermes.
Zighed, D.A., Tounissoux, D., Auray, J.P., and Largeron, C. (1990). Discrimination basée sur un critère d'homogénéité locale. Traitement du Signal, 2, 213-220.
Cite this article
Muhlenbach, F., Lallich, S. & Zighed, D.A. Identifying and Handling Mislabelled Instances. Journal of Intelligent Information Systems 22, 89–109 (2004). https://doi.org/10.1023/A:1025832930864