Training Data Cleaning for Text Classification

Esuli, Andrea; Sebastiani, Fabrizio

doi:10.1007/978-3-642-04417-5_4

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5766)

Included in the following conference series:

Conference on the Theory of Information Retrieval

Abstract

In text classification (TC) and other tasks involving supervised learning, labelled data may be scarce or expensive to obtain; strategies are thus needed for maximizing the effectiveness of the resulting classifiers while minimizing the required amount of training effort. Training data cleaning (TDC) consists of devising ranking functions that sort the original training examples by how likely it is that the human annotator has misclassified them, thereby giving the annotator a convenient means of revising the training set to improve its quality. Working in the context of boosting-based learning methods, we present three different techniques for performing TDC and evaluate them, on two widely used TC benchmarks, by their ability to spot misclassified texts purposely inserted in the training set.
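
The abstract does not spell out the three ranking functions, so the following Python sketch is only an illustration of the general idea, not the authors' method. It assumes a binary category, annotator labels in {+1, -1}, and a real-valued classifier score per training example (sign = predicted class, magnitude = confidence), as a boosting-based learner would typically produce. The function names `rank_by_suspicion` and `precision_at_k`, the margin heuristic, and the toy data are all assumptions introduced here. Examples are ranked by the margin label × score, so that confident disagreements with the annotator come first, and the ranking is then checked against labels that were flipped on purpose.

```python
from typing import List, Sequence, Set


def rank_by_suspicion(labels: Sequence[int], scores: Sequence[float]) -> List[int]:
    """Return example indices, most suspicious first.

    labels: annotator labels, each +1 or -1 (assumed binary setting).
    scores: real-valued classifier outputs; sign is the predicted class,
            magnitude is the confidence (assumed, e.g. from a boosted model).
    The margin label * score is strongly negative when a confident
    classifier disagrees with the annotator, so we sort by ascending margin.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    margins = [y * s for y, s in zip(labels, scores)]
    return sorted(range(len(labels)), key=lambda i: margins[i])


def precision_at_k(ranking: Sequence[int], flipped: Set[int], k: int) -> float:
    """Fraction of the top-k ranked examples whose label was deliberately flipped."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for i in ranking[:k] if i in flipped) / k


# Toy data (hypothetical): clean labels, classifier scores, and two injected errors.
clean_labels = [+1, +1, -1, -1, +1, -1]
scores = [2.1, 0.4, -1.8, -0.3, 1.5, -2.5]
flipped = {0, 5}  # indices whose labels we corrupt on purpose
noisy_labels = [-y if i in flipped else y for i, y in enumerate(clean_labels)]

ranking = rank_by_suspicion(noisy_labels, scores)
print(ranking)                              # [5, 0, 3, 1, 4, 2]
print(precision_at_k(ranking, flipped, 2))  # 1.0
```

How the scores are obtained is left open in this sketch; scoring each example with a model that was not trained on it is one common way to keep the classifier from simply memorizing a wrong label.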

Author information

Authors and Affiliations

Istituto di Scienza e Tecnologia dell’Informazione, Consiglio Nazionale delle Ricerche, Via Giuseppe Moruzzi 1, 56124, Pisa, Italy
Andrea Esuli & Fabrizio Sebastiani

Editor information

Editors and Affiliations

Department of Computing Science, Sir Alwyn Williams Building, Lilybank Gardens, University of Glasgow, G12 8QQ, Glasgow, Scotland, UK
Leif Azzopardi
Microsoft Research Ltd, 7 JJ Thomson Avenue, CB3 0FB, Cambridge, UK
Gabriella Kazai & Stephen Robertson
Knowledge Media Institute, The Open University, MK7 6AA, Milton Keynes, UK
Stefan Rüger
Microsoft Research Ltd, 7 JJ Thomson Avenue, CB3 0FB, Cambridge, United Kingdom
Milad Shokouhi & Emine Yilmaz
School of Computing, The Robert Gordon University, St Andrew Street, AB25 1HG, Aberdeen, UK
Dawei Song

About this paper

Cite this paper

Esuli, A., Sebastiani, F. (2009). Training Data Cleaning for Text Classification. In: Azzopardi, L., et al. Advances in Information Retrieval Theory. ICTIR 2009. Lecture Notes in Computer Science, vol 5766. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04417-5_4

DOI: https://doi.org/10.1007/978-3-642-04417-5_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04416-8
Online ISBN: 978-3-642-04417-5
eBook Packages: Computer Science, Computer Science (R0)
