Abstract
In text classification (TC) and other tasks involving supervised learning, labelled data may be scarce or expensive to obtain; strategies are thus needed for maximizing the effectiveness of the resulting classifiers while minimizing the required amount of training effort. Training data cleaning (TDC) consists in devising ranking functions that sort the original training examples by how likely it is that the human annotator has misclassified them, thereby providing a convenient means for the annotator to revise the training set and improve its quality. Working in the context of boosting-based learning methods, we present three different techniques for performing TDC and, on two widely used TC benchmarks, evaluate them by their ability to spot misclassified texts purposefully inserted into the training set.
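To make the idea of a TDC ranking function concrete, here is a minimal sketch in Python. It is not one of the three techniques the paper presents, which the abstract does not describe. It assumes a generic confidence-based heuristic: each training example has a signed classifier score for a binary category, and examples are ranked by how strongly that score contradicts the annotator's label. The function name `rank_by_suspicion` and the toy data are hypothetical.

```python
from typing import List, Sequence


def rank_by_suspicion(scores: Sequence[float], labels: Sequence[int]) -> List[int]:
    """Return example indices sorted from most to least likely mislabelled.

    scores[i]: signed classifier confidence for example i (assumed input;
               positive means the classifier believes the example belongs
               to the category).
    labels[i]: the annotator's label, +1 or -1.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    # The margin is negative when the classifier disagrees with the annotator;
    # the more negative it is, the more confident the disagreement.
    margins = [label * score for score, label in zip(scores, labels)]
    # sorted() is stable, so tied examples keep their original order.
    return sorted(range(len(margins)), key=lambda i: margins[i])


# Toy data, invented for illustration only.
scores = [2.1, -0.3, -1.8, 0.4]
labels = [+1, +1, +1, -1]
ranking = rank_by_suspicion(scores, labels)
print(ranking)
```

An annotator would then review examples in the returned order, so that the examples whose labels the classifier most confidently contradicts are inspected first.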
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Esuli, A., Sebastiani, F. (2009). Training Data Cleaning for Text Classification. In: Azzopardi, L., et al. Advances in Information Retrieval Theory. ICTIR 2009. Lecture Notes in Computer Science, vol 5766. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04417-5_4
DOI: https://doi.org/10.1007/978-3-642-04417-5_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04416-8
Online ISBN: 978-3-642-04417-5
eBook Packages: Computer Science, Computer Science (R0)