
Instance Filtering for entity recognition

Published: 01 June 2005

Abstract

In this paper we propose Instance Filtering as a preprocessing step for supervised, classification-based learning systems for entity recognition. The goal of Instance Filtering is to reduce both the skewness of the class distribution and the data set size by eliminating negative instances while preserving as many positive instances as possible. The filter is applied to both the training set and the test set, which reduces learning and classification time while maintaining or improving prediction accuracy. We performed a comparative study of a class of Instance Filtering techniques, called Stop Word Filters, that simply remove all tokens belonging to a list of stop words. We evaluated our approach on three entity recognition tasks (Named Entity, Bio-Entity and Temporal Expression Recognition) in English and Dutch, showing that both the skewness and the data set size are drastically reduced. As a consequence, the computation time required for training and classification drops substantially, while prediction accuracy is maintained and sometimes improved.
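The idea behind a Stop Word Filter can be illustrated with a minimal sketch. It assumes that each instance is a (token, label) pair and that the label "O" marks a negative instance. The stop word list, the toy sentence and the function names are hypothetical and are not taken from the paper, which derives and compares its stop word lists in ways the abstract does not detail.

```python
from typing import List, Set, Tuple

# One instance per token: (token, label). "O" marks a negative instance
# (assumed labelling convention, not stated in the abstract).
Instance = Tuple[str, str]


def stop_word_filter(instances: List[Instance], stop_words: Set[str]) -> List[Instance]:
    """Drop every instance whose token belongs to the stop word list."""
    return [(tok, lab) for tok, lab in instances if tok.lower() not in stop_words]


def negative_ratio(instances: List[Instance]) -> float:
    """Fraction of instances that are negative, as a simple skewness measure."""
    if not instances:
        return 0.0
    return sum(1 for _, lab in instances if lab == "O") / len(instances)


# Hypothetical stop word list and toy training sentence.
STOP_WORDS = {"the", "of", "in", "on", "was", "."}
train = [
    ("The", "O"), ("president", "O"), ("of", "O"), ("Italy", "B-LOC"),
    ("was", "O"), ("in", "O"), ("Rome", "B-LOC"), ("on", "O"),
    ("Monday", "O"), (".", "O"),
]

filtered = stop_word_filter(train, STOP_WORDS)

print(len(train), negative_ratio(train))        # size and skewness before filtering
print(len(filtered), negative_ratio(filtered))  # size and skewness after filtering
```

In this toy case the filter removes only negative instances, so the data set shrinks and the class distribution becomes less skewed. A stop word that occurs inside an entity would be lost as a positive instance, which is why the stated goal is to preserve positives as much as possible. When the same filter is applied to the test set, the removed tokens still need a label; treating them as negative is an assumption of this sketch.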

Published In

ACM SIGKDD Explorations Newsletter  Volume 7, Issue 1
Natural language processing and text mining
June 2005
81 pages
ISSN:1931-0145
EISSN:1931-0153
DOI:10.1145/1089815

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Article
