Abstract
In this paper we present a novel instance pruning technique for Information Extraction (IE). The technique filters uninformative words out of texts on the assumption that words that are very frequent in the language carry no specific information about the text in which they appear, so they are unlikely to be (part of) relevant entities. Experiments on two benchmark datasets show that computation time can be significantly reduced without any significant decrease in prediction accuracy. We also report an improvement in accuracy for one task.
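The filtering idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the exact filtering criterion, so the relative-frequency threshold, the toy background corpus, and the names build_frequency_table, prune_instances, and max_rel_freq are all assumptions made for illustration only.

```python
from collections import Counter


def build_frequency_table(corpus_tokens):
    """Relative frequency of each lower-cased token in a background corpus."""
    counts = Counter(tok.lower() for tok in corpus_tokens)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def prune_instances(tokens, rel_freq, max_rel_freq=0.05):
    """Keep (position, token) pairs whose background frequency is at most
    max_rel_freq. Tokens unseen in the background corpus are kept.

    The threshold is a hypothetical parameter, not a value from the paper.
    """
    return [
        (i, tok)
        for i, tok in enumerate(tokens)
        if rel_freq.get(tok.lower(), 0.0) <= max_rel_freq
    ]


# Toy background corpus (assumed); a real setting would use a large corpus.
background = (
    "the seminar is in the hall and the speaker is the dean of the school"
).split()
rel_freq = build_frequency_table(background)

sentence = "The speaker is in the hall".split()
kept = prune_instances(sentence, rel_freq, max_rel_freq=0.1)
print(kept)
```

Only the surviving tokens would then be passed on as candidate instances to the downstream IE classifier, which is where the reduction in computation time would come from. Keeping the original positions lets the remaining candidates be mapped back onto the unfiltered text.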
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Gliozzo, A.M., Giuliano, C., Rinaldi, R. (2005). Instance Pruning by Filtering Uninformative Words: An Information Extraction Case Study. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2005. Lecture Notes in Computer Science, vol 3406. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30586-6_54
DOI: https://doi.org/10.1007/978-3-540-30586-6_54
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24523-0
Online ISBN: 978-3-540-30586-6
eBook Packages: Computer Science, Computer Science (R0)