Instance Pruning by Filtering Uninformative Words: An Information Extraction Case Study

Gliozzo, Alfio Massimiliano; Giuliano, Claudio; Rinaldi, Raffaella

doi:10.1007/978-3-540-30586-6_54

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3406)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

In this paper we present a novel instance pruning technique for Information Extraction (IE). The technique filters uninformative words out of texts, on the assumption that very frequent words in the language provide no specific information about the text in which they appear, so their expectation of being (part of) relevant entities is very low. Experiments on two benchmark datasets show that computation time can be significantly reduced without any significant decrease in prediction accuracy. We also report an improvement in accuracy for one task.
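
The abstract does not spell out the filtering procedure, so the following is only a minimal sketch of the general idea, assuming that "very frequent" is decided by a relative-frequency threshold measured on a background corpus. The function names, the `max_rel_freq` parameter, and the lowercasing step are illustrative assumptions, not the authors' method.

```python
from collections import Counter


def build_uninformative_set(background_tokens, max_rel_freq=0.01):
    """Collect words whose relative frequency in a background corpus
    exceeds max_rel_freq (a hypothetical threshold, not from the paper)."""
    counts = Counter(tok.lower() for tok in background_tokens)
    total = sum(counts.values())
    if total == 0:
        return set()
    return {word for word, c in counts.items() if c / total > max_rel_freq}


def prune_instances(tokens, uninformative):
    """Drop tokens found in the uninformative set before classification.

    Original positions are kept so that predictions made on the pruned
    instances can be mapped back onto the text.
    """
    return [
        (i, tok)
        for i, tok in enumerate(tokens)
        if tok.lower() not in uninformative
    ]


# Toy example: "the" accounts for 4 of 11 background tokens.
background = "the cat sat on the mat the dog saw the cat".split()
uninformative = build_uninformative_set(background, max_rel_freq=0.2)
kept = prune_instances("The protein binds the receptor".split(), uninformative)
```

In this toy run only "the" crosses the threshold, so the classifier would see three candidate tokens instead of five. A realistic setting would estimate frequencies from a large general-language corpus and tune the threshold on held-out data.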

Author information

Authors and Affiliations

Istituto per la Ricerca Scientifica e Tecnologica, ITC-irst, I-38050, Trento, Italy
Alfio Massimiliano Gliozzo, Claudio Giuliano & Raffaella Rinaldi

Editor information

Editors and Affiliations

National Polytechnic Institute, Center for Computing Research, 07738, Mexico City, México
Alexander Gelbukh

Cite this paper

Gliozzo, A.M., Giuliano, C., Rinaldi, R. (2005). Instance Pruning by Filtering Uninformative Words: An Information Extraction Case Study. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2005. Lecture Notes in Computer Science, vol 3406. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30586-6_54

DOI: https://doi.org/10.1007/978-3-540-30586-6_54
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24523-0
Online ISBN: 978-3-540-30586-6
eBook Packages: Computer Science, Computer Science (R0)
