research-article

An adaptive information extraction system based on wrapper induction with POS tagging

Authors:
Rinaldo Lima (Cidade Universitária, Recife, PE, Brazil)
Bernard Espinasse (Domaine Universitaire de St Jerôme, Marseille Cedex, France)
Fred Freitas (Cidade Universitária, Recife, PE, Brazil)

SAC '10: Proceedings of the 2010 ACM Symposium on Applied Computing, March 2010, Pages 1815–1820
https://doi.org/10.1145/1774088.1774471

Published: 22 March 2010

ABSTRACT

Information Extraction (IE) performs two important tasks: identifying specific pieces of information in documents and storing them for future use. This work proposes an adaptive IE system based on Boosted Wrapper Induction (BWI), a supervised wrapper induction algorithm. However, some authors have shown that boosting techniques have difficulty processing natural language texts. This observation motivated us to couple Part-of-Speech (POS) tagging with the BWI algorithm in the proposed system. To evaluate its performance, several experiments were carried out on three standard corpora. The results suggest that combining POS tagging with BWI yields a small performance gain of 3-5% over the original BWI algorithm on unstructured texts. These results place our system among the best comparable IE systems that use POS tagging, according to a comparison presented and discussed in the article.
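
As a rough illustration of the idea summarized above, the following Python sketch shows how BWI-style boundary detectors might use POS tags as wildcards. It is not the system described in the article: the `Token` and `Detector` classes, the `<TAG>` wildcard notation, the detector weights, the length probabilities, and the example sentence are all assumptions made up for this sketch. It follows only the general BWI scheme, in which a span from boundary i to boundary j is extracted when the product of the start-detector score, the end-detector score, and a field-length probability exceeds a threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical token: surface word plus a POS tag."""
    word: str
    pos: str


def matches(pattern, tokens):
    """Return True if every pattern item matches the aligned token.

    A pattern item is a literal word ("at") or a POS wildcard written
    as "<TAG>" (for example "<CD>"). This notation is an assumption.
    """
    if len(pattern) != len(tokens):
        return False
    for item, tok in zip(pattern, tokens):
        if item.startswith("<") and item.endswith(">"):
            if tok.pos != item[1:-1]:
                return False
        elif tok.word.lower() != item.lower():
            return False
    return True


@dataclass(frozen=True)
class Detector:
    prefix: tuple   # pattern for the tokens just before the boundary
    suffix: tuple   # pattern for the tokens just after the boundary
    weight: float   # confidence a boosting round would assign (made up here)

    def fires(self, tokens, i):
        # Boundary i lies between tokens[i - 1] and tokens[i].
        before = tokens[max(0, i - len(self.prefix)):i]
        after = tokens[i:i + len(self.suffix)]
        return matches(self.prefix, before) and matches(self.suffix, after)


def score(detectors, tokens, i):
    """Sum the weights of all detectors that fire at boundary i."""
    return sum(d.weight for d in detectors if d.fires(tokens, i))


def extract(tokens, fore, aft, length_prob, threshold):
    """Return (start, end, score) for spans whose combined score passes the threshold."""
    spans = []
    for i in range(len(tokens)):
        start_score = score(fore, tokens, i)
        if start_score == 0:
            continue
        for j in range(i + 1, len(tokens) + 1):
            s = start_score * score(aft, tokens, j) * length_prob.get(j - i, 0.0)
            if s > threshold:
                spans.append((i, j, s))
    return spans


# Made-up example sentence with hand-assigned Penn Treebank style tags.
tagged = [("The", "DT"), ("seminar", "NN"), ("starts", "VBZ"), ("at", "IN"),
          ("3", "CD"), ("pm", "NN"), ("in", "IN"), ("room", "NN"), ("101", "CD")]
tokens = [Token(w, p) for w, p in tagged]

fore = [Detector(prefix=("at",), suffix=("<CD>",), weight=1.0)]   # start of a time
aft = [Detector(prefix=("<NN>",), suffix=("in",), weight=1.0)]    # end of a time
length_prob = {1: 0.2, 2: 0.8}                                    # assumed field lengths

spans = extract(tokens, fore, aft, length_prob, threshold=0.5)
for start, end, _ in spans:
    print(" ".join(t.word for t in tokens[start:end]))
```

The POS wildcard lets a single detector cover any number after "at" rather than one literal token, which is one plausible way tagging could help a word-level wrapper generalize on free text.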

Index Terms

1. Computing methodologies
  1. Machine learning
    1. Machine learning approaches
      1. Logical and relational learning
        Inductive logic learning
2. Information systems
  1. Information retrieval

Published in
SAC '10: Proceedings of the 2010 ACM Symposium on Applied Computing
March 2010
2712 pages
ISBN: 9781605586397
DOI: 10.1145/1774088
Conference Chairs:
Sung Y. Shin (South Dakota State University)
Sascha Ossowski (University Rey Juan Carlos, Spain)
Michael Schumacher (University of Applied Sciences Western Switzerland, Switzerland)
Program Chairs:
Mathew J. Palakal (Indiana University Purdue University)
Chih-Cheng Hung (Southern Polytechnic State University)
Copyright © 2010 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
POS tagging
boosting
information extraction
machine learning
supervised classification
wrapper induction
Acceptance Rates
SAC '10 Paper Acceptance Rate: 364 of 1,353 submissions, 27%
Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%