ABSTRACT
Information Extraction (IE) performs two important tasks: identifying specific pieces of information in documents and storing them for future use. This work proposes an adaptive IE system based on Boosted Wrapper Induction (BWI), a supervised wrapper induction algorithm. Some authors have shown, however, that boosting techniques face difficulties when processing natural language texts. This finding motivated coupling Part-of-Speech (POS) tagging with the BWI algorithm in our proposed system. To evaluate its performance, several experiments were carried out on three standard corpora. The results suggest that combining POS tagging with BWI yields a small performance gain of 3 to 5% over the original BWI algorithm on unstructured texts. According to a comparison presented and discussed in the article, these results place our system among the best comparable IE systems that use POS tagging.
Index Terms
- An adaptive information extraction system based on wrapper induction with POS tagging