DOI: 10.1145/1774088.1774471
research-article

An adaptive information extraction system based on wrapper induction with POS tagging

Published: 22 March 2010

ABSTRACT

Information Extraction (IE) performs two important tasks: identifying specific pieces of information in documents and storing them for future use. This work proposes an adaptive IE system based on Boosted Wrapper Induction (BWI), a supervised wrapper induction algorithm. However, some authors have shown that boosting techniques have difficulty processing natural language texts, which motivated coupling Part-of-Speech (POS) tagging with the BWI algorithm in the proposed system. To evaluate its performance, several experiments were carried out on three standard corpora. The results suggest that combining POS tagging with BWI yields a small performance gain of 3 to 5% over the original BWI algorithm on unstructured texts. These results place our system among the best comparable IE systems that use POS tagging, according to a comparison presented and discussed in the article.

      • Published in

        SAC '10: Proceedings of the 2010 ACM Symposium on Applied Computing
        March 2010
        2712 pages
        ISBN: 9781605586397
        DOI: 10.1145/1774088

        Copyright © 2010 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        SAC '10 Paper Acceptance Rate: 364 of 1,353 submissions, 27%
        Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%
