PIES: A Web Information Extraction System Using Ontology and Tag Patterns

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3739)

Abstract

We propose a new web information extraction system, PIES, to convert web information into XML documents. PIES uses a user-specified ontology and HTML tag pattern descriptions. The ontology validates the web information the pattern descriptions extract. We designed a new language to describe HTML tag patterns and extraction rules. We implemented PIES and applied it to the US patent web site for evaluation.
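
The pipeline the abstract describes (tag patterns extract values, the ontology validates them, and the result is emitted as XML) can be illustrated with a minimal sketch. This is not the PIES pattern language or its implementation, neither of which appears here. The sample HTML, the concept names, the `label_pattern` helper, and the use of regular expressions for both the patterns and the ontology constraints are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Made-up page fragment; it does not reproduce any real patent page layout.
SAMPLE_HTML = """
<table>
  <tr><td><b>Patent No.</b></td><td>6,000,000</td></tr>
  <tr><td><b>Title</b></td><td>Example widget</td></tr>
  <tr><td><b>Filed</b></td><td>not-a-date</td></tr>
</table>
"""


def label_pattern(label):
    """Hypothetical tag pattern: a bold label cell followed by a value cell."""
    return re.compile(
        r"<td><b>" + re.escape(label) + r"</b></td>\s*<td>(.*?)</td>", re.S
    )


# Hypothetical tag patterns: concept name -> pattern with one capture group.
TAG_PATTERNS = {
    "patent_number": label_pattern("Patent No."),
    "title": label_pattern("Title"),
    "filing_date": label_pattern("Filed"),
}

# Hypothetical ontology: concept name -> constraint the extracted value must satisfy.
ONTOLOGY = {
    "patent_number": re.compile(r"\d{1,3}(?:,\d{3})*"),
    "title": re.compile(r".+"),
    "filing_date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def extract(html):
    """Extract each concept with its tag pattern and keep only validated values."""
    root = ET.Element("patent")
    rejected = []
    for concept, pattern in TAG_PATTERNS.items():
        match = pattern.search(html)
        if match is None:
            continue  # pattern did not apply to this page
        value = match.group(1).strip()
        if ONTOLOGY[concept].fullmatch(value):
            ET.SubElement(root, concept).text = value
        else:
            rejected.append(concept)  # extracted, but fails the ontology check
    return root, rejected


root, rejected = extract(SAMPLE_HTML)
xml_text = ET.tostring(root, encoding="unicode")
```

In this sketch the `filing_date` value is extracted by its pattern but rejected by the ontology check, so it is left out of the XML and reported in `rejected`. Keeping extraction and validation as separate tables mirrors the division of labor the abstract describes, though a real system would likely use a richer ontology than per-field regular expressions.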

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Park, BK., Han, H., Song, IY. (2005). PIES: A Web Information Extraction System Using Ontology and Tag Patterns. In: Fan, W., Wu, Z., Yang, J. (eds) Advances in Web-Age Information Management. WAIM 2005. Lecture Notes in Computer Science, vol 3739. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11563952_65

  • DOI: https://doi.org/10.1007/11563952_65

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-29227-2

  • Online ISBN: 978-3-540-32087-6

  • eBook Packages: Computer Science, Computer Science (R0)
