PIES: A Web Information Extraction System Using Ontology and Tag Patterns

Park, Byung-Kwon; Han, Hyoil; Song, Il-Yeol

doi:10.1007/11563952_65

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3739)

Abstract

We propose PIES, a new web information extraction system that converts web information into XML documents. PIES uses a user-specified ontology and HTML tag pattern descriptions: the pattern descriptions extract information from web pages, and the ontology validates what they extract. We designed a new language to describe HTML tag patterns and extraction rules. We implemented PIES and applied it to the US patent web site for evaluation.
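
The general idea in the abstract (tag patterns extract candidate values, an ontology validates them, and the result is written as XML) can be illustrated with a minimal Python sketch. This is not the PIES pattern language or its implementation, which the abstract does not show. The names TAG_PATTERNS, ONTOLOGY, extract and to_xml, the regular expressions, and the sample HTML are all hypothetical and chosen only for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical tag patterns: concept name -> regex over raw HTML.
# PIES defines its own pattern language; plain regexes stand in for it here.
TAG_PATTERNS = {
    "patent_number": re.compile(r'<td class="number">(.*?)</td>', re.S),
    "title": re.compile(r"<title>(.*?)</title>", re.S),
    "issue_year": re.compile(r'<td class="year">(.*?)</td>', re.S),
}

# Hypothetical ontology: concept name -> rule that a valid value must satisfy.
ONTOLOGY = {
    "patent_number": re.compile(r"\d{1,3}(?:,\d{3})*"),
    "title": re.compile(r".+"),
    "issue_year": re.compile(r"\d{4}"),
}


def extract(html: str) -> ET.Element:
    """Extract values with the tag patterns and keep only those the ontology accepts."""
    root = ET.Element("patent")
    for concept, pattern in TAG_PATTERNS.items():
        match = pattern.search(html)
        if match is None:
            continue  # pattern did not occur on this page
        value = match.group(1).strip()
        if ONTOLOGY[concept].fullmatch(value):
            ET.SubElement(root, concept).text = value
    return root


def to_xml(html: str) -> str:
    """Serialize the validated extraction result as an XML string."""
    return ET.tostring(extract(html), encoding="unicode")


if __name__ == "__main__":
    sample = (
        "<html><title>Widget</title>"
        '<td class="number">6,000,000</td>'
        '<td class="year">abcd</td></html>'
    )
    print(to_xml(sample))
```

In this sketch the year value "abcd" is extracted by its tag pattern but rejected by the ontology rule, so it does not appear in the XML output.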

Author information

Authors and Affiliations

Dong-A University, Busan, Korea
Byung-Kwon Park
Drexel University, Philadelphia, PA, 19104, USA
Hyoil Han & Il-Yeol Song

Editor information

Editors and Affiliations

University of Edinburgh & Bell Laboratories
Wenfei Fan
College of Computer Science, Zhejiang University, 310027, Hangzhou, Zhejiang, China
Zhaohui Wu
Dept. of E. I. E, Huazhong University of Science and Technology, Wuhan, China
Jun Yang

About this paper

Cite this paper

Park, BK., Han, H., Song, IY. (2005). PIES: A Web Information Extraction System Using Ontology and Tag Patterns. In: Fan, W., Wu, Z., Yang, J. (eds) Advances in Web-Age Information Management. WAIM 2005. Lecture Notes in Computer Science, vol 3739. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11563952_65

DOI: https://doi.org/10.1007/11563952_65
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29227-2
Online ISBN: 978-3-540-32087-6
eBook Packages: Computer Science, Computer Science (R0)
