A Highly Adaptable Web Information Extractor Using Graph Data Model

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3007)

Abstract

We present an approach to building a highly adaptable extractor for collecting data from diverse Web sites. The approach uses a graph model to represent content and structure, together with their various types of features. The generated graph is accompanied by a script, written in a special language called GQML, that contains the extraction rules. Running the script transforms the graph into a specified format, such as an XML file, that stores data from various Web sites in a uniform way. Experimental results show that the presented approach is both effective and efficient.
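
The pipeline described in the abstract (page graph, extraction rules, uniform XML output) can be illustrated with a minimal Python sketch. This is not GQML, whose syntax is not given here, and it is not the authors' implementation. The `Node` and `Graph` classes, the `contains` edge label, the feature names `tag` and `style`, and the predicate-based rules are all hypothetical, chosen only to show how rules applied to a feature-annotated graph can yield uniform XML.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Node:
    """A graph node: a piece of page content plus arbitrary features (hypothetical)."""
    node_id: str
    text: str = ""
    features: dict = field(default_factory=dict)


@dataclass
class Graph:
    """A labeled directed graph standing in for the paper's graph model."""
    nodes: dict = field(default_factory=dict)  # node_id -> Node
    edges: list = field(default_factory=list)  # (source_id, label, target_id)

    def add_node(self, node_id, text="", **features):
        self.nodes[node_id] = Node(node_id, text, features)

    def add_edge(self, source, label, target):
        self.edges.append((source, label, target))

    def children(self, source, label):
        """Return target nodes of edges leaving `source` with the given label."""
        return [self.nodes[t] for s, lbl, t in self.edges
                if s == source and lbl == label]


def extract_to_xml(graph, root_id, rules):
    """Apply (field_name, predicate) rules to the items of each record node.

    The first rule whose predicate accepts an item decides its XML tag;
    items matched by no rule are skipped.
    """
    root = ET.Element("records")
    for record in graph.children(root_id, "contains"):
        rec_el = ET.SubElement(root, "record")
        for item in graph.children(record.node_id, "contains"):
            for field_name, predicate in rules:
                if predicate(item):
                    ET.SubElement(rec_el, field_name).text = item.text
                    break
    return ET.tostring(root, encoding="unicode")


# A tiny hand-built page graph (assumed data, for illustration only).
g = Graph()
g.add_node("page")
g.add_node("r1", tag="tr")
g.add_node("r1.title", "Graph Grammars", tag="td", style="bold")
g.add_node("r1.price", "29.95", tag="td", style="plain")
g.add_edge("page", "contains", "r1")
g.add_edge("r1", "contains", "r1.title")
g.add_edge("r1", "contains", "r1.price")

# Rules select nodes by feature or by content, not by position on the page.
rules = [
    ("title", lambda n: n.features.get("style") == "bold"),
    ("price", lambda n: n.text.replace(".", "", 1).isdigit()),
]

xml_out = extract_to_xml(g, "page", rules)
print(xml_out)
```

Because the rules refer to node features instead of a fixed page layout, a second site with different markup would need only a different graph builder or rule list, while the output format stays the same. That is one plausible reading of "adaptable" in the abstract, not a claim about how GQML achieves it.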

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Guo, Q., Zhou, L., Zhang, Z., Feng, J. (2004). A Highly Adaptable Web Information Extractor Using Graph Data Model. In: Yu, J.X., Lin, X., Lu, H., Zhang, Y. (eds) Advanced Web Technologies and Applications. APWeb 2004. Lecture Notes in Computer Science, vol 3007. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24655-8_105

  • DOI: https://doi.org/10.1007/978-3-540-24655-8_105

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-21371-0

  • Online ISBN: 978-3-540-24655-8

  • eBook Packages: Springer Book Archive
