A Highly Adaptable Web Information Extractor Using Graph Data Model

Guo, Qi; Zhou, Lizhu; Zhang, Zhiqiang; Feng, Jianhua

doi:10.1007/978-3-540-24655-8_105

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3007)

Abstract

We present an approach to building a highly adaptable extractor for collecting data from diverse Web sites. The approach uses a graph model to represent content and structures as well as their various types of features. The generated graph is accompanied by a script, written in a special language called GQML, that contains the extraction rules. Running the script transforms the graph into a specified format, such as an XML file, so that data from various Web sites is stored in a uniform format. Experimental results show that the approach is both effective and efficient.
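
The pipeline described in the abstract (page content held in a graph with features, rules applied to the graph, uniform XML as output) can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not reproduce GQML, whose syntax is not given here. The `Graph` and `Node` classes, the `contains` edge label, the feature names `tag` and `css_class`, and the `RULES` table of Python predicates are all hypothetical stand-ins chosen for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Node:
    """A graph node: one piece of page content plus its features (hypothetical)."""
    node_id: int
    text: str
    features: dict = field(default_factory=dict)


@dataclass
class Graph:
    """A labelled directed graph standing in for the paper's graph model."""
    nodes: dict = field(default_factory=dict)  # node_id -> Node
    edges: list = field(default_factory=list)  # (src_id, label, dst_id)

    def add_node(self, node_id, text, **features):
        self.nodes[node_id] = Node(node_id, text, features)

    def add_edge(self, src, label, dst):
        self.edges.append((src, label, dst))

    def children(self, src, label):
        """Return the nodes reachable from src over edges with the given label."""
        return [self.nodes[d] for s, l, d in self.edges if s == src and l == label]


# Assumption: extraction rules are modelled as Python predicates over node
# features. Real GQML rules would be written in the GQML script instead.
RULES = {
    "title": lambda n: n.features.get("tag") == "h1",
    "price": lambda n: n.features.get("css_class") == "price",
}


def extract_to_xml(graph, record_ids, rules):
    """Apply each rule to the children of every record node and emit XML."""
    root = ET.Element("records")
    for rid in record_ids:
        record = ET.SubElement(root, "record")
        for field_name, predicate in rules.items():
            for child in graph.children(rid, "contains"):
                if predicate(child):
                    ET.SubElement(record, field_name).text = child.text
    return ET.tostring(root, encoding="unicode")


# Build a tiny graph for one record taken from an imaginary page.
g = Graph()
g.add_node(0, "", tag="div")
g.add_node(1, "Example Book", tag="h1")
g.add_node(2, "29.95", tag="span", css_class="price")
g.add_edge(0, "contains", 1)
g.add_edge(0, "contains", 2)

xml_out = extract_to_xml(g, [0], RULES)
print(xml_out)
```

Keeping the rules in a table separate from the graph mirrors the abstract's split between the generated graph and the accompanying script: adapting to a new site would, under this assumption, mean changing the rules rather than the extraction code.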

Author information

Authors and Affiliations

Tsinghua University, Beijing, China
Qi Guo, Lizhu Zhou, Zhiqiang Zhang & Jianhua Feng

Editor information

Editors and Affiliations

Chinese University of Hong Kong, Hong Kong, China
Jeffrey Xu Yu
The University of New South Wales, NSW 2052, Australia
Xuemin Lin
Department of Computer Science, Tsinghua University, 100084, Beijing, P.R. China
Hongjun Lu
Victoria University, Australia
Yanchun Zhang

About this paper

Cite this paper

Guo, Q., Zhou, L., Zhang, Z., Feng, J. (2004). A Highly Adaptable Web Information Extractor Using Graph Data Model. In: Yu, J.X., Lin, X., Lu, H., Zhang, Y. (eds) Advanced Web Technologies and Applications. APWeb 2004. Lecture Notes in Computer Science, vol 3007. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24655-8_105

DOI: https://doi.org/10.1007/978-3-540-24655-8_105
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21371-0
Online ISBN: 978-3-540-24655-8
eBook Packages: Springer Book Archive
