Abstract
We present an approach to building a highly adaptable extractor for collecting data from diverse Web sites. The approach uses a graph model to represent page content and structure, together with their various types of features. The generated graph is accompanied by a script, written in a special language called GQML, that contains the extraction rules. Running the script transforms the graph into a specified format, such as an XML file, that stores data from various Web sites in a uniform format. Experimental results show that the approach is both effective and efficient.
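The pipeline summarized in the abstract (a feature-annotated graph, a set of extraction rules, and a uniform XML output) can be illustrated with a small sketch. The abstract does not give GQML's syntax or the paper's graph schema, so everything below is an assumption made for illustration: the `PageGraph` and `Node` classes, the `contains` edge label, the feature names, and the rule format (plain Python predicates standing in for GQML rules) are all hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Node:
    """A graph node: a piece of page content plus its features (hypothetical schema)."""
    node_id: int
    text: str
    features: dict = field(default_factory=dict)


@dataclass
class PageGraph:
    """A labeled directed graph over page elements (hypothetical, not the paper's model)."""
    nodes: dict = field(default_factory=dict)  # node_id -> Node
    edges: list = field(default_factory=list)  # (source_id, target_id, label)

    def add_node(self, node_id, text, **features):
        self.nodes[node_id] = Node(node_id, text, features)

    def add_edge(self, src, dst, label):
        self.edges.append((src, dst, label))

    def children(self, src, label):
        """Return the nodes reachable from src over edges carrying the given label."""
        return [self.nodes[d] for s, d, l in self.edges if s == src and l == label]


def extract_to_xml(graph, root_id, rules):
    """Apply feature-based rules to the graph and emit uniform XML.

    `rules` is a list of (output_tag, predicate) pairs; these Python
    predicates are a stand-in for GQML rules, whose syntax is not shown
    in the abstract.
    """
    root = ET.Element("records")
    for record in graph.children(root_id, "contains"):
        record_el = ET.SubElement(root, "record")
        for field_node in graph.children(record.node_id, "contains"):
            for tag, predicate in rules:
                if predicate(field_node.features):
                    ET.SubElement(record_el, tag).text = field_node.text
                    break  # first matching rule wins
    return ET.tostring(root, encoding="unicode")


# A tiny page: one row containing a bold title and a price.
g = PageGraph()
g.add_node(0, "", kind="page")
g.add_node(1, "", kind="row")
g.add_node(2, "Graph Databases", font_weight="bold")
g.add_node(3, "$25", font_weight="normal", starts_with_currency=True)
g.add_edge(0, 1, "contains")
g.add_edge(1, 2, "contains")
g.add_edge(1, 3, "contains")

rules = [
    ("title", lambda f: f.get("font_weight") == "bold"),
    ("price", lambda f: f.get("starts_with_currency", False)),
]

xml_out = extract_to_xml(g, 0, rules)
print(xml_out)
```

Because the rules select nodes by their features rather than by fixed positions in the markup, adapting this sketch to a differently structured site would mean changing only the rule list, which is the kind of adaptability the abstract describes.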
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Guo, Q., Zhou, L., Zhang, Z., Feng, J. (2004). A Highly Adaptable Web Information Extractor Using Graph Data Model. In: Yu, J.X., Lin, X., Lu, H., Zhang, Y. (eds) Advanced Web Technologies and Applications. APWeb 2004. Lecture Notes in Computer Science, vol 3007. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24655-8_105
DOI: https://doi.org/10.1007/978-3-540-24655-8_105
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21371-0
Online ISBN: 978-3-540-24655-8
eBook Packages: Springer Book Archive