Incremental Structured Web Database Crawling via History Versions

Liu, Wei; Xiao, Jianguo

doi:10.1007/978-3-642-17616-6_46

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6488)

Included in the following conference series:

International Conference on Web Information Systems Engineering

Abstract

Web database crawling is one of the major solutions for Deep Web data integration. To the best of our knowledge, existing work has focused only on how to crawl all records in a web database at one time. Because web databases are highly dynamic, it is not practical to re-crawl the whole database every time in order to harvest a small proportion of new records. To this end, this paper studies the problem of incremental web database crawling, which aims to crawl as many new records from a web database as possible while minimizing the communication cost. In our approach, a new graph model is proposed, under which an incremental crawling task is transformed into a graph traversal process. Based on this graph, appropriate queries are generated for crawling by analyzing the history versions of the web database. Extensive experimental evaluations over real web databases validate the effectiveness of our techniques and provide insights for future efforts in this direction.
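
The abstract does not spell out the graph model or the query-selection rule, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes that attribute values act as graph vertices (each one a possible query), that two values are neighbors when they co-occur in a record, and that the share of new records a value matched between two history versions predicts how useful that query is now. All names (`new_record_rates`, `incremental_crawl`, `search`, `default_rate`) and the toy records are hypothetical.

```python
import heapq
from collections import Counter


def new_record_rates(old_version, new_version):
    """For each (attribute, value) query, return the fraction of matching
    records in new_version that were absent from old_version."""
    old_ids = {rec["id"] for rec in old_version}
    total, fresh = Counter(), Counter()
    for rec in new_version:
        for attr, val in rec.items():
            if attr == "id":
                continue
            total[(attr, val)] += 1
            if rec["id"] not in old_ids:
                fresh[(attr, val)] += 1
    return {query: fresh[query] / total[query] for query in total}


def incremental_crawl(search, known_ids, rates, budget, default_rate=0.5):
    """Best-first traversal over attribute values (assumed graph model).

    Queries are issued in descending order of historical new-record rate.
    Values first seen in newly harvested records join the frontier with
    default_rate (an assumed prior). budget caps the number of queries,
    standing in for communication cost.
    """
    # heapq is a min-heap, so rates are negated to pop the best query first.
    # Values are assumed to be strings so that ties compare cleanly.
    frontier = [(-rate, query) for query, rate in rates.items()]
    heapq.heapify(frontier)
    queued = set(rates)
    harvested, issued = {}, 0
    while frontier and issued < budget:
        _, query = heapq.heappop(frontier)
        issued += 1
        for rec in search(query):
            if rec["id"] in known_ids or rec["id"] in harvested:
                continue  # already crawled earlier
            harvested[rec["id"]] = rec
            for attr, val in rec.items():
                neighbor = (attr, val)
                if attr != "id" and neighbor not in queued:
                    queued.add(neighbor)
                    rate = rates.get(neighbor, default_rate)
                    heapq.heappush(frontier, (-rate, neighbor))
    return harvested, issued


# Toy data (hypothetical): two history versions and the live database.
v1 = [{"id": 1, "author": "Liu", "year": "2008"},
      {"id": 2, "author": "Meng", "year": "2008"}]
v2 = v1 + [{"id": 3, "author": "Liu", "year": "2009"},
           {"id": 4, "author": "Wang", "year": "2009"}]
live = v2 + [{"id": 5, "author": "Wang", "year": "2010"},
             {"id": 6, "author": "Chen", "year": "2010"}]


def search(query):
    """Stand-in for submitting one query to the web database interface."""
    attr, val = query
    return [rec for rec in live if rec.get(attr) == val]


rates = new_record_rates(v1, v2)
harvested, issued = incremental_crawl(
    search, known_ids={rec["id"] for rec in v2}, rates=rates, budget=4)
```

In this sketch the query budget plays the role of the communication cost, and the traversal reaches values such as the year "2010" only through a new record returned by an earlier query, which is the sense in which crawling becomes a walk over the graph.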

Author information

Authors and Affiliations

Institute of Scientific and Technical Information of China, China, 100038
Wei Liu
Institute of Computer Science & Technology, Peking University, China, 100871
Jianguo Xiao

Editor information

Editors and Affiliations

Department of Computer Science, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China
Lei Chen
University of Patras, 26504, Patras, Greece
Peter Triantafillou
Polytechnic Institute of NYU, 11201, Brooklyn, NY, USA
Torsten Suel

Cite this paper

Liu, W., Xiao, J. (2010). Incremental Structured Web Database Crawling via History Versions. In: Chen, L., Triantafillou, P., Suel, T. (eds) Web Information Systems Engineering – WISE 2010. WISE 2010. Lecture Notes in Computer Science, vol 6488. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-17616-6_46

DOI: https://doi.org/10.1007/978-3-642-17616-6_46
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-17615-9
Online ISBN: 978-3-642-17616-6
eBook Packages: Computer Science, Computer Science (R0)
