Efficient Deep Web Crawling Using Reinforcement Learning

Jiang, Lu; Wu, Zhaohui; Feng, Qian; Liu, Jun; Zheng, Qinghua

doi:10.1007/978-3-642-13657-3_46


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6118)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining


Abstract

The deep web refers to the hidden part of the Web that remains unavailable to standard Web crawlers. Obtaining deep web content is challenging, and the deep web has been acknowledged as a significant gap in the coverage of search engines. To this end, the paper proposes a novel deep web crawling framework based on reinforcement learning, in which the crawler is regarded as an agent and the deep web database as the environment. The agent perceives its current state and selects an action (a query) to submit to the environment according to its Q-value. The framework not only enables crawlers to learn a promising crawling strategy from their own experience, but also allows them to utilize diverse features of query keywords. Experimental results show that the method outperforms state-of-the-art methods in crawling capability and breaks through the assumption of full-text search implied by existing methods.
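The agent and environment loop described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the toy database, the `crawl` function, the epsilon-greedy exploration, and in particular the Q-value estimate (the number of harvested documents containing a keyword) are all assumptions made for illustration. The sketch shows only how a crawler might treat the documents harvested so far as its state, pick the next query by estimated Q-value, and observe newly retrieved documents as its reward; learning the Q-values from experience is replaced here by a fixed heuristic.

```python
import random

# Toy "deep web database": each document is a set of keywords (hypothetical data).
TOY_DATABASE = [
    {"web", "crawler"},
    {"web", "search"},
    {"search", "engine"},
    {"engine", "index"},
    {"learning", "agent"},  # not reachable from the seed query "web"
]


def crawl(database, seed_query, max_queries, epsilon=0.1, rng=None):
    """Run a toy crawl; return (sorted harvested doc ids, reward per query)."""
    rng = rng or random.Random(0)
    harvested = {}   # state: doc id -> keyword set retrieved so far
    issued = set()   # queries (actions) already submitted
    rewards = []
    query = seed_query
    for _ in range(max_queries):
        issued.add(query)
        # Environment step: the database returns every document matching the query.
        results = {i: doc for i, doc in enumerate(database) if query in doc}
        # Reward: number of documents not seen before this query.
        rewards.append(len(results.keys() - harvested.keys()))
        harvested.update(results)
        # Candidate actions: keywords seen in harvested documents, not yet issued.
        candidates = sorted({w for doc in harvested.values() for w in doc} - issued)
        if not candidates:
            break

        def q_value(word):
            # Assumed stand-in for a Q-value: document frequency in the harvested sample.
            return sum(word in doc for doc in harvested.values())

        if rng.random() < epsilon:
            query = rng.choice(candidates)         # explore
        else:
            query = max(candidates, key=q_value)   # exploit the highest estimated Q-value
    return sorted(harvested), rewards
```

Calling `crawl(TOY_DATABASE, "web", max_queries=10, epsilon=0.0)` runs the purely greedy variant; the last document is never harvested because none of its keywords appears in any document reachable from the seed query.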



Author information

Authors and Affiliations

MOE KLINNS Lab and SKLMS Lab, Xi’an Jiaotong University, No.28, Xianning West Road, Xi’an, 710049, P.R.China
Lu Jiang, Zhaohui Wu, Qian Feng, Jun Liu & Qinghua Zheng


Editor information

Editors and Affiliations

Computer Science Department, Rensselaer Polytechnic Institute, USA
Mohammed J. Zaki
The Chinese University of Hong Kong, China
Jeffrey Xu Yu
IIT Madras, Chennai, India
B. Ravindran
IIIT, Hyderabad, India
Vikram Pudi


About this paper

Cite this paper

Jiang, L., Wu, Z., Feng, Q., Liu, J., Zheng, Q. (2010). Efficient Deep Web Crawling Using Reinforcement Learning. In: Zaki, M.J., Yu, J.X., Ravindran, B., Pudi, V. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2010. Lecture Notes in Computer Science, vol 6118. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13657-3_46

DOI: https://doi.org/10.1007/978-3-642-13657-3_46
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13656-6
Online ISBN: 978-3-642-13657-3
eBook Packages: Computer Science, Computer Science (R0)
