Crawling Deep Web Using a New Set Covering Algorithm

Wang, Yan; Lu, Jianguo; Chen, Jessica

doi:10.1007/978-3-642-03348-3_32

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5678)

Included in the following conference series:

International Conference on Advanced Data Mining and Applications

Abstract

Crawling the deep web often requires selecting an appropriate set of queries that together cover most of the documents in the data source at low cost. This can be modeled as a set covering problem, which has been extensively studied. Conventional set covering algorithms, however, do not work well when applied to deep web crawling because of several special features of this application domain. Typically, most set covering algorithms assume a uniform distribution of the elements being covered, while in deep web crawling neither the sizes of the documents nor the document frequencies of the queries are distributed uniformly. Instead, they follow a power-law distribution. Hence, we have developed a new set covering algorithm that targets deep web crawling. Unlike our previous deep web crawling method, which uses a straightforward greedy set covering algorithm, it introduces weights into the greedy strategy. Our experiments, carried out on a variety of corpora, show that this new method consistently outperforms its unweighted version.
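
The idea of adding weights to a greedy set cover can be illustrated with a minimal Python sketch. The abstract does not state the weighting scheme, so everything specific here is an assumption made for illustration: each document is weighted by the reciprocal of the number of candidate queries that match it (rarely matched documents count more), each query's cost is taken to be its document frequency, and the names `weighted_greedy_cover`, `query_docs`, and `query_cost` are hypothetical. It should not be read as the authors' algorithm.

```python
from typing import Dict, List, Set


def weighted_greedy_cover(
    query_docs: Dict[str, Set[int]],
    query_cost: Dict[str, float],
) -> List[str]:
    """Greedily choose queries until every matchable document is covered."""
    # Assumed weighting (not taken from the paper): a document matched by
    # k candidate queries gets weight 1 / k, so hard-to-reach documents
    # contribute more to a query's score.
    degree: Dict[int, int] = {}
    for docs in query_docs.values():
        for doc in docs:
            degree[doc] = degree.get(doc, 0) + 1
    weight = {doc: 1.0 / k for doc, k in degree.items()}

    uncovered = set(degree)
    chosen: List[str] = []
    while uncovered:
        best_query, best_score = None, 0.0
        for query, docs in query_docs.items():
            # Weighted gain of the still-uncovered documents, per unit cost.
            gain = sum(weight[doc] for doc in docs & uncovered)
            score = gain / query_cost[query]
            if score > best_score:
                best_query, best_score = query, score
        chosen.append(best_query)
        uncovered -= query_docs[best_query]
    return chosen


# Toy data: four queries over five documents; cost = document frequency.
query_docs = {
    "a": {1, 2, 3},
    "b": {3, 4},
    "c": {4, 5},
    "d": {1, 2, 3, 4},
}
query_cost = {q: float(len(docs)) for q, docs in query_docs.items()}
selected = weighted_greedy_cover(query_docs, query_cost)
total_cost = sum(query_cost[q] for q in selected)
```

In this toy input, document 5 is matched only by query `c`, so its weight of 1 pulls `c` to the front of the selection; an unweighted greedy step that only counts newly covered documents would rank `c` no higher than `b`.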

Author information

Authors and Affiliations

School of Computer Science, University of Windsor, N9B 3P4, Windsor, Ont., Canada
Yan Wang, Jianguo Lu & Jessica Chen
Key Lab of Novel Software Technology, Nanjing, China
Jianguo Lu

Editor information

Editors and Affiliations

Knowledge Science & Engineering Institute, School of Education Technology, Beijing Normal University, Xinjiekouwai Ave. 19, 100875, Beijing, China
Ronghuai Huang
The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
Qiang Yang
School of Computing Science, Simon Fraser University, 8888 University Drive, V5A 1S6, Burnaby, BC, Canada
Jian Pei
Faculty of Economics, University of Porto, Rua Dr. Roberto Frias, 4200-465, Porto, Portugal
João Gama
School of Information, Zhongguancun, Renmin University, 100872, Beijing, China
Xiaofeng Meng
School of Information Technology and Electrical Engineering, The University of Queensland, 4072, St. Lucia, Queensland, Australia
Xue Li

About this paper

Cite this paper

Wang, Y., Lu, J., Chen, J. (2009). Crawling Deep Web Using a New Set Covering Algorithm. In: Huang, R., Yang, Q., Pei, J., Gama, J., Meng, X., Li, X. (eds) Advanced Data Mining and Applications. ADMA 2009. Lecture Notes in Computer Science, vol 5678. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03348-3_32

DOI: https://doi.org/10.1007/978-3-642-03348-3_32
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03347-6
Online ISBN: 978-3-642-03348-3
eBook Packages: Computer Science, Computer Science (R0)
