Abstract
Crawling the deep web often requires selecting an appropriate set of queries that covers most of the documents in the data source at low cost. This can be modeled as a set covering problem, which has been studied extensively. Conventional set covering algorithms, however, do not work well for deep web crawling because of special features of this application domain. Typically, most set covering algorithms assume that the elements being covered are uniformly distributed, whereas in deep web crawling neither the document sizes nor the document frequencies of the queries are uniformly distributed. Instead, they follow a power-law distribution. Hence, we have developed a new set covering algorithm that targets deep web crawling. Compared with our previous deep web crawling method, which uses a straightforward greedy set covering algorithm, it introduces weights into the greedy strategy. Our experiments, carried out on a variety of corpora, show that the new method consistently outperforms its unweighted version.
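The contrast between a plain greedy set cover and a weighted one can be illustrated with a minimal Python sketch. The abstract does not state the paper's weighting scheme, so everything specific here is an assumption made for illustration: the function names `greedy_cover` and `total_cost`, the cost of a query (taken to be the number of documents it returns), and the document weight (taken to be the inverse of the number of candidate queries that match the document, so rarely matched documents count for more).

```python
from collections import defaultdict


def greedy_cover(queries, weighted=False):
    """Select queries until every document is covered.

    queries: dict mapping a query term to the set of document ids it matches.
    weighted: if True, use the hypothetical inverse-frequency document weights
              described above; if False, every document has weight 1.
    """
    universe = set().union(*queries.values())

    # Count how many candidate queries match each document.
    match_count = defaultdict(int)
    for docs in queries.values():
        for doc in docs:
            match_count[doc] += 1

    # Assumed weighting: documents matched by few queries weigh more.
    weight = {d: (1.0 / match_count[d] if weighted else 1.0) for d in universe}

    uncovered = set(universe)
    selected = []
    while uncovered:
        def score(q):
            new_docs = queries[q] & uncovered
            if not new_docs:
                return 0.0
            # Assumed cost: the number of documents the query returns.
            return sum(weight[d] for d in new_docs) / len(queries[q])

        # Sorting makes tie-breaking deterministic (first term in sorted order wins).
        best = max(sorted(queries), key=score)
        selected.append(best)
        uncovered -= queries[best]
    return selected


def total_cost(queries, selected):
    """Total number of documents returned by the selected queries."""
    return sum(len(queries[q]) for q in selected)


# Toy data source: four query terms over five documents.
example = {
    "a": {1, 2, 3},
    "b": {3, 4},
    "c": {4, 5},
    "d": {1, 2},
}
plain = greedy_cover(example, weighted=False)
with_weights = greedy_cover(example, weighted=True)
```

On this toy input both variants select the same two queries at the same cost and differ only in selection order, so it shows the mechanics rather than the improvement reported in the abstract.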
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wang, Y., Lu, J., Chen, J. (2009). Crawling Deep Web Using a New Set Covering Algorithm. In: Huang, R., Yang, Q., Pei, J., Gama, J., Meng, X., Li, X. (eds) Advanced Data Mining and Applications. ADMA 2009. Lecture Notes in Computer Science, vol 5678. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03348-3_32
DOI: https://doi.org/10.1007/978-3-642-03348-3_32
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03347-6
Online ISBN: 978-3-642-03348-3
eBook Packages: Computer Science (R0)