Abstract
In recent years, the World Wide Web has grown enormously in size, and vast repositories of information are now available on practically every topic. At this scale, effective topical resource discovery is valuable. Several new ideas have consequently been proposed; a key technique among them is focused crawling, which crawls particular topical portions of the World Wide Web quickly without having to explore all web pages. In this paper, we propose the novel concept of intelligent crawling, which learns characteristics of the linkage structure of the World Wide Web while performing the crawl. Specifically, the intelligent crawler uses the content of the inlinking web pages, the structure of the candidate URL, or other behaviors of the inlinking web pages or siblings to estimate the probability that a candidate is useful for a given crawl. This is a much more general framework than focused crawling, which is based on a pre-defined understanding of the topical structure of the web. The techniques discussed in this paper apply to crawling web pages that satisfy arbitrary user-defined predicates such as topical queries, keyword queries, or any combination of these. Unlike focused crawling, it is not necessary to provide representative topical examples, since the crawler can learn its way into the appropriate topic. We refer to this technique as intelligent crawling because of its adaptive nature in adjusting to the web page linkage structure. We discuss how to intelligently select the features that are most useful for a given crawl. The learning crawler can also reuse the knowledge gained in a given crawl to crawl more efficiently for closely related predicates.
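The idea of ranking candidate URLs by a probability learned during the crawl can be illustrated with a minimal sketch. It is not the paper's algorithm: the toy link graph `TOY_WEB`, the class `LearningCrawler`, the choice of URL tokens and the parent page's predicate outcome as features, and the scoring rule (an average of Laplace-smoothed hit rates) are all assumptions made for illustration.

```python
import heapq
import itertools
import re
from collections import defaultdict

# Hypothetical toy web: URL -> (page text, outgoing links). Stands in for real fetching.
TOY_WEB = {
    "http://a.example/index": ("welcome portal",
                               ["http://a.example/sports/1", "http://a.example/news/1"]),
    "http://a.example/sports/1": ("football scores",
                                  ["http://a.example/sports/2", "http://a.example/news/2"]),
    "http://a.example/sports/2": ("football transfer", ["http://a.example/sports/3"]),
    "http://a.example/sports/3": ("football fixtures", []),
    "http://a.example/news/1": ("election results", ["http://a.example/news/2"]),
    "http://a.example/news/2": ("weather report", []),
}


def url_tokens(url):
    """Alphabetic tokens of a URL, sorted so that scoring is deterministic."""
    return sorted(set(re.findall(r"[a-z]+", url.lower())))


class LearningCrawler:
    """Best-first crawler whose priorities are learned from pages crawled so far."""

    def __init__(self, web, predicate):
        self.web = web
        self.predicate = predicate        # user-defined: (url, text) -> bool
        self.seen = defaultdict(int)      # feature -> crawled pages carrying it
        self.hits = defaultdict(int)      # feature -> those that satisfied the predicate

    def _rate(self, feature):
        # Laplace-smoothed fraction of pages with this feature that satisfied the predicate.
        return (self.hits[feature] + 1) / (self.seen[feature] + 2)

    def _features(self, url, parent_ok):
        # URL-structure features plus one inlink feature (did the parent satisfy the predicate?).
        return [("tok", t) for t in url_tokens(url)], ("parent", parent_ok)

    def score(self, url, parent_ok):
        """Estimated usefulness of a candidate (assumed rule: average of two estimates)."""
        tok_feats, parent_feat = self._features(url, parent_ok)
        rates = [self._rate(f) for f in tok_feats]
        url_estimate = sum(rates) / len(rates) if rates else 0.5
        return (url_estimate + self._rate(parent_feat)) / 2

    def _learn(self, url, parent_ok, ok):
        tok_feats, parent_feat = self._features(url, parent_ok)
        for feature in tok_feats + [parent_feat]:
            self.seen[feature] += 1
            self.hits[feature] += int(ok)

    def crawl(self, seeds, budget):
        tie = itertools.count()           # tie-breaker keeps heap entries comparable
        frontier = [(-0.5, next(tie), url, None) for url in seeds]
        heapq.heapify(frontier)
        visited, order = set(), []
        while frontier and len(order) < budget:
            _, _, url, parent_ok = heapq.heappop(frontier)
            if url in visited or url not in self.web:
                continue
            visited.add(url)
            text, links = self.web[url]
            ok = self.predicate(url, text)
            self._learn(url, parent_ok, ok)   # update statistics before scoring children
            order.append((url, ok))
            for link in links:
                if link not in visited:
                    heapq.heappush(frontier, (-self.score(link, ok), next(tie), link, ok))
        return order


if __name__ == "__main__":
    crawler = LearningCrawler(TOY_WEB, lambda url, text: "football" in text)
    for url, ok in crawler.crawl(["http://a.example/index"], budget=4):
        print(url, ok)
```

In this sketch, priorities are fixed when a link is pushed onto the frontier; a crawler that re-scores the frontier periodically would make fuller use of what it learns later in the crawl. No topical examples are supplied up front: the only input besides the seed is the predicate itself.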
Index Terms
- On the design of a learning crawler for topical resource discovery