On the design of a learning crawler for topical resource discovery

Authors:
Charu C. Aggarwal

IBM T. J. Watson Research Center, Yorktown Heights, NY

Fatima Al-Garawi

Columbia University, New York, NY

Philip S. Yu

IBM T. J. Watson Research Center, Yorktown Heights, NY


ACM Transactions on Information Systems, Volume 19, Issue 3, pp. 286–309. https://doi.org/10.1145/502115.502119

Published: 01 July 2001

Abstract

In recent years, the World Wide Web has grown enormously in size, and vast repositories of information are now available on practically every topic. In this setting, effective topical resource discovery is valuable. Several new ideas have consequently been proposed; a key technique among them is focused crawling, which crawls particular topical portions of the World Wide Web quickly without having to explore all web pages. In this paper, we propose the novel concept of intelligent crawling, which learns characteristics of the linkage structure of the World Wide Web while performing the crawl. Specifically, the intelligent crawler uses the content of inlinking web pages, the structure of the candidate URL, or other behaviors of the inlinking web pages or siblings to estimate the probability that a candidate is useful for a given crawl. This is a much more general framework than focused crawling, which is based on a predefined understanding of the topical structure of the web. The techniques discussed in this paper are applicable to crawling web pages that satisfy arbitrary user-defined predicates such as topical queries, keyword queries, or any combination of these. Unlike focused crawling, it is not necessary to provide representative topical examples, since the crawler can learn its way into the appropriate topic. We refer to this technique as intelligent crawling because of its adaptive nature in adjusting to the web page linkage structure. We discuss how to intelligently select the features that are most useful for a given crawl. The learning crawler is capable of reusing the knowledge gained in a given crawl to provide more efficient crawling for closely related predicates.
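
To make the idea of learning while crawling concrete, the following is a minimal Python sketch. It is not the authors' implementation. The abstract does not give a scoring formula, so the sketch assumes a simple one: each feature of a candidate (its URL tokens, and whether an inlinking page satisfied the predicate) is scored by the smoothed ratio P(predicate | feature) / P(predicate), and a candidate's priority is the mean of these ratios. The names `LearningCrawler`, `features`, and `TOY_PAGES` are hypothetical, and an in-memory dictionary stands in for the web so that the example runs without network access.

```python
import re
from collections import defaultdict
from fractions import Fraction


def features(url, inlink_hit):
    """Hypothetical feature set: URL tokens plus whether an inlinking page matched."""
    feats = {"tok:" + t for t in re.split(r"[^a-z0-9]+", url.lower()) if t}
    feats.add("inlink:hit" if inlink_hit else "inlink:miss")
    return feats


class LearningCrawler:
    def __init__(self, pages, predicate):
        self.pages = pages          # url -> (text, outlinks); toy stand-in for the web
        self.predicate = predicate  # text -> bool; the user-defined predicate
        self.total = 0              # pages fetched so far
        self.hits = 0               # fetched pages that satisfied the predicate
        self.feat_total = defaultdict(int)
        self.feat_hits = defaultdict(int)

    def score(self, url, inlink_hit):
        """Mean of smoothed ratios P(hit | feature) / P(hit) (assumed formula)."""
        prior = Fraction(self.hits + 1, self.total + 2)
        ratios = [
            Fraction(self.feat_hits[f] + 1, self.feat_total[f] + 2) / prior
            for f in features(url, inlink_hit)
        ]
        return sum(ratios) / len(ratios)

    def crawl(self, seeds, budget):
        frontier = {url: False for url in seeds}  # url -> did an inlinking page match?
        visited = []
        while frontier and len(visited) < budget:
            # Re-score the whole frontier with the statistics learned so far;
            # ties are broken alphabetically so the run is deterministic.
            url = min(frontier, key=lambda u: (-self.score(u, frontier[u]), u))
            inlink_hit = frontier.pop(url)
            text, links = self.pages.get(url, ("", []))
            hit = bool(self.predicate(text))
            visited.append(url)
            # Update the learned statistics with the outcome of this fetch.
            self.total += 1
            self.hits += hit
            for f in features(url, inlink_hit):
                self.feat_total[f] += 1
                self.feat_hits[f] += hit
            for link in links:
                if link not in visited:
                    frontier[link] = frontier.get(link, False) or hit
        return visited


# Toy data, invented purely for illustration.
TOY_PAGES = {
    "site.com/index": ("welcome", [
        "site.com/news/politics", "site.com/sports/football",
        "site.com/news/weather", "site.com/sports/tennis",
    ]),
    "site.com/news/politics": ("election politics", []),
    "site.com/news/weather": ("rain forecast", []),
    "site.com/sports/football": ("sports football scores", ["site.com/sports/golf"]),
    "site.com/sports/golf": ("sports golf", []),
    "site.com/sports/tennis": ("sports tennis", []),
}

if __name__ == "__main__":
    crawler = LearningCrawler(TOY_PAGES, lambda text: "sports" in text)
    print(crawler.crawl(["site.com/index"], budget=5))
```

On this toy data the crawler starts with no knowledge and picks alphabetically among tied candidates, so its second fetch is a miss. Once it has fetched one matching page under `/sports/`, the learned statistics steer the remaining budget toward the other `/sports/` URLs. Exact rational arithmetic (`Fraction`) is used only to keep tie-breaking deterministic in the sketch.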

Index Terms

1. Information systems
  1. Information systems applications
    1. Data mining

Published in

ACM Transactions on Information Systems Volume 19, Issue 3
July 2001
119 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/502115

Copyright © 2001 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Crawling
World Wide Web