On the design of a learning crawler for topical resource discovery

Published: 01 July 2001

Abstract

In recent years, the World Wide Web has grown enormously in size, and vast repositories of information are now available on practically every topic. With this much material, effective topical resource discovery becomes valuable. Consequently, several new ideas have been proposed in recent years; a key technique among them is focused crawling, which crawls particular topical portions of the World Wide Web quickly without having to explore all web pages. In this paper, we propose the novel concept of intelligent crawling, which learns characteristics of the linkage structure of the World Wide Web while performing the crawl. Specifically, the intelligent crawler uses the content of inlinking web pages, the candidate URL structure, or other behaviors of the inlinking web pages or siblings to estimate the probability that a candidate is useful for a given crawl. This is a much more general framework than focused crawling, which is based on a predefined understanding of the topical structure of the web. The techniques discussed in this paper are applicable to crawling web pages that satisfy arbitrary user-defined predicates such as topical queries, keyword queries, or any combination of these. Unlike focused crawling, it is not necessary to provide representative topical examples, since the crawler can learn its way into the appropriate topic. We refer to this technique as intelligent crawling because of its adaptive nature in adjusting to the web page linkage structure. We discuss how to intelligently select the features that are most useful for a given crawl. The learning crawler is capable of reusing the knowledge gained in a given crawl to provide more efficient crawling for closely related predicates.
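The idea of estimating a candidate's usefulness from statistics gathered during the crawl can be illustrated with a small sketch. This is not the paper's algorithm: the abstract does not specify an estimator, so the smoothed ratio below is an assumption chosen for simplicity, and only one of the listed signals (tokens in the candidate URL) is used. The names `TOY_WEB`, `url_tokens`, and `LearningCrawler` are hypothetical, and the "web" is an in-memory dictionary rather than real HTTP fetches.

```python
import re

# Hypothetical in-memory "web" used only for illustration:
# URL -> (page text, list of outgoing links).
TOY_WEB = {
    "http://site.example/": (
        "home",
        ["http://site.example/sports/a", "http://site.example/news/a"],
    ),
    "http://site.example/sports/a": (
        "football scores",
        ["http://site.example/sports/b", "http://site.example/news/b"],
    ),
    "http://site.example/news/a": ("election results", ["http://site.example/news/c"]),
    "http://site.example/sports/b": ("football transfer", []),
    "http://site.example/news/b": ("weather", []),
    "http://site.example/news/c": ("markets", []),
}


def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens, dropping the scheme."""
    tokens = re.split(r"[^a-z0-9]+", url.lower())
    return [t for t in tokens if t and t not in ("http", "https")]


class LearningCrawler:
    """Toy crawler that learns which URL tokens co-occur with predicate hits."""

    def __init__(self, web, predicate):
        self.web = web
        self.predicate = predicate  # user-defined test on page text
        self.crawled = 0            # pages fetched so far
        self.hits = 0               # fetched pages satisfying the predicate
        self.seen = {}              # token -> fetched pages whose URL contains it
        self.hit = {}               # token -> of those, pages that were hits

    def score(self, url):
        """Mean ratio of per-token hit rate to overall hit rate (assumed estimator)."""
        base = (self.hits + 1) / (self.crawled + 2)  # smoothed overall hit rate
        tokens = url_tokens(url)
        if not tokens:
            return 1.0
        total = 0.0
        for t in tokens:
            # Unseen tokens fall back to the overall rate, giving a ratio of 1.
            p = (self.hit.get(t, 0) + base) / (self.seen.get(t, 0) + 1)
            total += p / base
        return total / len(tokens)

    def crawl(self, seeds, budget):
        """Fetch up to `budget` pages, always taking the best-scoring candidate."""
        frontier = list(seeds)
        queued = set(seeds)
        visited = []
        while frontier and len(visited) < budget:
            # Scores are recomputed each step, so the ordering adapts as we learn.
            best = max(frontier, key=self.score)
            frontier.remove(best)
            text, links = self.web.get(best, ("", []))
            satisfied = self.predicate(text)
            self.crawled += 1
            self.hits += int(satisfied)
            for t in set(url_tokens(best)):
                self.seen[t] = self.seen.get(t, 0) + 1
                if satisfied:
                    self.hit[t] = self.hit.get(t, 0) + 1
            visited.append(best)
            for link in links:
                if link not in queued:
                    queued.add(link)
                    frontier.append(link)
        return visited
```

A crawl could be started with `LearningCrawler(TOY_WEB, lambda text: "football" in text).crawl(["http://site.example/"], budget=10)`. After the crawl, URLs containing tokens that appeared on matching pages (here `sports`) score higher than URLs containing tokens that did not (here `news`). Rescanning the whole frontier at every step keeps the sketch short; a real crawler would need a more scalable priority structure.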

Published in

ACM Transactions on Information Systems, Volume 19, Issue 3
July 2001
119 pages
ISSN: 1046-8188
EISSN: 1558-2868
DOI: 10.1145/502115

Copyright © 2001 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
