DOI: 10.1145/1571941.1572041
research-article

The impact of crawl policy on web search effectiveness

Published: 19 July 2009

ABSTRACT

Crawl selection policy has a direct influence on Web search effectiveness, because a useful page that is not selected for crawling will also be absent from search results. Yet there has been little or no work on measuring this effect. We introduce an evaluation framework, based on relevance judgments pooled from multiple search engines, measuring the maximum potential NDCG that is achievable using a particular crawl. This allows us to evaluate different crawl policies and investigate important scenarios like selection stability over multiple iterations. We conduct two sets of crawling experiments at the scale of 1 billion and 100 million pages respectively. These show that crawl selection based on PageRank, indegree and trans-domain indegree all allow better retrieval effectiveness than a simple breadth-first crawl of the same size. PageRank is the most reliable and effective method. Trans-domain indegree can outperform PageRank, but over multiple crawl iterations it is less effective and more unstable. Finally, we experiment with combinations of crawl selection methods and per-domain page limits, which yield crawls with greater potential NDCG than PageRank.
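
To make the idea of "maximum potential NDCG" concrete, the following is a minimal sketch for a single query, not the authors' implementation. It assumes graded relevance gains keyed by URL, a log2 rank discount, and a cutoff `k`; the abstract specifies none of these, and the names `dcg`, `max_potential_ndcg`, `judgments`, and `crawled` are hypothetical. The best ranking a search engine could produce from a given crawl is the crawled, judged pages sorted by gain, and that is compared against the ideal ranking over all judged pages.

```python
import math


def dcg(gains, k):
    """Discounted cumulated gain of a ranked list of gains, truncated at rank k.

    Assumption: discount of log2(rank + 1), one common DCG variant.
    """
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def max_potential_ndcg(judgments, crawled, k=10):
    """Upper bound on NDCG@k for one query when only crawled URLs can be returned.

    judgments: dict mapping URL -> graded relevance gain (assumed pooled labels).
    crawled:   set of URLs selected by the crawl policy under evaluation.
    """
    # Ideal ranking: every judged page is available, best gains first.
    ideal = sorted(judgments.values(), reverse=True)
    # Best achievable ranking: only judged pages that the crawl actually selected.
    reachable = sorted(
        (gain for url, gain in judgments.items() if url in crawled),
        reverse=True,
    )
    ideal_dcg = dcg(ideal, k)
    # A query with no relevant judged pages contributes 0 by convention here.
    return dcg(reachable, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

Under these assumptions, a crawl that contains every judged page scores 1.0 for the query, and a crawl that misses the highest-gain pages scores lower regardless of how good the ranker is. Averaging the per-query values over a query set would give one number per crawl policy.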


    • Published in

      SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
      July 2009
      896 pages
      ISBN: 9781605584836
      DOI: 10.1145/1571941

      Copyright © 2009 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Acceptance Rates

      Overall Acceptance Rate: 792 of 3,983 submissions, 20%
