ABSTRACT
Crawl selection policy has a direct influence on Web search effectiveness, because a useful page that is not selected for crawling will also be absent from search results. Yet there has been little or no work on measuring this effect. We introduce an evaluation framework, based on relevance judgments pooled from multiple search engines, measuring the maximum potential NDCG that is achievable using a particular crawl. This allows us to evaluate different crawl policies and investigate important scenarios like selection stability over multiple iterations. We conduct two sets of crawling experiments at the scale of 1 billion and 100 million pages respectively. These show that crawl selection based on PageRank, indegree and trans-domain indegree all allow better retrieval effectiveness than a simple breadth-first crawl of the same size. PageRank is the most reliable and effective method. Trans-domain indegree can outperform PageRank, but over multiple crawl iterations it is less effective and more unstable. Finally, we experiment with combinations of crawl selection methods and per-domain page limits, which yield crawls with greater potential NDCG than PageRank.
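To make the idea of "maximum potential NDCG" concrete, the following is a minimal sketch and not the paper's actual implementation. It assumes that, for a single query, the pooled judgments are given as a mapping from URL to a numeric gain, that the discount is the common log2 form, and that the cutoff is k = 10; the function names `dcg` and `max_potential_ndcg` are hypothetical. The best ranking a given crawl could support places its judged pages in descending order of gain, and that DCG is normalized by the ideal DCG over all judged pages.

```python
import math
from typing import Dict, Iterable, Set


def dcg(gains: Iterable[float], k: int) -> float:
    """Discounted cumulative gain of the first k gains (log2 discount, assumed)."""
    top = list(gains)[:k]
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(top))


def max_potential_ndcg(judgments: Dict[str, float], crawl: Set[str], k: int = 10) -> float:
    """Upper bound on NDCG@k for one query when only crawled pages can be returned.

    judgments: URL -> gain for every judged page of the query (assumed input format).
    crawl: set of URLs selected by the crawl policy under evaluation.
    """
    # Ideal ranking over all judged pages, regardless of what was crawled.
    ideal = sorted(judgments.values(), reverse=True)
    # Best achievable ranking using only judged pages that are in the crawl.
    reachable = sorted(
        (gain for url, gain in judgments.items() if url in crawl), reverse=True
    )
    ideal_dcg = dcg(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant pages were judged for this query
    return dcg(reachable, k) / ideal_dcg


if __name__ == "__main__":
    judged = {"a": 3.0, "b": 2.0, "c": 0.0}
    print(max_potential_ndcg(judged, {"a", "b", "c"}))  # crawl holds every judged page
    print(max_potential_ndcg(judged, {"b"}))            # crawl misses the best page
```

Averaging this value over a query set would give one score per crawl, which is how two crawl policies of the same size could be compared under these assumptions.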
Index Terms
- The impact of crawl policy on web search effectiveness