research-article

The impact of crawl policy on web search effectiveness

Authors:
Dennis Fetterly (Microsoft Research, Mountain View, CA, USA)
Nick Craswell (Microsoft Research, Cambridge, United Kingdom)
Vishwa Vinay (Microsoft Research, Cambridge, United Kingdom)

SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
July 2009
Pages 580–587
https://doi.org/10.1145/1571941.1572041

Published: 19 July 2009

ABSTRACT

Crawl selection policy has a direct influence on Web search effectiveness, because a useful page that is not selected for crawling will also be absent from search results. Yet there has been little or no work on measuring this effect. We introduce an evaluation framework, based on relevance judgments pooled from multiple search engines, measuring the maximum potential NDCG that is achievable using a particular crawl. This allows us to evaluate different crawl policies and investigate important scenarios like selection stability over multiple iterations. We conduct two sets of crawling experiments at the scale of 1 billion and 100 million pages respectively. These show that crawl selection based on PageRank, indegree and trans-domain indegree all allow better retrieval effectiveness than a simple breadth-first crawl of the same size. PageRank is the most reliable and effective method. Trans-domain indegree can outperform PageRank, but over multiple crawl iterations it is less effective and more unstable. Finally, we experiment with combinations of crawl selection methods and per-domain page limits, which yield crawls with greater potential NDCG than PageRank.
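
The idea of a "maximum potential NDCG" for a crawl can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact formulation, so the gain function (2^rel - 1), the log2 discount, the cutoff k = 10, and the function names `dcg`, `max_potential_ndcg`, and `mean_max_potential_ndcg` are all assumptions made for illustration. The sketch assumes that, for each query, a pool of judged URLs is available and that a perfect ranker could order the crawled subset by relevance.

```python
import math
from typing import Dict, Iterable, Set


def dcg(gains: Iterable[float], k: int) -> float:
    """Discounted cumulative gain over the first k gains (log2 discount, assumed)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(list(gains)[:k]))


def max_potential_ndcg(judgments: Dict[str, int], crawled: Set[str], k: int = 10) -> float:
    """Upper bound on NDCG@k for one query, given which judged URLs were crawled.

    judgments maps URL -> graded relevance label from the judgment pool.
    The bound assumes a perfect ranker over the crawled, judged URLs.
    """
    def gain(rel: int) -> float:
        return 2 ** rel - 1  # assumed gain function, not taken from the paper

    # Ideal ranking: every judged URL is available, sorted by gain.
    ideal = sorted((gain(r) for r in judgments.values()), reverse=True)
    # Best achievable ranking: only judged URLs that the crawl selected.
    reachable = sorted(
        (gain(r) for url, r in judgments.items() if url in crawled), reverse=True
    )
    ideal_dcg = dcg(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents were judged for this query
    return dcg(reachable, k) / ideal_dcg


def mean_max_potential_ndcg(
    pool: Dict[str, Dict[str, int]], crawled: Set[str], k: int = 10
) -> float:
    """Average the per-query upper bound over all queries in the pool."""
    scores = [max_potential_ndcg(j, crawled, k) for j in pool.values()]
    return sum(scores) / len(scores) if scores else 0.0
```

Under these assumptions, two crawls of the same size can be compared by passing their URL sets as `crawled` against the same judgment pool: the crawl that retains more of the highly judged URLs receives the higher score, regardless of the ranking function that is later applied to it.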

Index Terms

1. Information systems
  1. Information retrieval

Published in
SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
July 2009
896 pages
ISBN: 9781605584836
DOI: 10.1145/1571941
General Chairs: James Allan (University of Massachusetts Amherst, USA), Javed Aslam (Northeastern University, USA)
Program Chairs: Mark Sanderson (University of Sheffield, UK), ChengXiang Zhai (University of Illinois at Urbana-Champaign, USA), Justin Zobel (University of Melbourne, Australia)
Copyright © 2009 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
corpus selection
crawl ordering
web crawling
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 792 of 3,983 submissions, 20%