research-article

A Random Walk Model for Optimization of Search Impact in Web Frontier Ranking

Authors:

B. Barla Cambazoglu,

Wolfgang Nejdl

SIGIR '15: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval

Pages 153 - 162

https://doi.org/10.1145/2766462.2767737

Published: 09 August 2015

Abstract

Large-scale web search engines need to crawl the Web continuously to discover and download newly created web content. The speed at which the new content is discovered and the quality of the discovered content can have a big impact on the coverage and quality of the results provided by the search engine. In this paper, we propose a search-centric solution to the problem of prioritizing the pages in the frontier of a crawler for download. Our approach essentially orders the web pages in the frontier through a random walk model that takes into account the pages' potential impact on user-perceived search quality. In addition, we propose a link graph enrichment technique that extends this solution. Finally, we explore a machine learning approach that combines different frontier prioritization approaches. We conduct experiments using two very large, real-life web datasets to observe various search quality metrics. Comparisons with several baseline techniques indicate that the proposed approaches have the potential to improve the user-perceived quality of web search results considerably.
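
The abstract does not give the details of the random walk, so the following is only an illustrative sketch under stated assumptions, not the authors' model. It shows one generic way a random walk with restart could order frontier pages: the walker restarts at crawled pages in proportion to an assumed per-page search impact score (for example, a click count), and uncrawled pages are ranked by the probability mass that reaches them. The function name `rank_frontier`, the `impact` input, and the damping value are all hypothetical choices made for this example.

```python
from typing import Dict, List


def rank_frontier(
    out_links: Dict[str, List[str]],
    impact: Dict[str, float],
    damping: float = 0.85,
    iterations: int = 50,
) -> List[str]:
    """Order uncrawled (frontier) pages by a restart-biased random walk score.

    out_links maps each crawled page to the URLs it links to.
    impact is an assumed, hypothetical search impact score per crawled page.
    """
    crawled = set(out_links)
    nodes = crawled | {t for targets in out_links.values() for t in targets}
    # Frontier pages are link targets that have not been crawled yet.
    frontier = nodes - crawled

    total_impact = sum(impact.get(p, 0.0) for p in crawled)
    if total_impact <= 0.0:
        raise ValueError("at least one crawled page needs a positive impact")

    # Restart distribution: proportional to the impact of crawled pages.
    teleport = {
        p: (impact.get(p, 0.0) / total_impact if p in crawled else 0.0)
        for p in nodes
    }

    score = dict(teleport)
    for _ in range(iterations):
        nxt = {p: 0.0 for p in nodes}
        dangling = 0.0
        for p in nodes:
            targets = out_links.get(p, [])
            if targets:
                # Follow an out-link chosen uniformly at random.
                share = damping * score[p] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Frontier pages have no known out-links yet.
                dangling += damping * score[p]
        # Restart mass and dangling mass re-enter through the teleport vector.
        restart = (1.0 - damping) + dangling
        for p in nodes:
            nxt[p] += restart * teleport[p]
        score = nxt

    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(frontier, key=lambda p: (-score[p], p))


# Toy graph: "a" and "b" are crawled, "x" and "y" are in the frontier.
toy_links = {"a": ["x", "y"], "b": ["y"]}
toy_impact = {"a": 1.0, "b": 3.0}
order = rank_frontier(toy_links, toy_impact)
print(order)
```

In this toy graph, `y` is linked from both crawled pages, including the higher-impact page `b`, so it receives more probability mass than `x` and is scheduled first.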

Index Terms

1. Information systems
  1. Information retrieval

Published In

SIGIR '15: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval

August 2015

1198 pages

ISBN:9781450336215

DOI:10.1145/2766462

General Chair: Ricardo Baeza-Yates (Yahoo Labs, USA)

Program Chairs: Mounia Lalmas (Yahoo Labs, UK), Alistair Moffat (University of Melbourne, Australia), Berthier Ribeiro-Neto (Google, Brazil, and UFMG, Brazil)

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

LEADS project funded by the European Community
ERC Advanced Grant ALEXANDRIA

Conference

SIGIR '15: The 38th International ACM SIGIR Conference on Research and Development in Information Retrieval

August 9 - 13, 2015

Santiago, Chile

Acceptance Rates

SIGIR '15 Paper Acceptance Rate 70 of 351 submissions, 20%

Overall Acceptance Rate 792 of 3,983 submissions, 20%
