A Random Walk Model for Optimization of Search Impact in Web Frontier Ranking

Published: 09 August 2015

Abstract

Large-scale web search engines need to crawl the Web continuously to discover and download newly created web content. The speed at which the new content is discovered and the quality of the discovered content can have a big impact on the coverage and quality of the results provided by the search engine. In this paper, we propose a search-centric solution to the problem of prioritizing the pages in the frontier of a crawler for download. Our approach essentially orders the web pages in the frontier through a random walk model that takes into account the pages' potential impact on user-perceived search quality. In addition, we propose a link graph enrichment technique that extends this solution. Finally, we explore a machine learning approach that combines different frontier prioritization approaches. We conduct experiments using two very large, real-life web datasets to observe various search quality metrics. Comparisons with several baseline techniques indicate that the proposed approaches have the potential to improve the user-perceived quality of web search results considerably.
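The abstract gives no formulas, so the following is only a minimal sketch of the general idea of ordering frontier pages with a biased random walk, not the authors' model. It assumes a personalized PageRank-style walk whose teleport distribution is weighted by a hypothetical per-page `impact` signal (for example, how often a crawled page appears in search results). The function name `rank_frontier`, its parameters, and the damping value are all illustrative assumptions.

```python
def rank_frontier(links, crawled, impact, damping=0.85, iterations=50):
    """Order uncrawled (frontier) pages by a biased random-walk score.

    links:   dict mapping a page to the list of pages it links to
    crawled: set of pages that have already been downloaded
    impact:  dict mapping a page to a non-negative weight
             (hypothetical stand-in for a search-impact signal)
    """
    # Collect every page seen as a link source or a link target.
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)

    # Teleport distribution: proportional to impact, uniform as a fallback.
    total = sum(impact.get(n, 0.0) for n in nodes)
    if total > 0:
        teleport = {n: impact.get(n, 0.0) / total for n in nodes}
    else:
        teleport = {n: 1.0 / len(nodes) for n in nodes}

    score = dict(teleport)
    for _ in range(iterations):
        # Pages without known outlinks (including all frontier pages)
        # hand their mass back to the teleport distribution.
        dangling = sum(score[n] for n in nodes if not links.get(n))
        jump = (1.0 - damping) + damping * dangling
        nxt = {n: jump * teleport[n] for n in nodes}
        for n in nodes:
            out = links.get(n, [])
            for target in out:
                nxt[target] += damping * score[n] / len(out)
        score = nxt

    # Only pages that have not been downloaded yet are candidates.
    frontier = [n for n in nodes if n not in crawled]
    return sorted(frontier, key=lambda n: score[n], reverse=True)


# Toy link graph: "a" and "b" are crawled, "x" and "y" are in the frontier.
toy_links = {"a": ["b", "x"], "b": ["a", "y"]}
toy_order = rank_frontier(toy_links, {"a", "b"}, {"a": 1.0, "b": 9.0})
print(toy_order)
```

In this toy graph the frontier page linked from the higher-impact crawled page receives more of the walk's mass, so it is placed first in the download order. A production crawler would need sparse data structures and an incremental update scheme, which this sketch does not attempt.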

    Published In

    SIGIR '15: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval
    August 2015
    1198 pages
    ISBN:9781450336215
    DOI:10.1145/2766462

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. discovery
    2. frontier ranking
    3. random walks
    4. result relevance
    5. url prioritization
    6. web crawling
    7. web frontier
    8. web search engine

    Qualifiers

    • Research-article

    Funding Sources

    • LEADS project funded by the European Community
    • ERC Advanced Grant ALEXANDRIA

    Conference

    SIGIR '15

    Acceptance Rates

    SIGIR '15 Paper Acceptance Rate 70 of 351 submissions, 20%;
    Overall Acceptance Rate 792 of 3,983 submissions, 20%
