Crawling Policies Based on Web Page Popularity Prediction

  • Conference paper
Advances in Information Retrieval (ECIR 2014)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8416)

Abstract

In this paper, we focus on crawling strategies for newly discovered URLs. Since it is impossible to crawl all the new pages right after they appear, the most important (or popular) pages should be crawled with a higher priority. One natural measure of page importance is the number of user visits. However, the popularity of newly discovered URLs cannot be known in advance, and therefore should be predicted relying on URLs’ features. In this paper, we evaluate several methods for predicting new page popularity against previously investigated crawler performance measurements, and propose a novel measurement setup aiming to evaluate crawler performance more realistically. In particular, we compare short-term and long-term popularity of new ephemeral URLs by estimating the rate of popularity decay. Our experiments show that the information about popularity decay can be effectively used for optimizing ordering policies of crawlers, but further research is required to predict it accurately enough.
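As a rough illustration of how popularity decay could inform a crawl ordering policy, the sketch below ranks newly discovered URLs by the visits they are still expected to receive. It is not the method evaluated in the paper: the exponential decay model, the function names `remaining_visits` and `order_for_crawl`, and all numeric values are assumptions made only for this example.

```python
import math

def remaining_visits(predicted_total, decay_rate, age):
    """Visits still expected after `age` time units.

    Assumption (not from the paper): the visit rate decays exponentially,
    rate(t) = predicted_total * decay_rate * exp(-decay_rate * t),
    so the visits remaining after `age` are predicted_total * exp(-decay_rate * age).
    A decay_rate of 0 models a page whose popularity does not fade.
    """
    return predicted_total * math.exp(-decay_rate * age)

def order_for_crawl(candidates):
    """Sort candidate URLs so those with the most remaining visits come first.

    Each candidate is a tuple (url, predicted_total, decay_rate, age).
    """
    return sorted(
        candidates,
        key=lambda c: remaining_visits(c[1], c[2], c[3]),
        reverse=True,
    )

# Hypothetical candidates: a fast-decaying popular page, a slow-decaying
# moderately popular page, and a small page with no decay.
candidates = [
    ("a", 100.0, 1.0, 2.0),
    ("b", 50.0, 0.1, 2.0),
    ("c", 10.0, 0.0, 2.0),
]
crawl_order = [url for url, _, _, _ in order_for_crawl(candidates)]
```

Under these assumed numbers, a page with high short-term popularity can rank below a less popular page whose visits decay slowly, which is the kind of trade-off between short-term and long-term popularity that the abstract describes.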

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Ostroumova, L., Bogatyy, I., Chelnokov, A., Tikhonov, A., Gusev, G. (2014). Crawling Policies Based on Web Page Popularity Prediction. In: de Rijke, M., et al. Advances in Information Retrieval. ECIR 2014. Lecture Notes in Computer Science, vol 8416. Springer, Cham. https://doi.org/10.1007/978-3-319-06028-6_9

  • DOI: https://doi.org/10.1007/978-3-319-06028-6_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-06027-9

  • Online ISBN: 978-3-319-06028-6

  • eBook Packages: Computer Science, Computer Science (R0)
