A Dynamic Page-Refresh Index Policy for Web Crawlers

  • Conference paper
Analytical and Stochastic Modeling Techniques and Applications (ASMTA 2014)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 8499)

Abstract

This paper considers a Markovian model for the optimal dynamic scheduling of page refreshes in a local repository of copies of randomly evolving remote web pages. A limited number of refresh agents, e.g., crawlers for web search engines, visit the remote pages to refresh their copies, which raises the need for effective scheduling policies. Maintaining the copies yields utilities and incurs costs, which are incorporated into a performance objective to be optimized. The paper develops a low-complexity, closed-form heuristic dynamic index policy, together with an upper bound on the optimal performance, by adapting a general approach of Whittle. The existence and evaluation of the index are resolved by methods introduced earlier by the author. A numerical study provides evidence that the proposed policy is consistently near optimal and may substantially outperform a myopic baseline policy.
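To make the notion of an index policy concrete, the following minimal Python sketch shows the general pattern: each page is assigned a scalar index computed from its own state, and the available refresh agents are sent to the pages with the largest index values. The index used here is a simple myopic placeholder (page weight times the probability that the local copy is stale, assuming page changes follow a Poisson process); it is not the closed-form index derived in the paper, which is not reproduced in the abstract. The field names `weight`, `change_rate`, and `elapsed`, and the function names, are hypothetical and chosen only for illustration.

```python
import heapq
import math


def stale_probability(change_rate, elapsed):
    """Probability that a page changed at least once during `elapsed`
    time units, ASSUMING changes arrive as a Poisson process."""
    return 1.0 - math.exp(-change_rate * elapsed)


def myopic_index(weight, change_rate, elapsed):
    """Placeholder index: expected immediate gain from refreshing now.
    This is a myopic baseline, NOT the paper's closed-form index."""
    return weight * stale_probability(change_rate, elapsed)


def select_pages(pages, num_agents, index_fn=myopic_index):
    """Return the ids of the `num_agents` pages with the largest index."""
    scored = [
        (index_fn(p["weight"], p["change_rate"], p["elapsed"]), p["id"])
        for p in pages
    ]
    return [page_id for _, page_id in heapq.nlargest(num_agents, scored)]


# Hypothetical repository state: three pages, two refresh agents.
pages = [
    {"id": "a", "weight": 1.0, "change_rate": 1.0, "elapsed": 1.0},
    {"id": "b", "weight": 1.0, "change_rate": 0.1, "elapsed": 1.0},
    {"id": "c", "weight": 5.0, "change_rate": 0.1, "elapsed": 1.0},
]
chosen = select_pages(pages, num_agents=2)
```

Passing a different `index_fn` swaps the prioritization rule without touching the selection step, which is the sense in which an index policy decouples per-page valuation from the allocation of a limited number of agents.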


References

  1. Wolf, J.L., Squillante, M.S., Yu, P.S., Sethuraman, J., Ozsen, L.: Optimal crawling strategies for web search engines. In: Proc. 11th Int. Conf. World Wide Web, WWW 2002, pp. 136–147. ACM, New York (2002)

  2. Cho, J., García-Molina, H.: Effective page refresh policies for Web crawlers. ACM Trans. Database Syst. 28, 390–426 (2003)

  3. Ling, Y., Mi, J.: An optimal trade-off between content freshness and refresh cost. J. Appl. Probab. 41, 721–734 (2004)

  4. Lewandowski, D.: A three-year study on the freshness of web search engine databases. J. Information Sci. 34, 817–831 (2008)

  5. Olston, C., Najork, M.: Web crawling. Found. Trends Info. Retrieval 4, 175–246 (2010)

  6. Raiss-El-Fenni, M., El-Azouzi, R., Menasché, D., Xu, Y.: Optimal sensing policies for smartphones in hybrid networks: A POMDP approach. In: Proc. 6th Int. Conf. Performance Eval. Method. Tools (VALUETOOLS 2012), pp. 89–98. ICST (2012)

  7. Papadimitriou, C.H., Tsitsiklis, J.N.: The complexity of optimal queuing network control. Math. Oper. Res. 24, 293–305 (1999)

  8. Whittle, P.: Restless bandits: Activity allocation in a changing world. In: Gani, J. (ed.) A Celebration of Applied Probability, UK. J. Appl. Probab. Trust, Sheffield, vol. 25, pp. 287–298 (1988)

  9. Niño-Mora, J.: Restless bandits, partial conservation laws and indexability. Adv. Appl. Probab. 33, 76–98 (2001)

  10. Niño-Mora, J.: Dynamic allocation indices for restless projects and queueing admission control: A polyhedral approach. Math. Program. 93, 361–413 (2002)

  11. Niño-Mora, J.: Restless bandit marginal productivity indices, diminishing returns and optimal control of make-to-order/make-to-stock M/G/1 queues. Math. Oper. Res. 31, 50–84 (2006)

  12. Niño-Mora, J.: Dynamic priority allocation via restless bandit marginal productivity indices. Top 15, 161–198 (2007)

  13. Bertsekas, D.P.: Nonlinear Programming, 2nd edn. Athena Scientific, Nashua (1999)

  14. Weber, R.R., Weiss, G.: On an index policy for restless bandits. J. Appl. Probab. 27, 637–648 (1990)

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Niño-Mora, J. (2014). A Dynamic Page-Refresh Index Policy for Web Crawlers. In: Sericola, B., Telek, M., Horváth, G. (eds) Analytical and Stochastic Modeling Techniques and Applications. ASMTA 2014. Lecture Notes in Computer Science, vol 8499. Springer, Cham. https://doi.org/10.1007/978-3-319-08219-6_4

  • DOI: https://doi.org/10.1007/978-3-319-08219-6_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-08218-9

  • Online ISBN: 978-3-319-08219-6

  • eBook Packages: Computer Science, Computer Science (R0)