Incremental Crawling

Synonyms

Crawler; Spidering

Definition

Part of the success of the World Wide Web arises from its lack of central control, which allows every owner of a computer to contribute to a universally shared information space. The web's size and lack of central control present a challenge for any global calculation that treats the web as a distributed database. This scalability issue is typically handled by creating a central repository of web pages that is optimized for large-scale calculations. Creating the repository consists of maintaining a data structure of URLs to fetch; URLs are repeatedly selected from it, their content is fetched, and the repository is updated. This process is called crawling or spidering.
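
The select, fetch, and update loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any particular crawler: the names `crawl`, `Fetcher`, `TOY_WEB`, and `toy_fetch` are hypothetical, the frontier is a simple first-in, first-out queue, and a small in-memory dictionary stands in for the web so the example runs without network access.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical fetcher type: given a URL, return (content, outgoing links).
Fetcher = Callable[[str], Tuple[str, List[str]]]


def crawl(seeds: Iterable[str], fetch: Fetcher, max_pages: int = 100) -> Dict[str, str]:
    """Build a repository (URL -> content) starting from the seed URLs."""
    frontier = deque(seeds)   # data structure of URLs still to fetch
    seen = set(frontier)      # URLs already queued, so none is queued twice
    repository: Dict[str, str] = {}

    while frontier and len(repository) < max_pages:
        url = frontier.popleft()            # select a URL (FIFO order here)
        try:
            content, links = fetch(url)     # fetch its content
        except Exception:
            continue                        # assumed policy: skip URLs that fail
        repository[url] = content           # update the repository
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return repository


# Toy stand-in for the web (an assumption for illustration only).
TOY_WEB = {
    "a": ("page A", ["b", "c"]),
    "b": ("page B", ["a"]),
    "c": ("page C", ["d"]),   # "d" does not exist, so fetching it fails
}


def toy_fetch(url: str) -> Tuple[str, List[str]]:
    return TOY_WEB[url]       # raises KeyError for unknown URLs


repo = crawl(["a"], toy_fetch)
```

A real crawler would replace `toy_fetch` with an HTTP client and would usually select URLs by priority rather than in arrival order; the queue is used here only to keep the sketch short.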

Unfortunately, maintaining a consistent shadow repository is complicated by the dynamic and uncoordinated nature of the web. URLs are constantly being created or destroyed, and contents of URLs may change without notice. As a result, there will always be URLs for which the...

Recommended Reading

  1. Cho J, Garcia-Molina H. Effective page refresh policies for web crawlers. ACM Trans Database Syst. 2003;28(4):390–426.

  2. Coffman Jr EG, Liu Z, Weber RR. Optimal robot scheduling for web search engines. J Sched. 1998;1(1):15–29.

  3. Dikaiakos MD, Stassopoulou A, Papageorgiou L. An investigation of web crawler behavior: characterization and metrics. Comput Commun. 2005;28(8):880–97.

  4. Edwards J, McCurley KS, Tomlin J. An adaptive model for optimizing performance of an incremental web crawler. In: Proceedings of the 10th International World Wide Web Conference; 2001. p. 106–13.

  5. Fielding R, Gettys J, Mogul J, Frystyk H, Masinter L, Leach P, Berners-Lee T. Hypertext transfer protocol – HTTP/1.1, RFC 2616. http://www.w3.org/Protocols/rfc2616/rfc2616.html

  6. Podlipnig S, Böszörmenyi L. A survey of web cache replacement strategies. ACM Comput Surv. 2003;35(4):374–98.

  7. Sitemap protocol specification. http://www.sitemaps.org/protocol.php

  8. Wang J. A survey of web caching schemes for the internet. ACM SIGCOMM Comput Commun Rev. 1999;29(5):36–46.

  9. Yuan X, MacGregor MH, Harms J. An efficient scheme to remove crawler traffic from the internet. In: Proceedings of the 11th International Conference on Computer Communications and Networks; 2002. p. 90–5.

Corresponding author

Correspondence to Kevin S. McCurley.

Copyright information

© 2018 Springer Science+Business Media, LLC, part of Springer Nature

Cite this entry

McCurley, K.S. (2018). Incremental Crawling. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_196
