IRLbot: Scaling to 6 billion pages and beyond

Published: 03 July 2009

Abstract

This article shares our experience in designing a Web crawler that can download billions of pages using a single-server implementation and models its performance. We first show that current crawling algorithms cannot effectively cope with the sheer volume of URLs generated in large crawls, highly branching spam, legitimate multimillion-page blog sites, and infinite loops created by server-side scripts. We then offer a set of techniques for dealing with these issues and test their performance in an implementation we call IRLbot. In our recent experiment that lasted 41 days, IRLbot running on a single server successfully crawled 6.3 billion valid HTML pages (7.6 billion connection requests) and sustained an average download rate of 319 Mb/s (1,789 pages/s). Unlike our prior experiments with algorithms proposed in related work, this version of IRLbot did not experience any bottlenecks and successfully handled content from over 117 million hosts, parsed out 394 billion links, and discovered a subset of the Web graph with 41 billion unique nodes.
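The headline figures in the abstract can be cross-checked with simple arithmetic. The sketch below is not from the article: it uses only the rounded numbers quoted above, reads the reported download rate as megabits per second, and assumes all of that bandwidth is attributable to the valid HTML pages. The function and variable names are illustrative.

```python
# Back-of-the-envelope check of the abstract's throughput figures.
# All inputs are the rounded values quoted in the abstract.

SECONDS_PER_DAY = 86_400

PAGES = 6.3e9                 # valid HTML pages crawled
DAYS = 41                     # crawl duration
RATE_MEGABITS_PER_S = 319     # average download rate (assumed to mean megabits/s)
REPORTED_PAGES_PER_S = 1_789  # page rate stated in the abstract
LINKS = 394e9                 # links parsed out of the downloaded pages


def pages_per_second(pages: float, days: float) -> float:
    """Average page rate implied by a page count and a duration in days."""
    return pages / (days * SECONDS_PER_DAY)


def avg_bytes_per_page(rate_megabits_per_s: float, pages_per_s: float) -> float:
    """Average page size if the whole bit rate is spent on the counted pages."""
    return rate_megabits_per_s * 1e6 / 8 / pages_per_s


derived_rate = pages_per_second(PAGES, DAYS)
relative_gap = abs(derived_rate - REPORTED_PAGES_PER_S) / REPORTED_PAGES_PER_S
page_size = avg_bytes_per_page(RATE_MEGABITS_PER_S, REPORTED_PAGES_PER_S)
links_per_page = LINKS / PAGES

print(f"derived rate:   {derived_rate:,.0f} pages/s")
print(f"gap vs. stated: {relative_gap:.2%}")
print(f"avg page size:  {page_size / 1000:.1f} kB")
print(f"links per page: {links_per_page:.1f}")
```

With the rounded inputs, the derived rate lands within about one percent of the stated 1,789 pages/s, which is consistent with rounding of the page count and crawl duration. The implied average page size is on the order of 22 kB, and each page contributes roughly 60 links on average.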


Published In

ACM Transactions on the Web, Volume 3, Issue 3
June 2009
179 pages
ISSN:1559-1131
EISSN:1559-114X
DOI:10.1145/1541822
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 03 July 2009
Accepted: 01 March 2009
Revised: 01 February 2009
Received: 01 March 2008
Published in TWEB Volume 3, Issue 3

Author Tags

  1. IRLbot
  2. crawling
  3. large scale

Qualifiers

  • Research-article
  • Research
  • Refereed

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 9
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 01 Mar 2025

Cited By

  • (2024) Sprinter. Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, 893-906. DOI: 10.5555/3691825.3691875. Online publication date: 16-Apr-2024
  • (2023) Web Tarayıcılarında Tohum URL Seçimi ve Performans Analizi: Kapsamlı Bir İnceleme (Seed URL Selection and Performance Analysis in Web Crawlers: A Comprehensive Review). Düzce Üniversitesi Bilim ve Teknoloji Dergisi 11:3, 1399-1423. DOI: 10.29130/dubited.1097123. Online publication date: 31-Jul-2023
  • (2019) On Efficient External-Memory Triangle Listing. IEEE Transactions on Knowledge and Data Engineering 31:8, 1555-1568. DOI: 10.1109/TKDE.2018.2858820. Online publication date: 1-Aug-2019
  • (2018) Mining the web with webcoin. Proceedings of the 14th International Conference on emerging Networking EXperiments and Technologies, 165-177. DOI: 10.1145/3281411.3281415. Online publication date: 4-Dec-2018
  • (2018) Unsupervised Domain Ranking in Large-Scale Web Crawls. ACM Transactions on the Web 12:4, 1-29. DOI: 10.1145/3182180. Online publication date: 27-Sep-2018
  • (2018) BUbiNG. ACM Transactions on the Web 12:2, 1-26. DOI: 10.1145/3160017. Online publication date: 1-Jun-2018
  • (2016) On Efficient External-Memory Triangle Listing. 2016 IEEE 16th International Conference on Data Mining (ICDM), 101-110. DOI: 10.1109/ICDM.2016.0021. Online publication date: Dec-2016
  • (2015) Set Cover at Web Scale. Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1125-1133. DOI: 10.1145/2783258.2783315. Online publication date: 10-Aug-2015
  • (2015) UniCrawl. Proceedings of the 2015 IEEE 8th International Conference on Cloud Computing, 389-396. DOI: 10.1109/CLOUD.2015.59. Online publication date: 27-Jun-2015
  • (2014) ARCOMEM Crawling Architecture. Future Internet 6:3, 518-541. DOI: 10.3390/fi6030518. Online publication date: 19-Aug-2014
