IRLbot: Scaling to 6 billion pages and beyond

Authors:

Hsin-Tsang Lee,

Dmitri Loguinov

ACM Transactions on the Web (TWEB), Volume 3, Issue 3

Article No.: 8, Pages 1 - 34

https://doi.org/10.1145/1541822.1541823

Published: 03 July 2009

Abstract

This article shares our experience in designing a Web crawler that can download billions of pages using a single-server implementation and models its performance. We first show that current crawling algorithms cannot effectively cope with the sheer volume of URLs generated in large crawls, highly branching spam, legitimate multimillion-page blog sites, and infinite loops created by server-side scripts. We then offer a set of techniques for dealing with these issues and test their performance in an implementation we call IRLbot. In our recent experiment that lasted 41 days, IRLbot running on a single server successfully crawled 6.3 billion valid HTML pages (7.6 billion connection requests) and sustained an average download rate of 319 Mb/s (1,789 pages/s). Unlike our prior experiments with algorithms proposed in related work, this version of IRLbot did not experience any bottlenecks and successfully handled content from over 117 million hosts, parsed out 394 billion links, and discovered a subset of the Web graph with 41 billion unique nodes.

Published In

ACM Transactions on the Web Volume 3, Issue 3

June 2009

179 pages

ISSN:1559-1131

EISSN:1559-114X

DOI:10.1145/1541822

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 03 July 2009

Accepted: 01 March 2009

Revised: 01 February 2009

Received: 01 March 2008

Published in TWEB Volume 3, Issue 3

Qualifiers

Research-article
Research
Refereed
