research-article

IRLbot: scaling to 6 billion pages and beyond

Authors:

Hsin-Tsang Lee,

Dmitri Loguinov

WWW '08: Proceedings of the 17th international conference on World Wide Web

Pages 427 - 436

https://doi.org/10.1145/1367497.1367556

Published: 21 April 2008

Abstract

This paper shares our experience in designing a web crawler that can download billions of pages using a single-server implementation and models its performance. We show that with the quadratically increasing complexity of verifying URL uniqueness, BFS crawl order, and fixed per-host rate-limiting, current crawling algorithms cannot effectively cope with the sheer volume of URLs generated in large crawls, highly-branching spam, legitimate multi-million-page blog sites, and infinite loops created by server-side scripts. We offer a set of techniques for dealing with these issues and test their performance in an implementation we call IRLbot. In our recent experiment that lasted 41 days, IRLbot running on a single server successfully crawled 6.3 billion valid HTML pages (7.6 billion connection requests) and sustained an average download rate of 319 Mb/s (1,789 pages/s). Unlike our prior experiments with algorithms proposed in related work, this version of IRLbot did not experience any bottlenecks and successfully handled content from over 117 million hosts, parsed out 394 billion links, and discovered a subset of the web graph with 41 billion unique nodes.
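As a rough sanity check on the throughput figures quoted in the abstract, the following sketch recomputes the page rate and the implied average page size from the rounded numbers given there. It assumes that the reported rate is in megabits per second and that the crawl lasted exactly 41 days; both are assumptions, and the function and constant names are hypothetical. Because the inputs are rounded, the results can only approximately match the reported values.

```python
# Rounded figures as quoted in the abstract.
PAGES = 6.3e9        # valid HTML pages crawled
DAYS = 41            # crawl duration, assumed to be exactly 41 days
RATE_MBPS = 319      # average download rate, assumed to be megabits per second
RATE_PAGES = 1789    # reported average pages per second

SECONDS = DAYS * 24 * 3600  # crawl duration in seconds


def pages_per_second(pages=PAGES, seconds=SECONDS):
    """Average page rate implied by the page count and crawl duration."""
    return pages / seconds


def bytes_per_page(rate_mbps=RATE_MBPS, rate_pages=RATE_PAGES):
    """Average bytes downloaded per page implied by the two reported rates."""
    bytes_per_second = rate_mbps * 1e6 / 8  # megabits/s -> bytes/s
    return bytes_per_second / rate_pages


if __name__ == "__main__":
    print(f"recomputed page rate: {pages_per_second():.0f} pages/s")
    print(f"implied page size:    {bytes_per_page() / 1000:.1f} KB")
```

With these rounded inputs, the recomputed page rate lands within about one percent of the reported 1,789 pages/s, and the implied average download per page is on the order of 22 KB.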

Index Terms

IRLbot: scaling to 6 billion pages and beyond
1. General and reference
  1. Cross-computing tools and techniques
    1. Measurement
    2. Metrics

Published In

WWW '08: Proceedings of the 17th international conference on World Wide Web

April 2008

1326 pages

ISBN:9781605580852

DOI:10.1145/1367497

General Chairs:
Jinpeng Huai, Beihang University, China
Robin Chen, AT&T Labs, USA
Hsiao-Wuen Hon, Microsoft Research Asia, China
Yunhao Liu, HK University of Science and Technology, Hong Kong

Program Chairs:
Wei-Ying Ma, Microsoft Research Asia, China
Andrew Tomkins, Yahoo! Research, USA
Xiaodong Zhang, The Ohio State University, USA

Copyright © 2008 ACM.

Sponsors

ACM: Association for Computing Machinery

In-Cooperation

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

WWW '08

Sponsor:

ACM

WWW '08: The 17th International World Wide Web Conference

April 21 - 25, 2008

Beijing, China

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 53
Total Downloads: 626
Downloads (last 12 months): 1
Downloads (last 6 weeks): 0

Reflects downloads up to 01 Mar 2025

Citations

Cited By

Arman A, Loguinov D (2022) Origami, Proceedings of the VLDB Endowment, 15:2 (259-271), doi:10.14778/3489496.3489507. Online publication date: 4-Feb-2022
https://dl.acm.org/doi/10.14778/3489496.3489507

Tsai Y, Lin C, Lee M (2022) Analysis of Application Data Mining to Capture Consumer Review Data on Booking Websites, Mobile Information Systems, 2022, doi:10.1155/2022/3062953. Online publication date: 1-Jan-2022
https://dl.acm.org/doi/10.1155/2022/3062953

Moyano A, Sicilia M, Barriocanal E (2018) On the Graph Structure of the Web of Data, International Journal on Semantic Web & Information Systems, 14:2 (70-85), doi:10.4018/IJSWIS.2018040104. Online publication date: 1-Apr-2018
https://dl.acm.org/doi/10.4018/IJSWIS.2018040104

Cambazoglu B, Baeza-Yates R, Perego R, Sebastiani F, Aslam J, Ruthven I, Zobel J (2016) Scalability and Efficiency Challenges in Large-Scale Web Search Engines, Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval (1223-1226), doi:10.1145/2911451.2914808. Online publication date: 7-Jul-2016
https://dl.acm.org/doi/10.1145/2911451.2914808

Naghavi M, Sharifi M (2015) An online system for notification of changes to blogging space to achieve information domination, Journal of Web Engineering, 14:3-4 (215-233), doi:10.5555/2871264.2871267. Online publication date: 1-Jul-2015
https://dl.acm.org/doi/10.5555/2871264.2871267

Cambazoglu B, Baeza-Yates R (2015) Scalability Challenges in Web Search Engines, Synthesis Lectures on Information Concepts, Retrieval, and Services, 7:6 (1-138), doi:10.2200/S00662ED1V01Y201508ICR045. Online publication date: 29-Dec-2015
https://doi.org/10.2200/S00662ED1V01Y201508ICR045

Tran G, Turk A, Cambazoglu B, Nejdl W, Baeza-Yates R, Lalmas M, Moffat A, Ribeiro-Neto B (2015) A Random Walk Model for Optimization of Search Impact in Web Frontier Ranking, Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (153-162), doi:10.1145/2766462.2767737. Online publication date: 9-Aug-2015
https://dl.acm.org/doi/10.1145/2766462.2767737

Ahmed S, Loguinov D (2015) Modeling randomized data streams in caching, data processing, and crawling applications, 2015 IEEE Conference on Computer Communications (INFOCOM) (1625-1633), doi:10.1109/INFOCOM.2015.7218542. Online publication date: Apr-2015
https://doi.org/10.1109/INFOCOM.2015.7218542

Ahmed S, Sparkman C, Lee H, Loguinov D (2015) Around the web in six weeks: Documenting a large-scale crawl, 2015 IEEE Conference on Computer Communications (INFOCOM) (1598-1606), doi:10.1109/INFOCOM.2015.7218539. Online publication date: Apr-2015
https://doi.org/10.1109/INFOCOM.2015.7218539

Li Z, Tan Y, Guo H, Feng C (2014) An Improved Shark Search Algorithm Based on Domain Ontology, Applied Mechanics and Materials, 651-653 (2252-2257), doi:10.4028/www.scientific.net/AMM.651-653.2252. Online publication date: Sep-2014
https://doi.org/10.4028/www.scientific.net/AMM.651-653.2252
