research-article

Unsupervised Domain Ranking in Large-Scale Web Crawls

Authors:

Clint Sparkman,

Hsin-Tsang Lee,

Dmitri LoguinovAuthors Info & Claims

ACM Transactions on the Web (TWEB), Volume 12, Issue 4

Article No.: 26, Pages 1 - 29

https://doi.org/10.1145/3182180

Published: 27 September 2018 Publication History

Abstract

With the proliferation of web spam and infinite autogenerated web content, large-scale web crawlers require low-complexity ranking methods to effectively budget their limited resources and allocate bandwidth to reputable sites. In this work, we assume crawls that produce frontiers orders of magnitude larger than RAM, where sorting of pending URLs is infeasible in real time. Under these constraints, the main objective is to quickly compute domain budgets and decide which of them can be massively crawled. Those ranked at the top of the list receive aggressive crawling allowances, while all other domains are visited at some small default rate. To shed light on Internet-wide spam avoidance, we study topology-based ranking algorithms on domain-level graphs from the two largest academic crawls: a 6.3B-page IRLbot dataset and a 1B-page ClueWeb09 exploration. We first propose a new methodology for comparing the various rankings and then show that in-degree BFS-based techniques decisively outperform classic PageRank-style methods, including TrustRank. However, since BFS requires several orders of magnitude higher overhead and is generally infeasible for real-time use, we propose a fast, accurate, and scalable estimation method called TSE that can achieve much better crawl prioritization in practice. It is especially beneficial in applications with limited hardware resources.

References

[1]

Jacob Abernethy, Olivier Chapelle, and Carlos Castillo. 2008. Web spam identification through content and hyperlinks. In Workshop on Adversarial Information Retrieval on the Web. 41--44.

Digital Library

[2]

Serge Abiteboul, Mihai Preda, and Gregory Cobena. 2003. Adaptive on-line page importance computation. In International Conference on World Wide Web (WWW). 280--290.

Digital Library

[3]

Sarker Tanzir Ahmed and Dmitri Loguinov. 2015. Around the web in six weeks: Documenting a large-scale crawl. In IEEE Conference on Computer Communications (INFOCOM). 1598--1606.

[4]

alexa.com. Retrieved in 2015 from http://www.alexa.com/.

[5]

Jesse Alpert and Nissan Hajaj. 2008. We Knew the Web Was Big... (July 2008). Retrieved from http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html.

[6]

Shatlyk Ashyralyyev, B. Barla Cambazoglu, and Cevdet Aykanat. 2013. Incorporating the surfing behavior of web users into pagerank. In ACM International Conference on Information and Knowledge Management (CIKM). 2351--2356.

Digital Library

[7]

Ricardo Baeza-Yates, Carlos Castillo, Mauricio Marin, and Andrea Rodriguez. 2005. Crawling a country: Better strategies than breadth-first for web page ordering. In International Conference on World Wide Web (WWW).

Digital Library

[8]

Xiao Bai, B. Barla Cambazoglu, and Flavio P. Junqueira. 2011. Discovering URLs through user feedback. In ACM International Conference on Information and Knowledge Management (CIKM). 77--86.

Digital Library

[9]

Luca Becchetti, Carlos Castillo, Debora Donato, Stefano Leonardi, and Ricardo Baeza-Yates. 2006. Link-based characterization and detection of web spam. In Workshop on Adversarial Information Retrieval on the Web.

[10]

Luca Becchetti, Carlos Castillo, Debora Donato, Stefano Leonardi, and Ricardo Baeza-Yates. 2006. Using rank propagation and probabilistic counting for link-based spam detection. In Workshop on Web Mining and Web Usage Analysis (WebKDD).

[11]

Andras Benczur, Karoly Csalogany, and Tamas Sarlos. 2006. Link-based similarity search to fight web spam. In Workshop on Adversarial Information Retrieval on the Web.

[12]

Andras A. Benczur, Karoly Csalogany, Tamas Sarlos, and Mate Uher. 2005. Spamrank -- Fully automatic link spam detection. In Workshop on Adversarial Information Retrieval on the Web.

[13]

Paolo Boldi, Bruno Codenotti, Massimo Santini, and Sebastiano Vigna. 2004. UbiCrawler: A scalable fully distributed web crawler. Software: Practice 8 Experience 34, 8 (July 2004), 711--726.

Digital Library

[14]

Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual web search engine. In International Conference on World Wide Web (WWW). 107--117.

Digital Library

[15]

Carlos Castillo, Mauricio Marin, Andrea Rodriguez, and Ricardo Baeza-Yates. 2004. Scheduling algorithms for web crawling. In Latin American Web Conference.

Digital Library

[16]

James Caverlee and Ling Liu. 2007. Countering web spam with credibility-based link analysis. In ACM Symposium on Principles of Distributed Computing (PODC). 157--166.

Digital Library

[17]

James Caverlee, Steve Webb, and Ling Liu. 2007. Spam-resilient web rankings via influence throttling. In International Parallel and Distributed Processing Symposium (IPDPS).

[18]

Anne Chao and Shen-Ming Lee. 1992. Estimating the number of classes via sample coverage. Journal of the American Statistical Association 87, 417 (March 1992), 210--217.

[19]

Yen-Yu Chen, Qingqing Gan, and Torsten Suel. 2002. I/O-efficient techniques for computing pagerank. In International Conference on Information and Knowledge Management (CIKM).

Digital Library

[20]

Yen-Yu Chen, Qingqing Gan, and Torsten Suel. 2004. Local methods for estimating pagerank values. In International Conference on Information and Knowledge Management (CIKM).

Digital Library

[21]

Junghoo Cho, Hector Garcia-Molina, Taher Haveliwala, Wang Lam, Andreas Paepcke, and Sriram Raghavan Gary Wesley. 2006. Stanford webbase components and applications. ACM Transactions on Internet Technology 6, 2 (May 2006), 153--186.

Digital Library

[22]

Junghoo Cho, Hector Garcia-Molina, and Lawrence Page. 1998. Efficient crawling through URL ordering. In International Conference on World Wide Web (WWW). 161--172.

Digital Library

[23]

Young-Joo Chung, Masashi Toyoda, and Masaru Kitsuregawa. 2009. A study of link farm distribution and evolution using a time series of web snapshots. In Workshop on Adversarial Information Retrieval on the Web. 9--16.

Digital Library

[24]

ClueWeb09 Dataset. Retrieved in 2015 from http://www.lemurproject.org/clueweb09/.

[25]

CLuE cluster. Retrieved from http://www.nsf.gov/cise/clue/index.jsp.

[26]

ClueWeb09 documentation. Retrieved in 2015 from http://boston.lti.cs.cmu.edu/Data/web08-bst/planning.html.

[27]

Isabel Drost and Tobias Scheffer. 2005. Thwarting the nigritude ultramaring: Learning to identify link spam. In European Conference on Machine Learning (ECML).

Digital Library

[28]

N. Duffield, C. Lund, and M. Thorup. 2003. Estimating flow distributions from sampled flow statistics. In ACM SIGCOMM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications. 325--336.

Digital Library

[29]

Jenny Edwards, Kevin McCurley, and John Tomlin. 2001. An adaptive model for optimizing performance of an incremental web crawler. In International Conference on World Wide Web (WWW). 106--113.

Digital Library

[30]

David Eichmann. 1994. The RBSE spider -- Balancing effective search against web load. In International Conference on World Wide Web (WWW).

[31]

Nadav Eiron, Kevin S. McCurley, and John A. Tomlin. 2004. Ranking the web frontier. In International Conference on World Wide Web (WWW). 309--318.

Digital Library

[32]

Guang Feng, Tie-Yan Liu, Ying Wang, Ying Bao, Zhiming Ma, Xu-Dong Zhang, and Wei-Ying Ma. 2006. AggregateRank: Bringing order to web sites. In ACM SIGIR Conference on Research and Development in Information Retrieval. 75--82.

Digital Library

[33]

Dennis Fetterly, Mark Manasse, and Marc Najork. 2004. Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In International Workshop on the Web and Databases (WebDB). 1--6.

Digital Library

[34]

Google Toolbar. Retrieved in 2008 from http://toolbar.google.com.

[35]

IRLbot PLD graph. Retrieved in 2018 from http://irl.cs.tamu.edu/projects/web/.

[36]

D. Gruhl, L. Chavet, D. Gibson, J. Meyer, P. Pattanayak, A. Tomkins, and J. Zien. 2004. How to build a webfountain: An architecture for very large-scale text analytics. IBM Systems Journal 43, 1 (2004), 64--77.

Digital Library

[37]

Zoltán Gyöngyi, Pavel Berkhin, Hector Garcia-Molina, and Jan Pedersen. 2006. Link spam detection based on mass estimation. In International Conference on Very Large Data Bases (VLDB). 439--450.

Digital Library

[38]

Zoltán Gyöngyi and Hector Garcia-Molina. 2005. Link spam alliances. In International Conference on Very Large Data Bases (VLDB). 517--528.

Digital Library

[39]

Zoltán Gyöngyi and Hector Garcia-Molina. 2005. Web spam taxonomy. In Workshop on Adversarial Information Retrieval on the Web. 39--47.

[40]

Zoltán Gyöngyi, Hector Garcia-Molina, and Jan Pedersen. 2004. Combating web spam with trustRank. In International Conference on Very Large Data Bases (VLDB). 576--587.

Digital Library

[41]

Younes Hafri and Chabane Djeraba. 2004. High performance crawling system. In ACM International Workshop on Multimedia Information Retrieval (MIR). 299--306.

Digital Library

[42]

Allan Heydon and Marc Najork. 1999. Mercator: A scalable, extensible web crawler. World Wide Web 2, 4 (Dec. 1999), 219--229.

Digital Library

[43]

Jun Hirai, Sriram Raghavan, Hector Garcia-Molina, and Andreas Paepcke. 2000. WebBase: A repository of web pages. In International Conference on World Wide Web (WWW). 277--293.

Digital Library

[44]

Internet Archive. Retrieved in 2018 from http://archive.org/.

[45]

Melody Ivory and Marti Hearst. 2002. Statistical profiles of highly-rated web sites. In Conference on Human Factors in Computing Systems (CHI). 367--374.

Digital Library

[46]

Kasom Koht-arsa and Surasak Sanguanpong. 2002. High performance large scale web spider architecture. In IEEE International Symposium on Communications and Information Technologies (ISCIT).

[47]

Laboratory for Web Algorithmics. Retrieved in 2018 from http://law.dsi.unimi.it/.

[48]

Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov. 2009. IRLbot: Scaling to 6 billion pages and beyond. ACM Transactions on the Web 3, 3 (June 2009), 1--34.

Digital Library

[49]

Ronny Lempel and Shlomo Moran. 2001. SALSA: The stochastic approach for link-structure analysis. ACM Transactions on Information Systems 19, 2 (Apr. 2001), 131--160.

Digital Library

[50]

Yanhong Li. 1998. Toward a qualitative search engine. IEEE Internet Computing 2, 4 (July 1998), 24--29.

Digital Library

[51]

Yuting Liu, Bin Gao, Tie-Yan Liu, Ying Zhang, Zhiming Ma, Shuyuan He, and Hang Li. 2008. BrowseRank: Letting web users vote for page importance. In ACM SIGIR Conference on Research and Development in Information Retrieval. 451--458.

Digital Library

[52]

Oliver A. McBryan. 1994. GENVL and WWWW: Tools for taming the web. In International Conference on World Wide Web (WWW).

[53]

Gilad Mishne, David Carmel, and Ronny Lempel. 2005. Blocking blog spam with language model disagreement. In Workshop on Adversarial Information Retrieval on the Web.

[54]

Mozilla. 2013. Public Suffix List. (May 2013). Retrieved from http://publicsuffix.org/.

[55]

Marc Najork and Allan Heydon. 2001. High-Performance Web Crawling. Technical Report 173. Compaq SRC. Retrieved from http://www.hpl.hp.com/techreports/Compaq-DEC/SRC-RR-173.pdf.

[56]

Marc Najork and Janet L. Wiener. 2001. Breadth-first search crawling yields high-quality pages. In International Conference on World Wide Web (WWW). 114--118.

Digital Library

[57]

Zaiqing Nie, Yuanzhi Zhang, Ji-Rong Wen, and Wei-Ying Ma. 2005. Object-level ranking: Bringing order to web objects. In International Conference on World Wide Web (WWW).

Digital Library

[58]

Alexandros Ntoulas, Marc Najork, Mark Manasse, and Dennis Fetterly. 2006. Detecting spam web pages through content analysis. In International Conference on World Wide Web (WWW). 83--92.

Digital Library

[59]

Nutch. Retrieved in 2018 from http://nutch.apache.org/.

[60]

Jothi Padmanabhan. 2010. Introduction to Grid Computing via Map-Reduce 8 Hadoop. (Dec. 2010). Retrieved from http://internationalnetworking.iu.edu/sites/internationalnetworking.iu.edu/files/indo-us-preso.pdf.

[61]

Brian Pinkerton. 1994. Finding what people want: Experiences with the web crawler. In International Conference on World Wide Web (WWW).

[62]

Brian Pinkerton. 2000. WebCrawler: Finding What People Want. Ph.D. Dissertation. University of Washington.

Digital Library

[63]

Jakub Piskorski, Marcin Sydow, and Dawid Weiss. 2008. Exploring linguistic features for web spam detection: A preliminary study. In Workshop on Adversarial Information Retrieval on the Web. 25--28.

Digital Library

[64]

Matthew Richardson, Amit Prakash, and Eric Brill. 2006. Beyond Pagerank: Machine learning for static ranking. In International Conference on World Wide Web (WWW). 707--715.

Digital Library

[65]

Hiroo Saito, Masashi Toyoda, Masaru Kitsuregawa, and Kazuyuki Aihara. 2007. A large-scale study of link spam detection by graph algorithms. In Workshop on Adversarial Information Retrieval on the Web. 45--48.

Digital Library

[66]

Vladislav Shkapenyuk and Torsten Suel. 2002. Design and implementation of a high-performance distributed web crawler. In IEEE International Conference on Data Engineering (ICDE). 357--368.

Digital Library

[67]

Amit Signhal. 2012. Breakfast with Google’s Search Team. (Aug. 2012). Retrieved from https://www.youtube.com/watch?v=8a2VmxqFg8A.

[68]

Aameek Singh, Mudhakar Srivatsa, Ling Liu, and Todd Miller. 2003. Apoidea: A decentralized peer-to-peer architecture for crawling the world wide web. In SIGIR Workshop on Distributed Information Retrieval. 126--142.

[69]

William Stearns. 2008. SpamAssassin Domain Blacklist. (Aug. 2008). Retrieved from http://www.stearns.org/sa-blacklist/.

[70]

T. Suel, C. Mathur, J. Wu, J. Zhang, A. Delis, M. Kharrazi, X. Long, and K. Shanmugasundaram. 2003. ODISSEA: A peer-to-peer architecture for scalable web search and information retrieval. In International Workshop on the Web and Databases (WebDB). 67--72.

[71]

The Stanford WebBase Project. 2008. WebBase Archive. (Sept. 2008). Retrieved from http://dbpubs.stanford.edu:8091/&sim;testbed/doc2/WebBase/.

[72]

John Tomlin. 2003. A new paradigm for ranking pages on the world wide web. In International Conference on World Wide Web (WWW).

Digital Library

[73]

Trystan Upstill, Nick Craswell, and David Hawking. 2003. Predicting fame and fortune: PageRank or indegree. In 8th Australasian Document Computing Symposium.

[74]

X. Wang, X. Li, and D. Loguinov. 2011. Modeling residual-geometric flow sampling. In IEEE Conference on Computer Communications (INFOCOM). 1808--1816.

[75]

Steve Webb, James Caverlee, and Calton Pu. 2007. Characterizing web spam using content and HTTP session analysis. In Conference on Email and Anti-Spam (CEAS).

[76]

Steve Webb, James Caverlee, and Calton Pu. 2008. Predicting web spam with HTTP session information. In International Conference on Information and Knowledge Management (CIKM). 339--348.

Digital Library

[77]

Baoning Wu and Kumar Chellapilla. 2007. Extracting link spam using biased random walks from spam seed sets. In Workshop on Adversarial Information Retrieval on the Web. 37--44.

Digital Library

[78]

Baoning Wu and Brian Davison. 2005. Identifying link farm spam pages. In International Conference on World Wide Web (WWW).

Digital Library

[79]

Baoning Wu, Vinay Goel, and Brian Davison. 2006. Topical trustrank: Using topicality to combat web spam. In International Conference on World Wide Web (WWW). 63--72.

Digital Library

[80]

J. Wu and K. Aberer. 2004. Using Siterank for decentralized computation of web document ranking. In Adaptive Hypermedia. 265--274.

[81]

Gui-Rong Xue, Qiang Yang, Hua-Jun Zeng, Yong Yu, and Zheng Chen. 2005. Exploiting the hierarchical structure for link analysis. In ACM SIGIR Conference on Research and Development in Information Retrieval. 186--193.

Digital Library

[82]

Sandeep Yadav, Ashwath Kumar Krishna Reddy, A. L. Narasimha Reddy, and Supranamaya Ranjan. 2010. Detecting algorithmically generated malicious domain names. In ACM SIGCOMM Internet Measurement Conference (IMC). ACM, 48--61.

Digital Library

[83]

Yahoo Research Barcelona. 2008. Datasets for Research on Web Spam Detection. (Sept. 2008). Retrieved from http://barcelona.research.yahoo.net/webspam/datasets/.

[84]

Hongfei Yan, Jianyong Wang, Xiaoming Li, and Lin Guo. 2002. Architectural design and evaluation of an efficient web-crawling system. Journal of Systems and Software 60, 3 (Feb. 2002), 185--193.

Digital Library

Index Terms

Unsupervised Domain Ranking in Large-Scale Web Crawls
1. Information systems
  1. World Wide Web
    1. Web searching and information discovery
      1. Web search engines
        Page and site ranking
        Web crawling

Recommendations

The Representativeness of Automated Web Crawls as a Surrogate for Human Browsing
WWW '20: Proceedings of The Web Conference 2020

Large-scale Web crawls have emerged as the state of the art for studying characteristics of the Web. In particular, they are a core tool for online tracking research. Web crawling is an attractive approach to data collection, as crawls can be run at ...
Ranking web sites using domain ontology concepts

Many web search engines retrieve enormous amounts of irrelevant information in answer to users' queries. The semantic web provides a promising approach to improve search operation. For specific domains, ontologies can capture concepts to help machines ...
Current challenges in web crawling
ICWE'13: Proceedings of the 13th international conference on Web Engineering

Web crawling, a process of collecting web pages in an automated manner, is the primary and ubiquitous operation used by a large number of web systems and agents starting from a simple program for website backup to a major web search engine. Due to an ...

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on the Web

ACM Transactions on the Web Volume 12, Issue 4

November 2018

215 pages

ISSN:1559-1131

EISSN:1559-114X

DOI:10.1145/3281744

Editor:
Brian D. Davison
Lehigh University, USA

Issue’s Table of Contents

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 September 2018

Accepted: 01 July 2018

Revised: 01 May 2018

Received: 01 October 2017

Published in TWEB Volume 12, Issue 4

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

0
Total Citations
237
Total Downloads

Downloads (Last 12 months)10
Downloads (Last 6 weeks)0

Reflects downloads up to 01 Mar 2025

Other Metrics

View Author Metrics

Citations

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Issue’s Table of Contents