skip to main content
research-article

Unsupervised Domain Ranking in Large-Scale Web Crawls

Published: 27 September 2018 Publication History

Abstract

With the proliferation of web spam and infinite autogenerated web content, large-scale web crawlers require low-complexity ranking methods to effectively budget their limited resources and allocate bandwidth to reputable sites. In this work, we assume crawls that produce frontiers orders of magnitude larger than RAM, where sorting of pending URLs is infeasible in real time. Under these constraints, the main objective is to quickly compute domain budgets and decide which of them can be massively crawled. Those ranked at the top of the list receive aggressive crawling allowances, while all other domains are visited at some small default rate. To shed light on Internet-wide spam avoidance, we study topology-based ranking algorithms on domain-level graphs from the two largest academic crawls: a 6.3B-page IRLbot dataset and a 1B-page ClueWeb09 exploration. We first propose a new methodology for comparing the various rankings and then show that in-degree BFS-based techniques decisively outperform classic PageRank-style methods, including TrustRank. However, since BFS requires several orders of magnitude higher overhead and is generally infeasible for real-time use, we propose a fast, accurate, and scalable estimation method called TSE that can achieve much better crawl prioritization in practice. It is especially beneficial in applications with limited hardware resources.

References

[1]
Jacob Abernethy, Olivier Chapelle, and Carlos Castillo. 2008. Web spam identification through content and hyperlinks. In Workshop on Adversarial Information Retrieval on the Web. 41--44.
[2]
Serge Abiteboul, Mihai Preda, and Gregory Cobena. 2003. Adaptive on-line page importance computation. In International Conference on World Wide Web (WWW). 280--290.
[3]
Sarker Tanzir Ahmed and Dmitri Loguinov. 2015. Around the web in six weeks: Documenting a large-scale crawl. In IEEE Conference on Computer Communications (INFOCOM). 1598--1606.
[4]
alexa.com. Retrieved in 2015 from http://www.alexa.com/.
[5]
Jesse Alpert and Nissan Hajaj. 2008. We Knew the Web Was Big... (July 2008). Retrieved from http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html.
[6]
Shatlyk Ashyralyyev, B. Barla Cambazoglu, and Cevdet Aykanat. 2013. Incorporating the surfing behavior of web users into pagerank. In ACM International Conference on Information and Knowledge Management (CIKM). 2351--2356.
[7]
Ricardo Baeza-Yates, Carlos Castillo, Mauricio Marin, and Andrea Rodriguez. 2005. Crawling a country: Better strategies than breadth-first for web page ordering. In International Conference on World Wide Web (WWW).
[8]
Xiao Bai, B. Barla Cambazoglu, and Flavio P. Junqueira. 2011. Discovering URLs through user feedback. In ACM International Conference on Information and Knowledge Management (CIKM). 77--86.
[9]
Luca Becchetti, Carlos Castillo, Debora Donato, Stefano Leonardi, and Ricardo Baeza-Yates. 2006. Link-based characterization and detection of web spam. In Workshop on Adversarial Information Retrieval on the Web.
[10]
Luca Becchetti, Carlos Castillo, Debora Donato, Stefano Leonardi, and Ricardo Baeza-Yates. 2006. Using rank propagation and probabilistic counting for link-based spam detection. In Workshop on Web Mining and Web Usage Analysis (WebKDD).
[11]
Andras Benczur, Karoly Csalogany, and Tamas Sarlos. 2006. Link-based similarity search to fight web spam. In Workshop on Adversarial Information Retrieval on the Web.
[12]
Andras A. Benczur, Karoly Csalogany, Tamas Sarlos, and Mate Uher. 2005. Spamrank -- Fully automatic link spam detection. In Workshop on Adversarial Information Retrieval on the Web.
[13]
Paolo Boldi, Bruno Codenotti, Massimo Santini, and Sebastiano Vigna. 2004. UbiCrawler: A scalable fully distributed web crawler. Software: Practice 8 Experience 34, 8 (July 2004), 711--726.
[14]
Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual web search engine. In International Conference on World Wide Web (WWW). 107--117.
[15]
Carlos Castillo, Mauricio Marin, Andrea Rodriguez, and Ricardo Baeza-Yates. 2004. Scheduling algorithms for web crawling. In Latin American Web Conference.
[16]
James Caverlee and Ling Liu. 2007. Countering web spam with credibility-based link analysis. In ACM Symposium on Principles of Distributed Computing (PODC). 157--166.
[17]
James Caverlee, Steve Webb, and Ling Liu. 2007. Spam-resilient web rankings via influence throttling. In International Parallel and Distributed Processing Symposium (IPDPS).
[18]
Anne Chao and Shen-Ming Lee. 1992. Estimating the number of classes via sample coverage. Journal of the American Statistical Association 87, 417 (March 1992), 210--217.
[19]
Yen-Yu Chen, Qingqing Gan, and Torsten Suel. 2002. I/O-efficient techniques for computing pagerank. In International Conference on Information and Knowledge Management (CIKM).
[20]
Yen-Yu Chen, Qingqing Gan, and Torsten Suel. 2004. Local methods for estimating pagerank values. In International Conference on Information and Knowledge Management (CIKM).
[21]
Junghoo Cho, Hector Garcia-Molina, Taher Haveliwala, Wang Lam, Andreas Paepcke, and Sriram Raghavan Gary Wesley. 2006. Stanford webbase components and applications. ACM Transactions on Internet Technology 6, 2 (May 2006), 153--186.
[22]
Junghoo Cho, Hector Garcia-Molina, and Lawrence Page. 1998. Efficient crawling through URL ordering. In International Conference on World Wide Web (WWW). 161--172.
[23]
Young-Joo Chung, Masashi Toyoda, and Masaru Kitsuregawa. 2009. A study of link farm distribution and evolution using a time series of web snapshots. In Workshop on Adversarial Information Retrieval on the Web. 9--16.
[24]
ClueWeb09 Dataset. Retrieved in 2015 from http://www.lemurproject.org/clueweb09/.
[25]
CLuE cluster. Retrieved from http://www.nsf.gov/cise/clue/index.jsp.
[26]
ClueWeb09 documentation. Retrieved in 2015 from http://boston.lti.cs.cmu.edu/Data/web08-bst/planning.html.
[27]
Isabel Drost and Tobias Scheffer. 2005. Thwarting the nigritude ultramaring: Learning to identify link spam. In European Conference on Machine Learning (ECML).
[28]
N. Duffield, C. Lund, and M. Thorup. 2003. Estimating flow distributions from sampled flow statistics. In ACM SIGCOMM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications. 325--336.
[29]
Jenny Edwards, Kevin McCurley, and John Tomlin. 2001. An adaptive model for optimizing performance of an incremental web crawler. In International Conference on World Wide Web (WWW). 106--113.
[30]
David Eichmann. 1994. The RBSE spider -- Balancing effective search against web load. In International Conference on World Wide Web (WWW).
[31]
Nadav Eiron, Kevin S. McCurley, and John A. Tomlin. 2004. Ranking the web frontier. In International Conference on World Wide Web (WWW). 309--318.
[32]
Guang Feng, Tie-Yan Liu, Ying Wang, Ying Bao, Zhiming Ma, Xu-Dong Zhang, and Wei-Ying Ma. 2006. AggregateRank: Bringing order to web sites. In ACM SIGIR Conference on Research and Development in Information Retrieval. 75--82.
[33]
Dennis Fetterly, Mark Manasse, and Marc Najork. 2004. Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In International Workshop on the Web and Databases (WebDB). 1--6.
[34]
Google Toolbar. Retrieved in 2008 from http://toolbar.google.com.
[35]
IRLbot PLD graph. Retrieved in 2018 from http://irl.cs.tamu.edu/projects/web/.
[36]
D. Gruhl, L. Chavet, D. Gibson, J. Meyer, P. Pattanayak, A. Tomkins, and J. Zien. 2004. How to build a webfountain: An architecture for very large-scale text analytics. IBM Systems Journal 43, 1 (2004), 64--77.
[37]
Zoltán Gyöngyi, Pavel Berkhin, Hector Garcia-Molina, and Jan Pedersen. 2006. Link spam detection based on mass estimation. In International Conference on Very Large Data Bases (VLDB). 439--450.
[38]
Zoltán Gyöngyi and Hector Garcia-Molina. 2005. Link spam alliances. In International Conference on Very Large Data Bases (VLDB). 517--528.
[39]
Zoltán Gyöngyi and Hector Garcia-Molina. 2005. Web spam taxonomy. In Workshop on Adversarial Information Retrieval on the Web. 39--47.
[40]
Zoltán Gyöngyi, Hector Garcia-Molina, and Jan Pedersen. 2004. Combating web spam with trustRank. In International Conference on Very Large Data Bases (VLDB). 576--587.
[41]
Younes Hafri and Chabane Djeraba. 2004. High performance crawling system. In ACM International Workshop on Multimedia Information Retrieval (MIR). 299--306.
[42]
Allan Heydon and Marc Najork. 1999. Mercator: A scalable, extensible web crawler. World Wide Web 2, 4 (Dec. 1999), 219--229.
[43]
Jun Hirai, Sriram Raghavan, Hector Garcia-Molina, and Andreas Paepcke. 2000. WebBase: A repository of web pages. In International Conference on World Wide Web (WWW). 277--293.
[44]
Internet Archive. Retrieved in 2018 from http://archive.org/.
[45]
Melody Ivory and Marti Hearst. 2002. Statistical profiles of highly-rated web sites. In Conference on Human Factors in Computing Systems (CHI). 367--374.
[46]
Kasom Koht-arsa and Surasak Sanguanpong. 2002. High performance large scale web spider architecture. In IEEE International Symposium on Communications and Information Technologies (ISCIT).
[47]
Laboratory for Web Algorithmics. Retrieved in 2018 from http://law.dsi.unimi.it/.
[48]
Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov. 2009. IRLbot: Scaling to 6 billion pages and beyond. ACM Transactions on the Web 3, 3 (June 2009), 1--34.
[49]
Ronny Lempel and Shlomo Moran. 2001. SALSA: The stochastic approach for link-structure analysis. ACM Transactions on Information Systems 19, 2 (Apr. 2001), 131--160.
[50]
Yanhong Li. 1998. Toward a qualitative search engine. IEEE Internet Computing 2, 4 (July 1998), 24--29.
[51]
Yuting Liu, Bin Gao, Tie-Yan Liu, Ying Zhang, Zhiming Ma, Shuyuan He, and Hang Li. 2008. BrowseRank: Letting web users vote for page importance. In ACM SIGIR Conference on Research and Development in Information Retrieval. 451--458.
[52]
Oliver A. McBryan. 1994. GENVL and WWWW: Tools for taming the web. In International Conference on World Wide Web (WWW).
[53]
Gilad Mishne, David Carmel, and Ronny Lempel. 2005. Blocking blog spam with language model disagreement. In Workshop on Adversarial Information Retrieval on the Web.
[54]
Mozilla. 2013. Public Suffix List. (May 2013). Retrieved from http://publicsuffix.org/.
[55]
Marc Najork and Allan Heydon. 2001. High-Performance Web Crawling. Technical Report 173. Compaq SRC. Retrieved from http://www.hpl.hp.com/techreports/Compaq-DEC/SRC-RR-173.pdf.
[56]
Marc Najork and Janet L. Wiener. 2001. Breadth-first search crawling yields high-quality pages. In International Conference on World Wide Web (WWW). 114--118.
[57]
Zaiqing Nie, Yuanzhi Zhang, Ji-Rong Wen, and Wei-Ying Ma. 2005. Object-level ranking: Bringing order to web objects. In International Conference on World Wide Web (WWW).
[58]
Alexandros Ntoulas, Marc Najork, Mark Manasse, and Dennis Fetterly. 2006. Detecting spam web pages through content analysis. In International Conference on World Wide Web (WWW). 83--92.
[59]
Nutch. Retrieved in 2018 from http://nutch.apache.org/.
[60]
Jothi Padmanabhan. 2010. Introduction to Grid Computing via Map-Reduce 8 Hadoop. (Dec. 2010). Retrieved from http://internationalnetworking.iu.edu/sites/internationalnetworking.iu.edu/files/indo-us-preso.pdf.
[61]
Brian Pinkerton. 1994. Finding what people want: Experiences with the web crawler. In International Conference on World Wide Web (WWW).
[62]
Brian Pinkerton. 2000. WebCrawler: Finding What People Want. Ph.D. Dissertation. University of Washington.
[63]
Jakub Piskorski, Marcin Sydow, and Dawid Weiss. 2008. Exploring linguistic features for web spam detection: A preliminary study. In Workshop on Adversarial Information Retrieval on the Web. 25--28.
[64]
Matthew Richardson, Amit Prakash, and Eric Brill. 2006. Beyond Pagerank: Machine learning for static ranking. In International Conference on World Wide Web (WWW). 707--715.
[65]
Hiroo Saito, Masashi Toyoda, Masaru Kitsuregawa, and Kazuyuki Aihara. 2007. A large-scale study of link spam detection by graph algorithms. In Workshop on Adversarial Information Retrieval on the Web. 45--48.
[66]
Vladislav Shkapenyuk and Torsten Suel. 2002. Design and implementation of a high-performance distributed web crawler. In IEEE International Conference on Data Engineering (ICDE). 357--368.
[67]
Amit Signhal. 2012. Breakfast with Google’s Search Team. (Aug. 2012). Retrieved from https://www.youtube.com/watch?v=8a2VmxqFg8A.
[68]
Aameek Singh, Mudhakar Srivatsa, Ling Liu, and Todd Miller. 2003. Apoidea: A decentralized peer-to-peer architecture for crawling the world wide web. In SIGIR Workshop on Distributed Information Retrieval. 126--142.
[69]
William Stearns. 2008. SpamAssassin Domain Blacklist. (Aug. 2008). Retrieved from http://www.stearns.org/sa-blacklist/.
[70]
T. Suel, C. Mathur, J. Wu, J. Zhang, A. Delis, M. Kharrazi, X. Long, and K. Shanmugasundaram. 2003. ODISSEA: A peer-to-peer architecture for scalable web search and information retrieval. In International Workshop on the Web and Databases (WebDB). 67--72.
[71]
The Stanford WebBase Project. 2008. WebBase Archive. (Sept. 2008). Retrieved from http://dbpubs.stanford.edu:8091/∼testbed/doc2/WebBase/.
[72]
John Tomlin. 2003. A new paradigm for ranking pages on the world wide web. In International Conference on World Wide Web (WWW).
[73]
Trystan Upstill, Nick Craswell, and David Hawking. 2003. Predicting fame and fortune: PageRank or indegree. In 8th Australasian Document Computing Symposium.
[74]
X. Wang, X. Li, and D. Loguinov. 2011. Modeling residual-geometric flow sampling. In IEEE Conference on Computer Communications (INFOCOM). 1808--1816.
[75]
Steve Webb, James Caverlee, and Calton Pu. 2007. Characterizing web spam using content and HTTP session analysis. In Conference on Email and Anti-Spam (CEAS).
[76]
Steve Webb, James Caverlee, and Calton Pu. 2008. Predicting web spam with HTTP session information. In International Conference on Information and Knowledge Management (CIKM). 339--348.
[77]
Baoning Wu and Kumar Chellapilla. 2007. Extracting link spam using biased random walks from spam seed sets. In Workshop on Adversarial Information Retrieval on the Web. 37--44.
[78]
Baoning Wu and Brian Davison. 2005. Identifying link farm spam pages. In International Conference on World Wide Web (WWW).
[79]
Baoning Wu, Vinay Goel, and Brian Davison. 2006. Topical trustrank: Using topicality to combat web spam. In International Conference on World Wide Web (WWW). 63--72.
[80]
J. Wu and K. Aberer. 2004. Using Siterank for decentralized computation of web document ranking. In Adaptive Hypermedia. 265--274.
[81]
Gui-Rong Xue, Qiang Yang, Hua-Jun Zeng, Yong Yu, and Zheng Chen. 2005. Exploiting the hierarchical structure for link analysis. In ACM SIGIR Conference on Research and Development in Information Retrieval. 186--193.
[82]
Sandeep Yadav, Ashwath Kumar Krishna Reddy, A. L. Narasimha Reddy, and Supranamaya Ranjan. 2010. Detecting algorithmically generated malicious domain names. In ACM SIGCOMM Internet Measurement Conference (IMC). ACM, 48--61.
[83]
Yahoo Research Barcelona. 2008. Datasets for Research on Web Spam Detection. (Sept. 2008). Retrieved from http://barcelona.research.yahoo.net/webspam/datasets/.
[84]
Hongfei Yan, Jianyong Wang, Xiaoming Li, and Lin Guo. 2002. Architectural design and evaluation of an efficient web-crawling system. Journal of Systems and Software 60, 3 (Feb. 2002), 185--193.

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on the Web
ACM Transactions on the Web  Volume 12, Issue 4
November 2018
215 pages
ISSN:1559-1131
EISSN:1559-114X
DOI:10.1145/3281744
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 September 2018
Accepted: 01 July 2018
Revised: 01 May 2018
Received: 01 October 2017
Published in TWEB Volume 12, Issue 4

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Web crawling
  2. frontier prioritization
  3. ranking

Qualifiers

  • Research-article
  • Research
  • Refereed

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • 0
    Total Citations
  • 237
    Total Downloads
  • Downloads (Last 12 months)10
  • Downloads (Last 6 weeks)0
Reflects downloads up to 01 Mar 2025

Other Metrics

Citations

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media