DOI: 10.1145/2783258.2783315
research-article

Set Cover at Web Scale

Published: 10 August 2015

Abstract

The classic Set Cover problem requires selecting a minimum-size subset A ⊆ F from a family F of finite subsets of U such that the elements covered by A are the ones covered by F. It occurs naturally in many settings in web search, web mining, and web advertising. The greedy algorithm, which iteratively selects a set in F that covers the most uncovered elements, yields an optimum (1 + ln |U|)-approximation but is inherently sequential. In this work we give the first MapReduce Set Cover algorithm that scales to problem sizes of ∼1 trillion elements and runs in log_p Δ iterations for a nearly optimum approximation ratio of p ln Δ, where Δ is the cardinality of the largest set in F.
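The sequential greedy rule described above, and one way a factor p can be traded for a small number of passes, can be sketched in Python. This is a single-machine illustration under assumptions, not the paper's MapReduce algorithm: the function names, the dictionary input format, and the threshold schedule Δ, Δ/p, Δ/p², ... are hypothetical choices made for the example.

```python
from typing import Dict, Hashable, List, Set, Tuple


def greedy_set_cover(family: Dict[Hashable, Set[Hashable]]) -> List[Hashable]:
    """Classic greedy: repeatedly take the set covering the most uncovered elements."""
    uncovered = set().union(*family.values())  # the elements covered by F
    chosen: List[Hashable] = []
    while uncovered:
        # Ties are broken by dictionary order (an arbitrary choice).
        best = max(family, key=lambda name: len(family[name] & uncovered))
        chosen.append(best)
        uncovered -= family[best]
    return chosen


def relaxed_greedy_set_cover(
    family: Dict[Hashable, Set[Hashable]], p: float = 2.0
) -> Tuple[List[Hashable], int]:
    """Relaxed greedy (illustrative): accept any set whose gain meets the
    current threshold, then divide the threshold by p and scan again.

    Every accepted set has a marginal gain within a factor p of the best
    available gain, and the threshold reaches 1 after about log_p(Delta) passes.
    """
    if p <= 1:
        raise ValueError("p must be greater than 1")
    uncovered = set().union(*family.values())
    chosen: List[Hashable] = []
    passes = 0
    # Delta: cardinality of the largest set in the family.
    threshold = float(max((len(s) for s in family.values()), default=0))
    while uncovered:
        passes += 1
        for name, members in family.items():
            if len(members & uncovered) >= threshold:
                chosen.append(name)
                uncovered -= members
        threshold /= p
    return chosen, passes


if __name__ == "__main__":
    # Hypothetical toy instance over U = {1, ..., 6}.
    family = {"a": {1, 2, 3, 4}, "b": {3, 4, 5}, "c": {5, 6}, "d": {1, 6}}
    print(greedy_set_cover(family))               # ['a', 'c']
    print(relaxed_greedy_set_cover(family, 2.0))  # (['a', 'c'], 2)
```

Within a pass the sketch still scans sets one at a time; deciding how many qualifying sets can be accepted simultaneously is the hard part of a parallel version and is not attempted here.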
A web crawler is a system for bulk downloading of web pages. Given a set of seed URLs, the crawler downloads and extracts the hyperlinks embedded in them and schedules the crawling of the pages addressed by those hyperlinks for a subsequent iteration. While the average page out-degree is ∼ 50, the crawled corpus grows at a much smaller rate, implying a significant outlink overlap. Using our MapReduce Set Cover heuristic as a building block, we present the first large-scale seed generation algorithm that scales to ∼ 20 billion nodes and discovers new pages at a rate ∼ 4x faster than that obtained by prior art heuristics.
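One way to read seed generation as a covering problem is to treat each candidate page as the set of URLs it links to and to pick a small number of pages whose outlinks overlap as little as possible. The Python sketch below shows that reading on a toy graph. It is an assumption-laden illustration and not the paper's seed generation algorithm: the function name `pick_seeds`, the fixed seed budget, and the page names are hypothetical.

```python
from typing import Dict, List, Set


def pick_seeds(outlinks: Dict[str, Set[str]], budget: int) -> List[str]:
    """Greedy budgeted coverage: choose up to `budget` pages, each time taking
    the page whose outlinks add the most not-yet-discovered URLs."""
    discovered: Set[str] = set()
    seeds: List[str] = []
    for _ in range(budget):
        best = max(
            outlinks,
            key=lambda url: len(outlinks[url] - discovered),
            default=None,
        )
        # Stop early if there are no pages or no page adds anything new.
        if best is None or not (outlinks[best] - discovered):
            break
        seeds.append(best)
        discovered |= outlinks[best]
    return seeds


if __name__ == "__main__":
    # Hypothetical toy graph: p1 and p2 share most of their outlinks.
    outlinks = {
        "p1": {"a", "b", "c"},
        "p2": {"b", "c", "d"},
        "p3": {"e", "f"},
        "p4": {"a", "d"},
    }
    print(pick_seeds(outlinks, 2))  # ['p1', 'p3']
```

In the toy graph, p2 has as many outlinks as p1, but after p1 is chosen it contributes only one new URL, so the second seed is p3. This is the overlap effect the paragraph above describes.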

Published In

KDD '15: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2015
2378 pages
ISBN:9781450336642
DOI:10.1145/2783258

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. map-reduce
  2. max cover
  3. set cover

Qualifiers

  • Research-article

Conference

KDD '15

Acceptance Rates

KDD '15 Paper Acceptance Rate 160 of 819 submissions, 20%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Cited By

  • (2023) Seed URL Selection and Performance Analysis in Web Crawlers: A Comprehensive Review (Web Tarayıcılarında Tohum URL Seçimi ve Performans Analizi: Kapsamlı Bir İnceleme). Düzce Üniversitesi Bilim ve Teknoloji Dergisi 11:3 (1399-1423). DOI: 10.29130/dubited.1097123. Online publication date: 31-Jul-2023
  • (2023) Set Cover in the One-pass Edge-arrival Streaming Model. Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (127-139). DOI: 10.1145/3584372.3588678. Online publication date: 18-Jun-2023
  • (2020) Continuously Tracking Core Items in Data Streams with Probabilistic Decays. 2020 IEEE 36th International Conference on Data Engineering (ICDE) (769-780). DOI: 10.1109/ICDE48307.2020.00072. Online publication date: Apr-2020
  • (2019) Enumerating Minimal Weight Set Covers. 2019 IEEE 35th International Conference on Data Engineering (ICDE) (518-529). DOI: 10.1109/ICDE.2019.00053. Online publication date: Apr-2019
  • (2018) Greedy and Local Ratio Algorithms in the MapReduce Model. Proceedings of the 30th on Symposium on Parallelism in Algorithms and Architectures (43-52). DOI: 10.1145/3210377.3210386. Online publication date: 11-Jul-2018
  • (2018) A restart local search algorithm for solving maximum set k-covering problem. Neural Computing and Applications 29:10 (755-765). DOI: 10.1007/s00521-016-2599-7. Online publication date: 1-May-2018
  • (2017) Julienne. Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures (293-304). DOI: 10.1145/3087556.3087580. Online publication date: 24-Jul-2017
  • (2016) Fast distributed submodular cover: public-private data summarization. Proceedings of the 30th International Conference on Neural Information Processing Systems (3601-3609). DOI: 10.5555/3157382.3157500. Online publication date: 5-Dec-2016
  • (2016) Connected placement of disaster shelters in modern cities. Proceedings of the Eleventh ACM Workshop on Challenged Networks (75-80). DOI: 10.1145/2979683.2979690. Online publication date: 3-Oct-2016
  • (2015) Distributed submodular cover. Proceedings of the 29th International Conference on Neural Information Processing Systems - Volume 2 (2881-2889). DOI: 10.5555/2969442.2969562. Online publication date: 7-Dec-2015
