DOI: 10.1145/1871437.1871501

Set cover algorithms for very large datasets

Published: 26 October 2010

Abstract

The Set Cover problem, which asks for the smallest subcollection of sets that covers some universe, is at the heart of many data and analysis tasks. It arises in a wide range of settings, including operations research, machine learning, planning, data quality, and data mining. Although finding an optimal solution is NP-hard, the greedy algorithm is widely used and typically finds solutions that are close to optimal.
However, a direct implementation of the greedy approach, which at each step picks the set containing the largest number of uncovered items, does not behave well when the input is very large and disk-resident. It must make many random accesses to disk, which are unpredictable and costly compared with linear scans. To scale Set Cover to large datasets, we provide a new algorithm that finds a solution provably close to that of greedy but is much more efficient to implement on modern disk technology. Our experiments show a ten-fold improvement in speed on moderately sized datasets, and an even greater improvement on larger datasets.
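
For orientation, the greedy step described in the abstract can be sketched in a few lines of Python. This is a minimal in-memory illustration of the standard greedy heuristic only, not the disk-friendly algorithm the paper proposes. The function name `greedy_set_cover` and the example sets `A` to `D` are hypothetical and do not come from the paper.

```python
from typing import Dict, Hashable, Iterable, List, Set


def greedy_set_cover(
    universe: Iterable[Hashable],
    sets: Dict[str, Set[Hashable]],
) -> List[str]:
    """Return names of chosen sets, picked greedily until the universe is covered.

    Illustrative sketch: everything is held in memory, and every step
    rescans all candidate sets to find the one with the most uncovered items.
    """
    uncovered = set(universe)
    chosen: List[str] = []
    while uncovered:
        # Pick the set covering the largest number of still-uncovered items.
        best = max(sets, key=lambda name: len(sets[name] & uncovered), default=None)
        if best is None or not (sets[best] & uncovered):
            raise ValueError("the given sets do not cover the universe")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen


# Hypothetical toy instance over the universe {1, ..., 6}.
example_sets = {
    "A": {1, 2, 3, 4},
    "B": {1, 2, 5},
    "C": {3, 4, 6},
    "D": {5, 6},
}
cover = greedy_set_cover(range(1, 7), example_sets)
print(cover)  # ['A', 'D']
```

Each iteration recomputes how many uncovered items every candidate set still contributes. When the sets live on disk, that bookkeeping is what turns into the scattered random accesses the abstract identifies as the bottleneck.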


Published In

CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management
October 2010
2036 pages
ISBN: 9781450300995
DOI: 10.1145/1871437
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. disk friendly
  2. greedy heuristic
  3. set cover

Qualifiers

  • Research-article

Conference

CIKM '10

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
