skip to main content
research-article

Fast Greedy Algorithms in MapReduce and Streaming

Published:08 September 2015Publication History
Skip Abstract Section

Abstract

Greedy algorithms are practitioners’ best friends—they are intuitive, are simple to implement, and often lead to very good solutions. However, implementing greedy algorithms in a distributed setting is challenging since the greedy choice is inherently sequential, and it is not clear how to take advantage of the extra processing power.

Our main result is a powerful sampling technique that aids in parallelization of sequential algorithms. Armed with this primitive, we then adapt a broad class of greedy algorithms to the MapReduce paradigm; this class includes maximum cover and submodular maximization subject to p-system constraint problems. Our method yields efficient algorithms that run in a logarithmic number of rounds while obtaining solutions that are arbitrarily close to those produced by the standard sequential greedy algorithm. We begin with algorithms for modular maximization subject to a matroid constraint and then extend this approach to obtain approximation algorithms for submodular maximization subject to knapsack or p-system constraints.

References

  1. R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong. 2009. Diversifying search results. In Proceedings of the 2nd ACM International Conference on Web Search and Data Mining (WSDM’09). ACM, New York, NY, 5--14. DOI:http://dx.doi.org/10.1145/1498759.1498766 Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. A. Arasu, S. Chaudhuri, and R. Kaushik. 2009. Learning string transformations from examples. PVLDB 2, 1 (2009), 514--525. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. M. Babaioff, N. Immorlica, and R. Kleinberg. 2007. Matroids, secretary problems, and online mechanisms. In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’07), New Orleans, LA, January 7-9, 2007. 434--443. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. B. Bahmani, R. Kumar, and S. Vassilvitskii. 2012a. Densest subgraph in streaming and Map-Reduce. PVLDB 5, 1 (2012), 454--465. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. B. Bahmani, B. Moseley, A. Vattani, R. Kumar, and S. Vassilvitskii. 2012b. Scalable K-Means++. PVLDB 5, 7 (2012), 622--633. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Z. Bar-Yossef, T. S. Jayram, R. Kumar, and D. Sivakumar. 2004. An information statistics approach to data stream and communication complexity. J. Comput. Syst. Sci. 68, 4 (2004), 702--732. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. M. H. Bateni, M. T. Hajiaghayi, and M. Zadimoghaddam. 2013. Submodular secretary problem and extensions. ACM Trans. Algorithms 9, 4 (2013), 32. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. B. Berger, J. Rompel, and P. W. Shor. 1994. Efficient NC algorithms for set cover with applications to learning and geometry. J. Comput. Syst. Sci. 49, 3 (1994), 454--477. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. G. E. Blelloch, R. Peng, and K. Tangwongsan. 2011. Linear-work greedy parallel approximate set cover and variants. In Proceedings of the 23rd Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA’11), San Jose, CA, June 4-6, 2011 (Co-located with FCRC 2011). 23--32. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. G. Călinescu, C. Chekuri, M. Pál, and J. Vondrák. 2011. Maximizing a monotone submodular function subject to a matroid constraint. SIAM J. Comput. 40, 6 (2011), 1740--1766. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. G. Capannini, F. M. Nardini, R. Perego, and F. Silvestri. 2011. Efficient diversification of web search results. PVLDB 4, 7 (2011), 451--459. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. A. Chakrabarti, S. Khot, and X. Sun. 2003. Near-optimal lower bounds on the multi-party communication complexity of set disjointness. In 18th Annual IEEE Conference on Computational Complexity (Complexity 2003), Aarhus, Denmark, July 7-10, 2003. 107--117.Google ScholarGoogle Scholar
  13. M. Charikar. 2000. Greedy approximation algorithms for finding dense components in a graph. In Proceedings of the 3rd International Workshop on Approximation Algorithms for Combinatorial Optimization (APPROX’00), Saarbrücken, Germany, September 5-8, 2000. 84--95. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. W. Chen, Y. Wang, and S. Yang. 2009. Efficient influence maximization in social networks. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, June 28 - July 1, 2009. 199--208. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. F. Chierichetti, R. Kumar, and A. Tomkins. 2010. Max-cover in map-reduce. In Proceedings of the 19th International Conference on World Wide Web (WWW’10), Raleigh, NC, April 26-30, 2010. 231--240. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. G. Cormode, H. J. Karloff, and A. Wirth. 2010. Set cover algorithms for very large datasets. In Proceedings of the 19th ACM Conference on Information and Knowledge Management (CIKM’10), Toronto, Ontario, Canada, October 26-30, 2010. 479--488. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. B. C. Dean, M. X. Goemans, and J. Vondrák. 2005. Adaptivity and approximation for stochastic packing problems. In Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’05), Vancouver, British Columbia, Canada, January 23-25, 2005. 395--404. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. J. Dean and S. Ghemawat. 2004. MapReduce: Simplified data processing on large clusters. In 6th Symposium on Operating System Design and Implementation (OSDI’04), San Francisco, CA, December 6-8, 2004. 137--150. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. E. W. Dijkstra. 1959. A note on two problems in connexion with graphs. Numer. Math. 1, 1 (1959), 269--271. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. J. Edmonds. 1971. Matroids and the greedy algorithm. Math. Program. 1 (1971), 126--136.Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. A. Ene, S. Im, and B. Moseley. 2011. Fast clustering using MapReduce. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, August 21-24, 2011. 681--689. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. U. Feige. 1998. A threshold of Ln N for approximating set cover. J. ACM 45, 4 (July 1998), 634--652. DOI:http://dx.doi.org/10.1145/285055.285059 Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. M. L. Fisher, G. L. Nemhauser, and L. A. Wolsey. 1978. An analysis of approximation for maximizing submodular set functions II. Math. Prog. Study 8 (1978), 73--87.Google ScholarGoogle ScholarCross RefCross Ref
  24. L. R. Ford and D. R. Fulkerson. 1956. Maximal flow through a network. Can. J. Math. 8 (1956), 399--404.Google ScholarGoogle ScholarCross RefCross Ref
  25. D. Gale and L. S. Shapley. 1962. College admissions and the stability of marriage. Am. Math. Monthly 69, 1 (1962), 9--15. http://www.jstor.org/stable/2312726.Google ScholarGoogle ScholarCross RefCross Ref
  26. A. Goel, S. Guha, and K. Munagala. 2006. Asking the right questions: Model-driven optimization using probes. In Proceedings of the 25th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Chicago, IL, June 26-28, 2006. 203--212. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. M. T. Goodrich, N. Sitchinava, and Q. Zhang. 2011. Sorting, searching, and simulation in the MapReduce framework. In Proceedings of the 22nd International Symposium on Algorithms and Computation (ISAAC’11), Yokohama, Japan, December 5-8, 2011. 374--383. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. A. Goyal, F. Bonchi, and L. V. S. Lakshmana. 2012. A data-based approach to social influence maximization. PVLDB 5, 1 (2012). Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. A. Gupta, A. Roth, G. Schoenebeck, and K. Talwar. 2010. Constrained non-monotone submodular maximization: Offline and secretary algorithms. In Proceedings of the 6th International Workshop on Internet and Network Economics (WINE’10), Stanford, CA, December 13-17, 2010. 246--257. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. E. Hazan, S. Safra, and O. Schwartz. 2006. On the complexity of approximating k-set packing. Comput. Complexity 15, 1 (2006), 20--39. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. D. A. Huffman. 1952. A method for the construction of minimum-redundancy codes. Proceedings of the Institute of Radio Engineers 40, 9 (1952), 1098--1101.Google ScholarGoogle Scholar
  32. T. A. Jenkyns. 1976. The efficacy of the “greedy” algorithm. In Proceedings of 7th South Eastern Conference on Combinatorics, Graph Theory and Computing. 341--350.Google ScholarGoogle Scholar
  33. H. J. Karloff, S. Suri, and S. Vassilvitskii. 2010. A model of computation for MapReduce. In Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’10), Austin, Texas, January 17-19, 2010. 938--948. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. D. Kempe, J. M. Kleinberg, and É. Tardos. 2003. Maximizing the spread of influence through a social network. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, August 24-27, 2003. 137--146. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. B. Korte and D. Hausmann. 1978. An analysis of the greedy heuristic for independence systems. Ann. Discrete Math. 2 (1978), 65--74.Google ScholarGoogle ScholarCross RefCross Ref
  36. R. Kumar, B. Moseley, S. Vassilvitskii, and A. Vattani. 2013. Fast greedy algorithms in mapreduce and streaming. In 25th ACM Symposium on Parallelism in Algorithms and Architectures, (SPAA’13), Montreal, QC, Canada, July 23-25, 2013. 1--10. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. S. Lattanzi, B. Moseley, S. Suri, and S. Vassilvitskii. 2011. Filtering: A method for solving graph problems in MapReduce. In Proceedings of the 23rd Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA’11), San Jose, CA, June 4-6, 2011 (Co-located with FCRC’11). 85--94. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. M. VanBriesen, and N. S. Glance. 2007. Cost-effective outbreak detection in networks. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, California, August 12-15, 2007. 420--429. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. B. Mirzasoleiman, A. Karbasi, R. Sarkar, and A. Krause. 2013. Distributed submodular maximization: Identifying representative elements in massive data. In Neural Information Processing Systems (NIPS).Google ScholarGoogle Scholar
  40. S. Muthukrishnan. 2005. Data streams: Algorithms and applications. Foundations Trends Theor. Comput. Sci. 1, 2 (2005). Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. 1978. An analysis of approximation for maximizing submodular set functions I. Math. Prog. 14 (1978), 265--294.Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. A. Pietracaprina, G. Pucci, M. Riondato, F. Silvestri, and E. Upfal. 2012. Space-round tradeoffs for MapReduce computations. In International Conference on Supercomputing (ICS’12), Venice, Italy, June 25-29, 2012. 235--244. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. L. Rokach and O. Maimon. 2005. Top-down induction of decision trees classifiers - a survey. IEEE Tran. Syst. Man Cybernetics Part C 35, 4 (2005), 476--487. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. B. Saha and L. Getoor. 2009. On maximum coverage in the streaming model & application to multi-topic blog-watch. In Proceedings of the SIAM International Conference on Data Mining (SDM’’09), Sparks, Nevada, April 30-May 2, 2009. 697--708.Google ScholarGoogle Scholar
  45. A. D. Sarma, A. Lall, D. Nanongkai, R. J. Lipton, and J. (Jim) Xu. 2011. Representative skylines using threshold-based preference distributions. In Proceedings of the 27th International Conference on Data Engineering (ICDE’11), Hannover, Germany, April 11-16, 2011. 387--398. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. A. Schrijver. 2002. Combinatorial Optimization. Vol. B. Springer.Google ScholarGoogle Scholar
  47. L. G. Valiant. 1990. A bridging model for parallel computation. Commun. ACM 33, 8 (1990), 103--111. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. J. Vondrak. 2007. Submodularity in Combinatorial Optimization. Ph.D. Dissertation. Charles University, Prague.Google ScholarGoogle Scholar
  49. C. Wang, W. Chen, and Y. Wang. 2012. Scalable influence maximization for independent cascade model in large-scale social networks. Data Min. Knowl. Discov. 25, 3 (2012), 545--576.Google ScholarGoogle ScholarCross RefCross Ref

Index Terms

  1. Fast Greedy Algorithms in MapReduce and Streaming

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in

    Full Access

    • Published in

      cover image ACM Transactions on Parallel Computing
      ACM Transactions on Parallel Computing  Volume 2, Issue 3
      Special Issue for SPAA 2013
      October 2015
      196 pages
      ISSN:2329-4949
      EISSN:2329-4957
      DOI:10.1145/2821462
      Issue’s Table of Contents

      Copyright © 2015 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 8 September 2015
      • Accepted: 1 May 2015
      • Revised: 1 April 2015
      • Received: 1 November 2013
      Published in topc Volume 2, Issue 3

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article
      • Research
      • Refereed

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader