Fast Coreset-based Diversity Maximization under Matroid Constraints

Published: 02 February 2018

Abstract

Max-sum diversity is a fundamental primitive for web search and data mining. Given a set S of n elements, it returns a subset of k ≪ n representatives that maximizes the sum of their pairwise distances, where distance models dissimilarity. An important variant requires the subset of representatives to satisfy an additional orthogonal requirement, which can be specified as a matroid constraint (i.e., a feasible solution must be an independent set of size k). While unconstrained max-sum diversity admits efficient coreset-based strategies, the only known approaches that handle the additional matroid constraint are inherently sequential and rely on an expensive local search over the entire input set. We devise the first coreset constructions for max-sum diversity under various matroid constraints, together with efficient sequential, MapReduce, and Streaming implementations. By running the local search on the coreset rather than on the entire input, we obtain the first practical solutions for large instances. Technically, our coresets are subsets of S containing a feasible solution whose value is within a factor 1-ε of the optimal solution, for any fixed ε < 1, and, for spaces of bounded doubling dimension, they have a small size independent of n. Extensive experiments show that, compared with full-blown local search, our coreset-based approach yields solutions of comparable quality while improving the running time by up to two orders of magnitude, even for input sets of unknown dimensionality.
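
To make the "coreset first, local search second" idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the construction from the paper: the abstract does not spell out the algorithm, so the sketch assumes a partition matroid (each element has a category, and `caps` limits how many elements per category may be chosen), Euclidean distance, a farthest-first clustering into `tau` clusters, and a simple single-swap local search. All function and parameter names (`build_coreset`, `local_search`, `tau`, `caps`) are hypothetical.

```python
import itertools
import math


def diversity(points, idx):
    """Sum of pairwise Euclidean distances among the selected indices."""
    return sum(math.dist(points[i], points[j])
               for i, j in itertools.combinations(idx, 2))


def is_independent(idx, category, caps):
    """Partition-matroid test (assumed): at most caps[c] elements of category c."""
    counts = {}
    for i in idx:
        c = category[i]
        counts[c] = counts.get(c, 0) + 1
        if counts[c] > caps[c]:
            return False
    return True


def farthest_first_clusters(points, tau):
    """Pick up to tau centers greedily (farthest-first) and assign each point
    to its nearest center; returns a list of clusters of point indices."""
    centers = [0]
    nearest = [math.dist(p, points[0]) for p in points]
    assign = [0] * len(points)
    while len(centers) < min(tau, len(points)):
        far = max(range(len(points)), key=nearest.__getitem__)
        centers.append(far)
        for i, p in enumerate(points):
            d = math.dist(p, points[far])
            if d < nearest[i]:
                nearest[i], assign[i] = d, len(centers) - 1
    clusters = [[] for _ in centers]
    for i, c in enumerate(assign):
        clusters[c].append(i)
    return clusters


def build_coreset(points, category, caps, k, tau):
    """Assumed coreset rule: keep, from every cluster, a greedily grown
    independent set of at most k elements."""
    coreset = []
    for cluster in farthest_first_clusters(points, tau):
        kept = []
        for i in cluster:
            if len(kept) < k and is_independent(kept + [i], category, caps):
                kept.append(i)
        coreset.extend(kept)
    return coreset


def local_search(points, candidates, category, caps, k):
    """Single-swap local search over `candidates`, restricted to independent
    sets of size k; stops when no swap increases the diversity."""
    sol = []
    for i in candidates:  # greedy feasible starting solution
        if len(sol) < k and is_independent(sol + [i], category, caps):
            sol.append(i)
    if len(sol) < k:
        raise ValueError("no independent set of size k among the candidates")
    best = diversity(points, sol)
    improved = True
    while improved:
        improved = False
        for pos, inc in itertools.product(range(k), candidates):
            if inc in sol:
                continue
            trial = sol[:pos] + [inc] + sol[pos + 1:]
            if not is_independent(trial, category, caps):
                continue
            value = diversity(points, trial)
            if value > best + 1e-12:  # accept strictly improving swaps only
                sol, best, improved = trial, value, True
                break
    return sol, best
```

Calling `local_search(points, build_coreset(points, category, caps, k, tau), category, caps, k)` runs the search on the reduced candidate set, while passing `list(range(len(points)))` as candidates gives the full-input baseline that the abstract describes as expensive. The choice of `tau` trades coreset size against solution quality; the sketch makes no claim about the approximation guarantee stated in the abstract.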

Information

Published In

WSDM '18: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining
February 2018
821 pages
ISBN: 9781450355810
DOI: 10.1145/3159652
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. approximation algorithms
  2. coresets
  3. diversity maximization
  4. doubling dimension
  5. doubling spaces
  6. mapreduce
  7. matroids
  8. streaming

Qualifiers

  • Research-article

Funding Sources

  • Università degli Studi di Padova

Conference

WSDM 2018

Acceptance Rates

WSDM '18 Paper Acceptance Rate: 81 of 514 submissions, 16%
Overall Acceptance Rate: 498 of 2,863 submissions, 17%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 32
  • Downloads (Last 6 weeks): 4
Reflects downloads up to 27 Feb 2025

Citations

Cited By

  • (2024) Faster Algorithms for Fair Max-Min Diversification in R^d. Proceedings of the ACM on Management of Data 2:3 (1-26). DOI: 10.1145/3654940. Online publication date: 30-May-2024
  • (2024) Max-Min Diversification with Asymmetric Distances. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (1440-1450). DOI: 10.1145/3637528.3671757. Online publication date: 25-Aug-2024
  • (2024) MapReduce Algorithms for Robust Center-Based Clustering in Doubling Metrics. Journal of Parallel and Distributed Computing (104966). DOI: 10.1016/j.jpdc.2024.104966. Online publication date: Aug-2024
  • (2023) Core-sets for fair and diverse data summarization. Proceedings of the 37th International Conference on Neural Information Processing Systems (78987-79011). DOI: 10.5555/3666122.3669576. Online publication date: 10-Dec-2023
  • (2023) Fair Max–Min Diversity Maximization in Streaming and Sliding-Window Models. Entropy 25:7 (1066). DOI: 10.3390/e25071066. Online publication date: 14-Jul-2023
  • (2023) Scalable and space-efficient Robust Matroid Center algorithms. Journal of Big Data 10:1. DOI: 10.1186/s40537-023-00717-4. Online publication date: 17-Apr-2023
  • (2023) Distributed k-Means with Outliers in General Metrics. Euro-Par 2023: Parallel Processing (474-488). DOI: 10.1007/978-3-031-39698-4_32. Online publication date: 24-Aug-2023
  • (2023) Fully Dynamic Clustering and Diversity Maximization in Doubling Metrics. Algorithms and Data Structures (620-636). DOI: 10.1007/978-3-031-38906-1_41. Online publication date: 28-Jul-2023
  • (2022) Streaming Algorithms for Diversity Maximization with Fairness Constraints. 2022 IEEE 38th International Conference on Data Engineering (ICDE) (41-53). DOI: 10.1109/ICDE53745.2022.00008. Online publication date: May-2022
  • (2022) Adaptive k-center and diameter estimation in sliding windows. International Journal of Data Science and Analytics 14:2 (155-173). DOI: 10.1007/s41060-022-00318-z. Online publication date: 2-Apr-2022
