research-article

Computing A Well-Representative Summary of Conjunctive Query Results

Authors:

Pankaj K. Agarwal,

Aryan Esmailpour,

Stavros Sintos,

Jun Yang

Proceedings of the ACM on Management of Data, Volume 2, Issue 5

Article No.: 217, Pages 1 - 27

https://doi.org/10.1145/3695835

Published: 07 November 2024

Abstract

Data summarization is a powerful approach to large-scale data analytics, with wide applications in web search, recommendation systems, approximate query processing, and more. It computes a small, compact summary that preserves vital properties of the original data. In this paper, we study the data summarization problem for conjunctive query results, i.e., computing a k-size subset of a conjunctive query output, for any given k > 0, that optimizes a certain objective. More specifically, we are interested in two commonly studied objectives: cohesion, which measures the maximum distance between a tuple in the query result and its closest tuple in the summary (k-center clustering); and diversity, which measures the pairwise distances between the summary items. A simple approach that computes the entire query output and then applies existing algorithms to the materialized tuples suffers from high computational complexity because the query output can be large: for a relational database of N tuples, the number of result tuples can be N^{O(1)}. We propose O(1)-approximation algorithms that compute well-representative summaries of size k in time O(N·k^{O(1)}), or even O(N + k^{O(1)}) in some cases, without computing all result tuples. We also propose the first efficient (2+ε)-approximation algorithm for the k-center clustering problem over relational data. Our main idea is to formulate a few oracles that give access to specific query result tuples with certain properties, to show how these oracles can be implemented efficiently, and to compute the desired summaries with few invocations of these oracles.
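To make the two objectives concrete, the "simple approach" that the abstract treats as the baseline can be sketched as follows. This is a minimal illustration under stated assumptions, not the oracle-based method proposed in the paper: it materializes a two-relation join R(a, b) ⋈ S(b, c), treats each result tuple as a point under Euclidean distance, and runs the classical farthest-point greedy heuristic (Gonzalez), which is known to give a 2-approximation for k-center. The relation schemas, the function names, and the use of the minimum pairwise distance as the diversity measure are all assumptions made for illustration.

```python
from itertools import combinations
from math import dist


def natural_join(r, s):
    """Materialize R(a, b) JOIN S(b, c) on b (hypothetical two-relation query)."""
    by_b = {}
    for b, c in s:
        by_b.setdefault(b, []).append(c)
    return [(a, b, c) for a, b in r for c in by_b.get(b, [])]


def greedy_summary(points, k):
    """Farthest-point greedy: repeatedly add the point farthest from the summary."""
    summary = [points[0]]  # arbitrary starting point
    nearest = [dist(p, summary[0]) for p in points]
    while len(summary) < min(k, len(points)):
        i = max(range(len(points)), key=nearest.__getitem__)
        summary.append(points[i])
        # Each point's distance to the summary can only shrink.
        nearest = [min(d, dist(p, points[i])) for d, p in zip(nearest, points)]
    return summary


def cohesion(points, summary):
    """Maximum distance from any result tuple to its closest summary tuple."""
    return max(min(dist(p, s) for s in summary) for p in points)


def min_pairwise(summary):
    """One possible diversity measure (assumed here): smallest pairwise distance."""
    return min(dist(p, q) for p, q in combinations(summary, 2))


R = [(0, 1), (10, 1), (0, 2)]
S = [(1, 0), (1, 10), (2, 5)]
result = natural_join(R, S)          # all join results are computed up front
summary = greedy_summary(result, 2)
print(summary, cohesion(result, summary), min_pairwise(summary))
```

The cost of this baseline is dominated by `natural_join`, whose output can be polynomially larger than the input; avoiding that materialization is exactly what the abstract's oracle-based algorithms are designed to do.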

Cited By

S. Deep, H. Zhao, A. Fan, and P. Koutris. Output-sensitive Conjunctive Query Evaluation. Proceedings of the ACM on Management of Data, 2(5):1--24, 2024. Online publication date: 7-Nov-2024.
https://dl.acm.org/doi/10.1145/3695838
A. Esmailpour and S. Sintos. Improved Approximation Algorithms for Relational Clustering. Proceedings of the ACM on Management of Data, 2(5):1--27, 2024. Online publication date: 7-Nov-2024.
https://dl.acm.org/doi/10.1145/3695831

Published In

Proceedings of the ACM on Management of Data Volume 2, Issue 5

PODS

November 2024

363 pages

EISSN:2836-6573

DOI:10.1145/3703846

Editor:
Divyakant Agrawal
UC Santa Barbara, United States

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 November 2024

Published in PACMMOD Volume 2, Issue 5

Qualifiers

Research-article

Funding Sources

US-Israel Binational Science Foundation Grant
NSF
NSERC Discovery Grant
