Computing A Well-Representative Summary of Conjunctive Query Results

Published: 07 November 2024 Publication History

Abstract

Data summarization is a powerful approach to large-scale data analytics, with wide applications in web search, recommendation systems, approximate query processing, etc. It computes a small, compact summary that preserves vital properties of the original data. In this paper, we study the data summarization problem for conjunctive query results, i.e., computing a k-size subset of a conjunctive query output, for any given k > 0, that optimizes a certain objective. More specifically, we are interested in two commonly studied objectives: cohesion, which measures the maximum distance between a tuple in the query result and its closest tuple in the summary (k-center clustering); and diversity, which measures the pairwise distances between the summary items. A simple approach that computes the entire query output and then applies existing algorithms on top of the materialized tuples suffers from high computational complexity because the query output can be large, e.g., for a relational database of N tuples, the number of result tuples can be N^{O(1)}. We propose O(1)-approximation algorithms that compute well-representative summaries of size k in time O(N · k^{O(1)}), or even O(N + k^{O(1)}) in some cases, without computing all result tuples. We also propose the first efficient (2+ε)-approximation algorithm for the k-center clustering problem over relational data. Our main idea is to formulate a few oracles that enable us to access specific query result tuples with certain properties, to show how these oracles can be implemented efficiently, and to compute the desired summaries with few invocations of these oracles.
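To make the cost of the simple approach concrete, the following is a minimal sketch of that baseline only, not of the paper's oracle-based algorithms. It assumes a two-relation join R(a, b) ⋈ S(b, c) with numeric attributes and Euclidean distance, materializes every result tuple, and then runs the standard greedy farthest-first traversal, a well-known 2-approximation for k-center in metric spaces. The names `natural_join` and `greedy_k_center` and the toy relations are hypothetical and chosen for illustration.

```python
from math import dist


def natural_join(r, s):
    """Join R(a, b) with S(b, c) on b, materializing every result tuple (a, b, c)."""
    by_b = {}
    for b, c in s:
        by_b.setdefault(b, []).append(c)
    return [(a, b, c) for a, b in r for c in by_b.get(b, [])]


def greedy_k_center(points, k):
    """Farthest-first traversal over materialized points.

    Returns the chosen centers and the resulting cohesion value, i.e. the
    maximum distance from any point to its closest center.
    """
    if not points or k <= 0:
        return [], 0.0
    centers = [points[0]]  # arbitrary first center
    # d[i] = distance from points[i] to its closest chosen center
    d = [dist(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        i = max(range(len(points)), key=d.__getitem__)  # farthest point so far
        centers.append(points[i])
        d = [min(d[j], dist(points[j], points[i])) for j in range(len(points))]
    return centers, max(d)


# Toy instance (assumed data): 3 + 3 input tuples yield 5 result tuples.
R = [(0, 0), (1, 0), (10, 1)]
S = [(0, 0), (0, 1), (1, 10)]
results = natural_join(R, S)
centers, radius = greedy_k_center(results, k=2)
```

The join step is what makes this baseline expensive: its output can be polynomially larger than the input, and the greedy pass then scans every materialized tuple once per chosen center.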

Cited By

  • (2024) Output-sensitive Conjunctive Query Evaluation. Proceedings of the ACM on Management of Data 2:5 (1-24). DOI: 10.1145/3695838. Online publication date: 7-Nov-2024
  • (2024) Improved Approximation Algorithms for Relational Clustering. Proceedings of the ACM on Management of Data 2:5 (1-27). DOI: 10.1145/3695831. Online publication date: 7-Nov-2024

Information

Published In

Proceedings of the ACM on Management of Data, Volume 2, Issue 5
PODS
November 2024
363 pages
EISSN:2836-6573
DOI:10.1145/3703846
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 November 2024
Published in PACMMOD Volume 2, Issue 5

Author Tags

  1. conjunctive queries
  2. coresets
  3. diversity
  4. oracles
  5. relational data

Qualifiers

  • Research-article

Funding Sources

  • US-Israel Binational Science Foundation Grant
  • NSF
  • NSERC Discovery Grant

Article Metrics

  • Downloads (Last 12 months): 51
  • Downloads (Last 6 weeks): 13
Reflects downloads up to 08 Mar 2025
