research-article

Current Trends in Data Summaries

Author: Graham Cormode

ACM SIGMOD Record, Volume 50, Issue 4

Pages 6 - 15

https://doi.org/10.1145/3516431.3516433

Published: 31 January 2022

Abstract

The research area of data summarization seeks to find small data structures that can be updated flexibly, and answer certain queries on the input accurately. Summaries are widely used across the area of data management, and are studied from both theoretical and practical perspectives. They are the subject of ongoing research to improve their performance and broaden their applicability. In this column, recent developments in data summarization are surveyed, with the intent of inspiring further advances.

Index Terms

Current Trends in Data Summaries
1. Information systems
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Index terms have been assigned to the content through auto-classification.

Published In

ACM SIGMOD Record Volume 50, Issue 4

December 2021

48 pages

ISSN:0163-5808

DOI:10.1145/3516431

Editors:
Rada Chirkova (North Carolina State University)
Vanessa Braganholo (Universidade Federal Fluminense)
Wim Martens (University of Bayreuth)
Manos Athanassoulis (DBrainstorming)
Marcelo Arenas (Research Highlights)
Marianne Winslett (University of Illinois)
Jun Yang (Duke University)
Susan B. Davidson (The Future of Data(base) Education)
Lyublena Antova (Datometry)
Aaron J. Elmore (University of Chicago)
Kyriakos Mouratidis (Singapore Management University)
Dan Olteanu (University of Oxford)
Immanuel Trummer (Cornell University)
Yannis Velegrakis (Utrecht University)
Renata Borovica-Gajic (Surveys)
Tamer Özsu (University of Waterloo)
Pınar Tözün (IT University of Copenhagen)
Wook-Shin Han (Research and Vision columns)
Kenneth Ross (Research Highlights)

Copyright © 2022 Copyright is held by the owner/author(s).

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States
