Current Trends in Data Summaries

Published: 31 January 2022

Abstract

The research area of data summarization seeks small data structures that can be updated flexibly and that answer certain queries on the input accurately. Summaries are widely used across data management and are studied from both theoretical and practical perspectives. They are the subject of ongoing research to improve their performance and broaden their applicability. This column surveys recent developments in data summarization, with the intent of inspiring further advances.

Published In

ACM SIGMOD Record, Volume 50, Issue 4
December 2021
48 pages
ISSN: 0163-5808
DOI: 10.1145/3516431
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Article Metrics

  • Downloads (Last 12 months): 66
  • Downloads (Last 6 weeks): 5
Reflects downloads up to 07 Mar 2025

Cited By

  • (2025) Randomized Sketches for Quantile in LSM-tree based Store. Proceedings of the ACM on Management of Data 3:1 (1-26). DOI: 10.1145/3709717. Online publication date: 11-Feb-2025
  • (2024) A Universal Sketch for Estimating Heavy Hitters and Per-Element Frequency Moments in Data Streams with Bounded Deletions. Proceedings of the ACM on Management of Data 2:6 (1-28). DOI: 10.1145/3698799. Online publication date: 20-Dec-2024
  • (2024) Convolution and Cross-Correlation of Count Sketches Enables Fast Cardinality Estimation of Multi-Join Queries. Proceedings of the ACM on Management of Data 2:3 (1-26). DOI: 10.1145/3654932. Online publication date: 30-May-2024
  • (2023) Elastic Data Binning: Time-Series Sketching for Time-Domain Astrophysics Analysis. ACM SIGAPP Applied Computing Review 23:2 (5-22). DOI: 10.1145/3610409.3610410. Online publication date: 19-Jul-2023
  • (2023) Elastic Data Binning: Time-Series Sketching for Time-Domain Astrophysics Analysis. ACM SIGAPP Applied Computing Review 23:2 (5-22). DOI: 10.1145/3610019.3610020. Online publication date: 17-Jul-2023
  • (2023) Cluster based similarity extraction upon distributed datasets. Cluster Computing 27:3 (2917-2929). DOI: 10.1007/s10586-023-04116-5. Online publication date: 25-Aug-2023
