Mergeable summaries

Published: 04 December 2013

Abstract

We study the mergeability of data summaries. Informally speaking, mergeability requires that, given two summaries on two datasets, there is a way to merge the two summaries into a single summary on the two datasets combined together, while preserving the error and size guarantees. This property means that the summaries can be merged in a way akin to other algebraic operators such as sum and max, which is especially useful for computing summaries on massive distributed data. Several data summaries are trivially mergeable by construction, most notably all the sketches that are linear functions of the datasets. But some other fundamental ones, like those for heavy hitters and quantiles, are not (known to be) mergeable. In this article, we demonstrate that these summaries are indeed mergeable or can be made mergeable after appropriate modifications. Specifically, we show that for ε-approximate heavy hitters, there is a deterministic mergeable summary of size O(1/ε); for ε-approximate quantiles, there is a deterministic summary of size O((1/ε) log(εn)) that has a restricted form of mergeability, and a randomized one of size O((1/ε) log^{3/2}(1/ε)) with full mergeability. We also extend our results to geometric summaries such as ε-approximations which permit approximate multidimensional range counting queries. While most of the results in this article are theoretical in nature, some of the algorithms are actually very simple and even perform better than the previously best known algorithms, which we demonstrate through experiments in a simulated sensor network.
We also achieve two results of independent interest: (1) we provide the best known randomized streaming bound for ε-approximate quantiles that depends only on ε, of size O((1/ε) log^{3/2}(1/ε)), and (2) we demonstrate that the MG and the SpaceSaving summaries for heavy hitters are isomorphic.
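
To make the notion of mergeability concrete for the heavy-hitters case, the following is a minimal Python sketch of merging two MG-style counter summaries. It is an illustration under stated assumptions rather than the article's exact procedure: the names `mg_summarize` and `mg_merge` are hypothetical, `k` is an assumed counter budget (on the order of 1/ε), and the merge rule shown (add matching counters, then subtract the (k+1)-th largest count and drop non-positive counters) is the commonly described counter-based approach.

```python
from collections import Counter


def mg_summarize(items, k):
    """Build an MG-style summary of `items` using at most k counters.

    Hypothetical helper for illustration; k plays the role of roughly 1/epsilon.
    """
    counters = {}
    for x in items:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            # No free counter: decrement every counter and drop those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


def mg_merge(s1, s2, k):
    """Merge two summaries into a single summary with at most k counters."""
    combined = Counter(s1)
    combined.update(s2)  # add up the counters item by item
    if len(combined) <= k:
        return dict(combined)
    # Prune: subtract the (k+1)-th largest count from every counter
    # and keep only the counters that remain strictly positive.
    threshold = sorted(combined.values(), reverse=True)[k]
    return {x: c - threshold for x, c in combined.items() if c > threshold}


if __name__ == "__main__":
    left = ["a"] * 50 + ["b"] * 30 + list(range(20))
    right = ["a"] * 40 + ["c"] * 35 + list(range(100, 125))
    merged = mg_merge(mg_summarize(left, 4), mg_summarize(right, 4), 4)
    print(merged)
```

In this sketch the merged summary never exceeds `k` counters, and each stored count underestimates the true frequency by at most n/(k+1), where n is the combined number of items. That is the sense in which a merge can preserve both the size and the error guarantee of the inputs.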

Published In

ACM Transactions on Database Systems, Volume 38, Issue 4
Invited papers issue
November 2013
294 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/2539032
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 December 2013
Accepted: 01 June 2013
Revised: 01 April 2013
Received: 01 October 2012
Published in TODS Volume 38, Issue 4

Author Tags

  1. Data summarization
  2. heavy hitters
  3. quantiles

Qualifiers

  • Research-article
  • Research
  • Refereed
