DOI: 10.1145/2745754.2745781

Compact Summaries over Large Datasets

Published: 20 May 2015

Abstract

A fundamental challenge in processing the massive quantities of information generated by modern applications is extracting suitable representations of the data that can be stored, manipulated, and interrogated on a single machine. A promising approach is the design and analysis of compact summaries: data structures that capture key features of the data and that can be created effectively over distributed data sets. Popular summary structures include count-distinct algorithms, which compactly approximate item set cardinalities, and sketches, which allow vector norms and products to be estimated. These are very attractive because they can be computed in parallel and combined to yield a single, compact summary of the data. This tutorial introduces the concepts and examples of compact summaries.
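
The property the abstract emphasizes, that summaries built in parallel can be combined into one, can be illustrated with a small Count-Min sketch. The code below is a minimal sketch under stated assumptions, not the tutorial's own implementation: the class name `CountMinSketch`, its method names, the default dimensions, and the use of salted BLAKE2b as the per-row hash are all illustrative choices made here.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min sketch: `depth` rows of `width` counters.

    Hypothetical defaults: width = ceil(e / 0.01) = 272 and
    depth = ceil(ln(1 / 0.01)) = 5, the usual Count-Min parameterization.
    """

    def __init__(self, width=272, depth=5):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash per row, obtained by salting a stable hash with the row
        # number. Assumption: this stands in for the pairwise-independent
        # hash family used in the formal analysis.
        digest = hashlib.blake2b(
            str(item).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for r in range(self.depth):
            self.rows[r][self._index(item, r)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum over rows
        # is the tightest available upper bound on the true frequency.
        return min(self.rows[r][self._index(item, r)] for r in range(self.depth))

    def merge(self, other):
        # Sketches with the same dimensions and hashes merge by adding
        # counters entry by entry.
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must share dimensions and hash functions")
        merged = CountMinSketch(self.width, self.depth)
        for r in range(self.depth):
            merged.rows[r] = [a + b for a, b in zip(self.rows[r], other.rows[r])]
        return merged


# Two partitions of a stream, summarized independently (e.g. on two machines).
part_a = ["x", "y", "x", "z"]
part_b = ["x", "y", "w"]

sketch_a, sketch_b = CountMinSketch(), CountMinSketch()
for item in part_a:
    sketch_a.add(item)
for item in part_b:
    sketch_b.add(item)

combined = sketch_a.merge(sketch_b)

# Reference: one sketch built over the concatenated stream.
whole = CountMinSketch()
for item in part_a + part_b:
    whole.add(item)

print(combined.rows == whole.rows)   # merging matches summarizing everything at once
print(combined.estimate("x"))        # never below the true count of 3
```

Because every update is a counter addition, the merged sketch is identical to the one built over the whole stream, which is what makes this kind of summary suitable for distributed data. The estimate may exceed the true count when items collide, but it never falls below it.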


Cited By

  • (2020) Everything you always wanted to know about a dataset. International Journal of Human-Computer Studies 135:C. DOI: 10.1016/j.ijhcs.2019.10.004. Online publication date: 1-Mar-2020
  • (2019) Scalable machine learning computing a data summarization matrix with a parallel array DBMS. Distributed and Parallel Databases 37:3 (329-350). DOI: 10.1007/s10619-018-7229-1. Online publication date: 1-Sep-2019

Published In

PODS '15: Proceedings of the 34th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems
May 2015
358 pages
ISBN: 9781450327572
DOI: 10.1145/2745754
General Chair: Tova Milo
Program Chair: Diego Calvanese

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. approximate counting
  2. sketches
  3. summaries

Qualifiers

  • Research-article

Funding Sources

  • European Research Council
  • Royal Society
  • Yahoo Research

Conference

SIGMOD/PODS'15: International Conference on Management of Data
May 31 - June 4, 2015
Melbourne, Victoria, Australia

Acceptance Rates

PODS '15 Paper Acceptance Rate: 25 of 80 submissions, 31%
Overall Acceptance Rate: 642 of 2,707 submissions, 24%
