research-article

Compact Summaries over Large Datasets

Author: Graham Cormode

PODS '15: Proceedings of the 34th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems

Pages 157 - 158

https://doi.org/10.1145/2745754.2745781

Published: 20 May 2015

Abstract

A fundamental challenge in processing the massive quantities of information generated by modern applications is extracting suitable representations of the data that can be stored, manipulated and interrogated on a single machine. A promising approach is the design and analysis of compact summaries: data structures that capture key features of the data and can be created effectively over distributed data sets. Popular summary structures include count-distinct algorithms, which compactly approximate the cardinalities of item sets, and sketches, which allow vector norms and products to be estimated. These are very attractive, since they can be computed in parallel and combined to yield a single, compact summary of the data. This tutorial introduces the concepts and examples of compact summaries.
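To make the "computed in parallel and combined" property concrete, here is a minimal sketch of one possible count-distinct summary, a k-minimum-values (KMV) sketch. It is an illustration under stated assumptions, not necessarily the construction presented in the tutorial: the class name `KMVSketch`, the parameter `k`, and the choice of BLAKE2b as the hash function are all assumptions made for this example. The key point is that merging two summaries built on separate partitions gives the same summary as one built over all the data.

```python
import hashlib
import heapq


class KMVSketch:
    """Illustrative k-minimum-values summary for approximate distinct counting.

    Keeps only the k smallest 64-bit hash values seen, so space is O(k)
    regardless of how many items are added. (Hypothetical helper class.)
    """

    HASH_RANGE = 2 ** 64

    def __init__(self, k=256):
        self.k = k
        self.values = set()  # the (at most) k smallest hash values so far

    @staticmethod
    def _hash(item):
        # Assumption: BLAKE2b truncated to 8 bytes acts as a uniform hash.
        digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, item):
        self.values.add(self._hash(item))
        if len(self.values) > self.k:
            # O(k) eviction of the largest value; fine for a small sketch.
            self.values.remove(max(self.values))

    def merge(self, other):
        # The k smallest hashes of the union are always contained in the
        # union of the two k-smallest sets, so merging loses no information.
        assert self.k == other.k, "sketches must share the same k"
        merged = KMVSketch(self.k)
        merged.values = set(heapq.nsmallest(self.k, self.values | other.values))
        return merged

    def estimate(self):
        if len(self.values) < self.k:
            # Fewer than k distinct hashes seen: the count is exact.
            return float(len(self.values))
        # If n distinct hashes are uniform on [0, 1), the k-th smallest is
        # near k / n, which gives the estimator (k - 1) / (k-th smallest).
        return (self.k - 1) * self.HASH_RANGE / max(self.values)


if __name__ == "__main__":
    # Two "machines" each summarize their own partition of the data.
    left, right = KMVSketch(), KMVSketch()
    for i in range(0, 60_000):
        left.add(i)
    for i in range(40_000, 100_000):
        right.add(i)
    combined = left.merge(right)
    print("estimated distinct items:", round(combined.estimate()))
```

The merge step touches only the two small summaries, never the raw data, which is what allows each partition to be processed independently before combining. A larger `k` costs more space and gives a tighter estimate.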

Published In

PODS '15: Proceedings of the 34th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems

May 2015

358 pages

ISBN: 9781450327572

DOI: 10.1145/2745754

General Chair: Tova Milo, Tel Aviv University, Israel

Program Chair: Diego Calvanese, Free University of Bozen-Bolzano, Italy

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

European Research Council
Royal Society
Yahoo Research

Conference

SIGMOD/PODS'15

Sponsor:

SIGMOD

SIGMOD/PODS'15: International Conference on Management of Data

May 31 - June 4, 2015

Melbourne, Victoria, Australia

Acceptance Rates

PODS '15 Paper Acceptance Rate 25 of 80 submissions, 31%;

Overall Acceptance Rate 642 of 2,707 submissions, 24%
