Abstract
A fundamental challenge in processing the massive quantities of information generated by modern applications is extracting suitable representations of the data that can be stored, manipulated, and interrogated on a single machine. A promising approach is the design and analysis of compact summaries: data structures that capture key features of the data and that can be created effectively over distributed, streaming data. Popular summary structures include count-distinct algorithms, which compactly approximate the cardinality of item sets, and sketches, which allow vector norms and products to be estimated. These are very attractive because they can be computed in parallel and combined to yield a single, compact summary of the data. This talk introduces the concepts behind compact summaries and presents examples of them.
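As an illustration of a summary that can be computed in parallel and then combined, the following is a minimal Python sketch of a Count-Min sketch with a merge operation. It is an illustrative example under stated assumptions, not the construction presented in the talk: the class name `CountMinSketch`, the default `width` and `depth`, and the use of salted BLAKE2b as the per-row hash function are all choices made here for the example.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min sketch: a depth x width table of counters."""

    def __init__(self, width=256, depth=4):
        # width and depth are example defaults, not tuned values.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, row, item):
        # One hash function per row, obtained by salting BLAKE2b with the row index.
        digest = hashlib.blake2b(
            str(item).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, item, count=1):
        # Add the count to one counter in every row.
        for row in range(self.depth):
            self.table[row][self._bucket(row, item)] += count

    def estimate(self, item):
        # With non-negative counts, collisions only inflate counters,
        # so the minimum over rows never underestimates the true count.
        return min(
            self.table[row][self._bucket(row, item)] for row in range(self.depth)
        )

    def merge(self, other):
        # Sketches built with the same shape and hashes combine by
        # entry-wise addition of their counter tables.
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must share width and depth")
        merged = CountMinSketch(self.width, self.depth)
        merged.table = [
            [a + b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(self.table, other.table)
        ]
        return merged


# Two partitions of one stream, summarized independently and then combined.
part_a = ["x", "y", "x", "z"]
part_b = ["x", "w", "y"]

sketch_a, sketch_b, sketch_all = CountMinSketch(), CountMinSketch(), CountMinSketch()
for item in part_a:
    sketch_a.update(item)
for item in part_b:
    sketch_b.update(item)
for item in part_a + part_b:
    sketch_all.update(item)

merged = sketch_a.merge(sketch_b)
print(merged.table == sketch_all.table)  # merging matches summarizing the whole stream
print(merged.estimate("x"))              # never less than the true count of 3
```

Because every update is an addition to fixed counters, the merged table is identical to the one obtained by processing the whole stream in a single pass, which is the property that makes such summaries suitable for distributed data.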
Acknowledgments
This work was supported in part by a Royal Society Wolfson Research Merit Award, funding from the Yahoo Research Faculty Research and Engagement Program, and European Research Council (ERC) Consolidator Grant ERC-CoG-2014-647557.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Cormode, G. (2015). Streaming Methods in Data Analysis. In: Maneth, S. (eds) Data Science. BICOD 2015. Lecture Notes in Computer Science, vol 9147. Springer, Cham. https://doi.org/10.1007/978-3-319-20424-6_1
DOI: https://doi.org/10.1007/978-3-319-20424-6_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-20423-9
Online ISBN: 978-3-319-20424-6
eBook Packages: Computer Science, Computer Science (R0)