Abstract
Prompted by the need to compute holistic properties of increasingly large data sets, the notion of the “summary” data structure has emerged in recent years as an important concept. Summary structures can be built over large, distributed data, and provide guaranteed performance for a variety of data summarization tasks. Various types of summaries are known: summaries based on random sampling; summaries formed as linear sketches of the input data; and other summaries designed for a specific problem at hand.
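To make the idea of a linear sketch built over distributed data concrete, the following is a minimal illustrative sketch in Python and is not taken from the chapter itself. It assumes a Count-Min style counter array; the class name `CountMinSketch`, the `width` and `depth` parameters, and the use of salted BLAKE2b hashing are all choices made for this example. Because every update only adds to counters, two summaries built independently at different sites can be combined by adding their arrays entry by entry, and the result matches a summary built over the union of the data.

```python
import hashlib


class CountMinSketch:
    """Illustrative linear sketch for approximate frequency counts."""

    def __init__(self, width=256, depth=4):
        self.width = width    # counters per row (assumed parameter)
        self.depth = depth    # number of independent hash rows (assumed parameter)
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # One hash function per row, derived by salting BLAKE2b with the row number.
        digest = hashlib.blake2b(
            str(item).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, item, count=1):
        # Add the count to one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # With non-negative updates, collisions can only inflate a counter,
        # so the minimum over rows never underestimates the true count.
        return min(
            self.table[row][self._index(row, item)] for row in range(self.depth)
        )

    def merge(self, other):
        # Linearity: the sketch of the combined data is the entrywise sum.
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must share width and depth")
        merged = CountMinSketch(self.width, self.depth)
        for row in range(self.depth):
            merged.table[row] = [
                a + b for a, b in zip(self.table[row], other.table[row])
            ]
        return merged


# Two sites summarize their own data, then the summaries are combined.
site_a = CountMinSketch()
site_b = CountMinSketch()
for item in ["x", "y", "x"]:
    site_a.update(item)
for item in ["x", "z"]:
    site_b.update(item)
combined = site_a.merge(site_b)
print(combined.estimate("x"))
```

The merged estimate for an item is an upper bound on its true count across both sites; how tight that bound is depends on the chosen `width` and `depth`, which this example does not tune.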
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Cormode, G. (2013). Summary Data Structures for Massive Data. In: Bonizzoni, P., Brattka, V., Löwe, B. (eds) The Nature of Computation. Logic, Algorithms, Applications. CiE 2013. Lecture Notes in Computer Science, vol 7921. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39053-1_9
DOI: https://doi.org/10.1007/978-3-642-39053-1_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-39052-4
Online ISBN: 978-3-642-39053-1
eBook Packages: Computer Science, Computer Science (R0)