Definition
Stream sampling is the process of collecting a representative sample of the elements of a data stream. The sample is usually much smaller than the entire stream, but it can be designed to retain many important characteristics of the stream and can be used to estimate many important aggregates over it. Unlike sampling from a stored data set, stream sampling must be performed online, as the data arrives. Any element that is not stored in the sample is lost forever and cannot be retrieved. This article discusses various methods of sampling from a data stream and applications of these methods.
Historical Background
An early algorithm for maintaining a random sample of a data stream is the reservoir sampling algorithm due to Vitter [15]. More recent algorithms based on random sampling have been inspired by the work of Alon et al. [1]. Random sampling has long been used to process data in stored databases; the reader is referred to [13] for a survey.
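To make the idea of maintaining a sample online concrete, the following is a minimal sketch of the basic reservoir scheme (often called Algorithm R), not the optimized skipping variants analyzed in [15]. The function name `reservoir_sample` and its parameters are illustrative assumptions, not taken from the entry. The sketch keeps a uniform random sample of at most k elements in a single pass, using memory proportional to k rather than to the length of the stream.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return a uniform random sample of up to k elements from one pass over stream.

    Illustrative sketch of basic reservoir sampling; names are hypothetical.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k elements fill the reservoir directly.
            reservoir.append(item)
        else:
            # Element number i + 1 is kept with probability k / (i + 1):
            # draw j uniformly from [0, i] and overwrite slot j if j < k.
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
        # An element that is not stored at this point is never seen again.
    return reservoir
```

For example, `reservoir_sample(range(1_000_000), 100)` would return 100 of the stream's elements without ever holding more than 100 of them in memory. Passing a seeded `random.Random` instance makes a run reproducible.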
Foundations
Recommended Reading
Alon N., Matias Y., and Szegedy M. The space complexity of approximating the frequency moments. J. Comput. Syst. Sci., 58(1):137–147, 1999.
Babcock B., Datar M., and Motwani R. Sampling from a moving window over streaming data. In Proc. ACM-SIAM Symp. on Discrete Algorithms, 2002, pp. 633–634.
Chakrabarti A., Cormode G., and McGregor A. A near-optimal algorithm for computing the entropy of a stream. In Proc. ACM-SIAM Symp. on Discrete Algorithms, 2007, pp. 328–335.
Cohen E. and Strauss M. Maintaining time-decaying stream aggregates. In Proc. 22nd ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, 2003, pp. 223–233.
Cormode G., Muthukrishnan S., and Rozenbaum I. Summarizing and mining inverse distributions on data streams via dynamic inverse sampling. In Proc. 31st Int. Conf. on Very Large Data Bases, 2005, pp. 25–36.
Frahling G., Indyk P., and Sohler C. Sampling in dynamic data streams and applications. In Proc. 21st Annual ACM Symp. on Computational Geometry, 2005, pp. 142–149.
Ganguly S. Counting distinct items over update streams. Theor. Comput. Sci., 378(3):211–222, 2007.
Gibbons P. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In Proc. 27th Int. Conf. on Very Large Data Bases, 2001, pp. 541–550.
Gibbons P. and Tirthapura S. Estimating simple functions on the union of data streams. In Proc. ACM Symp. on Parallel Algorithms and Architectures, 2001, pp. 281–291.
Gibbons P. and Tirthapura S. Distributed streams algorithms for sliding windows. Theor. Comput. Syst., 37:457–478, 2004.
Manku G.S. and Motwani R. Approximate frequency counts over data streams. In Proc. 28th Int. Conf. on Very Large Data Bases, 2002, pp. 346–357.
Manku G.S., Rajagopalan S., and Lindsay B.G. Random sampling techniques for space efficient online computation of order statistics of large datasets. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1999, pp. 251–262.
Olken F. and Rotem D. Random sampling from databases - a survey. Stat. Comput., 5(1):43–57, 1995.
Pavan A. and Tirthapura S. Range-efficient counting of distinct elements in a massive data stream. SIAM J. Comput., 37(2):359–379, 2007.
Vitter J.S. Random sampling with a reservoir. ACM Trans. Math. Softw., 11(1):37–57, 1985.
© 2009 Springer Science+Business Media, LLC
About this entry
Cite this entry
Lahiri, B., Tirthapura, S. (2009). Stream Sampling. In: LIU, L., ÖZSU, M.T. (eds) Encyclopedia of Database Systems. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-39940-9_372
DOI: https://doi.org/10.1007/978-0-387-39940-9_372
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-35544-3
Online ISBN: 978-0-387-39940-9
eBook Packages: Computer Science; Reference Module Computer Science and Engineering