Data-Stream Sampling: Basic Techniques and Results

Haas, Peter J.

doi:10.1007/978-3-540-28608-0_2

Data-Stream Sampling: Basic Techniques and Results

Peter J. Haas⁶

Chapter
First Online: 12 July 2016

3562 Accesses
20 Citations

Part of the book series: Data-Centric Systems and Applications ((DCSA))

Abstract

Perhaps the most basic synopsis of a data stream is a sample of elements from the stream. A key benefit of such a sample is its flexibility: the sample can serve as input to a wide variety of analytical procedures and can be reduced further to provide many additional data synopses. If, in particular, the sample is collected using random sampling techniques, then the sample can form a basis for statistical inference about the contents of the stream. This chapter surveys some basic sampling and inference techniques for data streams. We focus on general methods for materializing a sample; later chapters provide specialized sampling methods for specific analytic tasks.

This is a preview of subscription content, log in via an institution.

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 54.99; Price excludes VAT (USA)

Softcover Book: USD 69.99; Price excludes VAT (USA)

Hardcover Book: USD 99.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

W.G. Cochran, Sampling Techniques, 3rd edn. (Wiley, New York, 1977)
MATH Google Scholar
L. Kish, Survey Sampling (Wiley, New York, 1965)
MATH Google Scholar
C.E. Särndal, B. Swensson, J. Wretman, Model Assisted Survey Sampling (Springer, New York, 1992)
Book MATH Google Scholar
M.E. Thompson, Theory of Sample Surveys (Chapman & Hall, London, 1997)
Book MATH Google Scholar
S.K. Thompson, Sampling (Wiley, New York, 2002)
MATH Google Scholar
B. Efron, R.J. Tibshirani, An Introduction to the Bootstrap (Chapman & Hall, New York, 1993)
Book MATH Google Scholar
R. Gemulla, W. Lehner, Deferred maintenance of disk-based random samples, in Proc. EDBT. Lecture Notes in Computer Science (Springer, Berlin, 2006), pp. 423–441
Google Scholar
A. Pol, C.M. Jermaine, S. Arumugam, Maintaining very large random samples using the geometric file. VLDB J. 17(5), 997–1018 (2008)
Article Google Scholar
P.G. Brown, P.J. Haas, Techniques for warehousing of sample data, in Proc. 22nd ICDE (2006)
Google Scholar
P.J. Haas, The need for speed: speeding up DB2 using sampling. IDUG Solut. J. 10, 32–34 (2003)
Google Scholar
S. Chaudhuri, R. Motwani, V.R. Narasayya, Random sampling for histogram construction: how much is enough? in Proc. ACM SIGMOD (1998), pp. 436–447
Google Scholar
D. DeWitt, J.F. Naughton, D.A. Schneider, S. Seshadri, Practical skew handling algorithms for parallel joins, in Proc. 19th VLDB (1992), pp. 27–40
Google Scholar
P.J. Haas, C. König, A bi-level Bernoulli scheme for database sampling, in Proc. ACM SIGMOD (2004), pp. 275–286
Google Scholar
W. Hou, G. Ozsoyoglu, B. Taneja, Statistical estimators for relational algebra expressions, in Proc. Seventh PODS (1988), pp. 276–287
Google Scholar
P.J. Haas, J.M. Hellerstein, Ripple joins for online aggregation, in Proc. ACM SIGMOD (1999), pp. 287–298
Google Scholar
S. Acharya, P. Gibbons, V. Poosala, S. Ramaswamy, Join synopses for approximate query answering, in Proc. ACM SIGMOD (1999), pp. 275–286
Google Scholar
S. Acharya, P. Gibbons, V. Poosala, Congressional samples for approximate answering of group-by queries, in Proc. ACM SIGMOD (2000), pp. 487–498
Google Scholar
S. Chaudhuri, G. Das, M. Datar, R. Motwani, V.R. Narasayya, Overcoming limitations of sampling for aggregation queries, in Proc. Seventeenth ICDE (2001), pp. 534–542
Google Scholar
S. Chaudhuri, R. Motwani, V.R. Narasayya, On random sampling over joins, in Proc. ACM SIGMOD (1999), pp. 263–274
Google Scholar
S. Ganguly, P.B. Gibbons, Y. Matias, A. Silberschatz, Bifocal sampling for skew-resistant join size estimation, in Proc. ACM SIGMOD (1996), pp. 271–281
Google Scholar
V. Ganti, M.L. Lee, R. Ramakrishnan, ICICLES: self-tuning samples for approximate query answering, in Proc. 26th VLDB (2000), pp. 176–187
Google Scholar
P.J. Haas, A.N. Swami, Sampling-based selectivity estimation using augmented frequent value statistics, in Proc. Eleventh ICDE (1995), pp. 522–531
Google Scholar
C. Jermaine, Robust estimation with sampling and approximate pre-aggregation, in Proc. 29th VLDB (2003), pp. 886–897
Google Scholar
W. Hou, G. Ozsoyoglu, B. Taneja, Processing aggregate relational queries with hard time constraints, in Proc. ACM SIGMOD (1989), pp. 68–77
Google Scholar
F. Olken, D. Rotem, Simple random sampling from relational databases, in Proc. 12th VLDB (1986), pp. 160–169
Google Scholar
F. Olken, D. Rotem, Random sampling from \(B^{+}\) trees, in Proc. 15th VLDB (1989), pp. 269–277
Google Scholar
F. Olken, D. Rotem, Maintenance of materialized views of sampling queries, in Proc. Eighth ICDE (1992), pp. 632–641
Google Scholar
F. Olken, D. Rotem, Sampling from spatial databases, in Proc. Ninth ICDE (1993), pp. 199–208
Google Scholar
F. Olken, D. Rotem, P. Xu, Random sampling from hash files, in Proc. ACM SIGMOD (1990), pp. 375–386
Google Scholar
P.J. Haas, J.F. Naughton, S. Seshadri, A.N. Swami, Selectivity and cost estimation for joins based on random sampling. J. Comput. Syst. Sci. 52, 550–569 (1996)
Article MathSciNet MATH Google Scholar
P.J. Haas, J.F. Naughton, A.N. Swami, On the relative cost of sampling for join selectivity estimation, in Proc. Thirteenth PODS (1994), pp. 14–24
Google Scholar
P.J. Haas, A.N. Swami, Sequential sampling procedures for query size estimation, in Proc. ACM SIGMOD (1992), pp. 1–11
Google Scholar
W. Hou, G. Ozsoyoglu, E. Dogdu, Error-constrained COUNT query evaluation in relational databases, in Proc. ACM SIGMOD (1991), pp. 278–287
Google Scholar
R.J. Lipton, J.F. Naughton, Query size estimation by adaptive sampling, in Proc. Ninth PODS (1990), pp. 40–46
Google Scholar
R.J. Lipton, J.F. Naughton, D.A. Schneider, Practical selectivity estimation through adaptive sampling, in Proc. ACM SIGMOD (1990), pp. 1–11
Google Scholar
R.J. Lipton, J.F. Naughton, D.A. Schneider, S. Seshadri, Efficient sampling strategies for relational database operations. Theor. Comput. Sci. 116, 195–226 (1993)
Article MathSciNet MATH Google Scholar
K.D. Seppi, J.W. Barnes, C.N. Morris, A Bayesian approach to database query optimization. ORSA J. Comput. 5, 410–419 (1993)
Article MATH Google Scholar
J.M. Hellerstein, P.J. Haas, H.J. Wang, Online aggregation, in Proc. ACM SIGMOD (1997), pp. 171–182
Google Scholar
C. Jermaine, A. Dobra, S. Arumugam, S. Joshi, A. Pol, A disk-based join with probabilistic guarantees, in Proc. ACM SIGMOD (2005)
Google Scholar
C.M. Jermaine, S. Arumugam, A. Pol, A. Dobra, Scalable approximate query processing with the DBO engine, in Proc. ACM SIGMOD (2007), pp. 725–736
Google Scholar
C. Jermaine, A. Dobra, A. Pol, S. Joshi, Online estimation for subset-based SQL queries, in Proc. 31st VLDB (2005), pp. 745–756
Google Scholar
G. Luo, C. Ellman, P.J. Haas, J.F. Naughton, A scalable hash ripple join algorithm, in Proc. ACM SIGMOD (2002), pp. 252–262
Google Scholar
A. Pol, C. Jermaine, Relational confidence bounds are easy with the bootstrap, in Proc. ACM SIGMOD (2005)
Google Scholar
P.G. Brown, P.J. Haas, BHUNT: automatic discovery of fuzzy algebraic constraints in relational data, in Proc. 29th VLDB (2003), pp. 668–679
Google Scholar
I.F. Ilyas, V. Markl, P.J. Haas, P.G. Brown, A. Aboulnaga, CORDS: automatic discovery of correlations and soft functional dependencies, in Proc. ACM SIGMOD (2004), pp. 647–658
Google Scholar
P. Brown, P. Haas, J. Myllymaki, H. Pirahesh, B. Reinwald, Y. Sismanis, Toward automated large-scale information integration and discovery, in Data Management in a Connected World, ed. by T. Härder, W. Lehner (Springer, New York, 2005)
Google Scholar
M. Charikar, S. Chaudhuri, R. Motwani, V.R. Narasayya, Towards estimation error guarantees for distinct values, in Proc. Nineteenth PODS (2000), pp. 268–279
Google Scholar
P.J. Haas, L. Stokes, Estimating the number of classes in a finite population. J. Am. Stat. Assoc. 93, 1475–1487 (1998)
Article MathSciNet MATH Google Scholar
M. Wu, C. Jermaine, A Bayesian method for guessing the extreme values in a data set, in Proc. 33rd VLDB (2007), pp. 471–482
Google Scholar
P. Billingsley, Probability and Measure, 2nd edn. (Wiley, New York, 1986)
MATH Google Scholar
A.M. Law, Simulation Modeling and Analysis, 4th edn. (McGraw-Hill, New York, 2007)
Google Scholar
D.E. Knuth, The Art of Computer Programming, vol. 2: Seminumerical Algorithms (Addison-Wesley, Reading, 1969)
MATH Google Scholar
A.I. McLeod, D.R. Bellhouse, A convenient algorithm for drawing a simple random sample. Appl. Stat. 32, 182–184 (1983)
Article MATH Google Scholar
J.S. Vitter, Random sampling with a reservoir. ACM Trans. Math. Softw. 11, 37–57 (1985)
Article MathSciNet MATH Google Scholar
J.S. Vitter, Faster methods for random sampling. Commun. ACM 27, 703–718 (1984)
Article MathSciNet MATH Google Scholar
M.T. Chao, A general purpose unequal probability sampling plan. Biometrika 69, 653–656 (1982)
Article MathSciNet MATH Google Scholar
H. Brönnimann, B. Chen, M. Dash, P.J. Haas, Y. Qiao, P. Scheuermann, Efficient data reduction methods for on-line association rule discovery, in Data Mining: Next Generation Challenges and Future Directions, ed. by H. Kargupta, A. Joshi, K. Sivakumar, Y. Yesha (AAAI Press, Menlo Park, 2004)
Google Scholar
B. Babcock, M. Datar, R. Motwani, Sampling from a moving window over streaming data, in Proc. 13th SODA (2002), pp. 633–634
Google Scholar
T. Hagerup, C. Rub, A guided tour of Chernoff bounds. Inf. Process. Lett. 33, 305–308 (1990)
Article MathSciNet MATH Google Scholar
W. Feller, An Introduction to Probability Theory and Its Applications, 3rd edn., vol. 1 (Wiley, New York, 1968)
MATH Google Scholar
R. Gemulla, W. Lehner, P.J. Haas, Maintaining bounded-size sample synopses of evolving datasets. VLDB J. 17(2), 173–202 (2008)
Article Google Scholar
R. Gemulla, W. Lehner, P.J. Haas, Maintaining Bernoulli samples over evolving multisets, in Proc. Twenty Sixth PODS (2007), pp. 93–102
Google Scholar
G. Cormode, S. Muthukrishnan, I. Rozenbaum, Summarizing and mining inverse distributions on data streams via dynamic inverse sampling, in Proc. 31st VLDB (2005), pp. 25–36
Google Scholar
G. Frahling, P. Indyk, C. Sohler, Sampling in dynamic data streams and applications, in Proc. 21st ACM Symp. Comput. Geom. (2005), pp. 142–149
Google Scholar
P.B. Gibbons, Distinct sampling for highly-accurate answers to distinct values queries and event reports, in Proc. 27th VLDB (2001), pp. 541–550
Google Scholar

Download references

Author information

Authors and Affiliations

IBM Almaden Research Center, San Jose, CA, USA
Peter J. Haas

Authors

Peter J. Haas
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Peter J. Haas .

Editor information

Editors and Affiliations

University Campus - Kounoupidiana, School of ECE, Techn. Univ. of Crete University Campus - Kounoupidiana, Chania, Greece
Minos Garofalakis
Microsoft Corporation, Redmond, Washington, USA
Johannes Gehrke
Amazon India , Bangalore, India
Rajeev Rastogi

Rights and permissions

Reprints and permissions

Copyright information

About this chapter

Cite this chapter

Haas, P.J. (2016). Data-Stream Sampling: Basic Techniques and Results. In: Garofalakis, M., Gehrke, J., Rastogi, R. (eds) Data Stream Management. Data-Centric Systems and Applications. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-28608-0_2

Download citation

DOI: https://doi.org/10.1007/978-3-540-28608-0_2
Published: 12 July 2016
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28607-3
Online ISBN: 978-3-540-28608-0
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics