Data-Stream Sampling: Basic Techniques and Results

Part of the book series: Data-Centric Systems and Applications (DCSA)

Abstract

Perhaps the most basic synopsis of a data stream is a sample of elements from the stream. A key benefit of such a sample is its flexibility: the sample can serve as input to a wide variety of analytical procedures and can be reduced further to provide many additional data synopses. If, in particular, the sample is collected using random sampling techniques, then the sample can form a basis for statistical inference about the contents of the stream. This chapter surveys some basic sampling and inference techniques for data streams. We focus on general methods for materializing a sample; later chapters provide specialized sampling methods for specific analytic tasks.


Author information

Corresponding author

Correspondence to Peter J. Haas.

Copyright information

© 2016 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Haas, P.J. (2016). Data-Stream Sampling: Basic Techniques and Results. In: Garofalakis, M., Gehrke, J., Rastogi, R. (eds) Data Stream Management. Data-Centric Systems and Applications. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-28608-0_2

  • DOI: https://doi.org/10.1007/978-3-540-28608-0_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-28607-3

  • Online ISBN: 978-3-540-28608-0

  • eBook Packages: Computer Science, Computer Science (R0)
