
Stream sampling over windows with worst-case optimality and \(\ell \)-overlap independence

  • Regular Paper
  • The VLDB Journal

Abstract

Sampling provides fundamental support to numerous applications that cannot afford to materialize all the objects arriving at high speed. Existing stream sampling algorithms guarantee small space and query overhead, but they all require worst-case update time proportional to the number of samples. This creates a performance issue when a large sample set is required. In this paper, we propose a new sampling algorithm that is optimal simultaneously in all three aspects: space, query time, and update time. In particular, the algorithm handles an update in O(1) worst-case time with a very small hidden constant. Our algorithm also ensures a strong independence guarantee: the sample sets of all queries are mutually independent as long as the overlap between any two query windows is small.

Notes

  1. Such intervals are sometimes termed “dyadic intervals” and are also used by the algorithms in [3, 10].

  2. There cannot be two materializable buckets sharing the same highest level; otherwise, there would be a materializable bucket at an even higher level.

  3. The value of \(s\) can be calculated in O(1) time as the difference between \(Z^{new}_{anc}\) and the largest multiple of \(2^{i-1} r\) that is at most \(Z^{new}_{anc}\) (see the sketch after these notes).

  4. Salkind noted in his book Statistics for People Who (Think They) Hate Statistics that most researchers suggest that the number of repeats should be at least 30 before the theorem can be applied.

  5. If the ground truth is \( act \), then the absolute relative error is \(| est - act |/ act \).
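
For concreteness, the O(1) computation in Note 3 can be sketched as follows. This is our own illustration rather than code from the paper; the names offset_s, z_anc_new, i, and r are placeholders for \(s\), \(Z^{new}_{anc}\), the bucket level, and the sample count.

    # Minimal sketch of Note 3: s is the distance from z_anc_new down to the
    # largest multiple of 2^(i-1) * r that does not exceed it; O(1) arithmetic.
    def offset_s(z_anc_new: int, i: int, r: int) -> int:
        step = (1 << (i - 1)) * r                  # 2^(i-1) * r
        largest_multiple = (z_anc_new // step) * step
        return z_anc_new - largest_multiple        # equivalently z_anc_new % step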

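Likewise, the error metric in Note 5 amounts to the one-liner below (the function name is ours; \( act \) is assumed to be nonzero).

    # Absolute relative error from Note 5: |est - act| / act.
    def absolute_relative_error(est: float, act: float) -> float:
        return abs(est - act) / act
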
References

  1. Arasu, A., Babu, S., Widom, J.: The CQL continuous query language: semantic foundations and query execution. VLDB J. 15(2), 121–142 (2006)

  2. Babcock, B., Datar, M., Motwani, R.: Sampling from a moving window over streaming data. In: SODA, pp. 633–634 (2002)

  3. Braverman, V., Ostrovsky, R., Zaniolo, C.: Optimal sampling from sliding windows. JCSS 78(1), 260–272 (2012)

  4. Chaudhuri, K., Mishra, N.: When random sampling preserves privacy. In: CRYPTO, pp. 198–213 (2006)

  5. Chi, Y., Wang, H., Yu, P.S., Muntz, R.R.: Catch the moment: maintaining closed frequent itemsets over a data stream sliding window. Knowl. Inf. Syst. 10(3), 265–294 (2006)

  6. Datar, M., Gionis, A., Indyk, P., Motwani, R.: Maintaining stream statistics over sliding windows. SIAM J. Comput. 31(6), 1794–1813 (2002)

  7. Frahling, G., Indyk, P., Sohler, C.: Sampling in dynamic data streams and applications. Int. J. Comput. Geometry Appl. 18(1/2), 3–28 (2008)

  8. Fuller, W.A.: Sampling Statistics. Wiley, New York (2009)

  9. Gemulla, R., Lehner, W.: Deferred maintenance of disk-based random samples. In: EDBT, pp. 423–441 (2006)

  10. Gemulla, R., Lehner, W.: Sampling time-based sliding windows in bounded space. In: SIGMOD, pp. 379–392 (2008)

  11. Hu, X., Qiao, M., Tao, Y.: External memory stream sampling. In: PODS, pp. 229–239 (2015)

  12. Lall, A., Sekar, V., Ogihara, M., Xu, J.J., Zhang, H.: Data streaming algorithms for estimating entropy of network traffic. In: SIGMETRICS, pp. 145–156 (2006)

  13. Nath, S., Gibbons, P.B.: Online maintenance of very large random samples on flash storage. VLDB J. 19(1), 67–90 (2010)

  14. Pavan, A., Tangwongsan, K., Tirthapura, S., Wu, K.: Counting and sampling triangles from a graph stream. PVLDB 6(14), 1870–1881 (2013)

  15. Pol, A., Jermaine, C.M., Arumugam, S.: Maintaining very large random samples using the geometric file. VLDB J. 17(5), 997–1018 (2008)

  16. Vitter, J.S.: Random sampling with a reservoir. ACM Trans. Math. Softw. 11(1), 37–57 (1985)

Acknowledgements

We would like to thank the anonymous reviewers for their insightful comments, suggestions for improving the paper, and the very interesting interaction. The review process was one of the best that we have ever experienced.

Author information

Corresponding author

Correspondence to Yufei Tao.

Appendices

Appendix 1: Space lower bound for our problem of Sect. 2 under disjoint independence

We will need the following mathematical fact:

Lemma 11

Let \(x, y\) be any positive real values satisfying \(x \ge y\) and \(x \ge 1\). Then \(1 - (1 - 1/x)^y = \varOmega (y/x)\).

Proof

It is elementary to verify that \(1 + z \le e^z\) for any real value \(z\), and that \(e^{-z} \le 1 - (1-1/e) z\) for any real value \(z \in [0, 1]\). Since \(x \ge y\) implies \(y/x \in [0, 1]\), we therefore have:

$$\begin{aligned} 1 - (1 - 1/x)^y&\ge 1 - e^{-y/x} \\&\ge 1 - \Big (1 - \Big (1 - \frac{1}{e}\Big ) \frac{y}{x}\Big ) = \Big (1 - \frac{1}{e}\Big ) \frac{y}{x}. \end{aligned}$$

\(\square \)
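
The proof in fact yields the explicit constant \(1 - 1/e\). The following snippet is a quick numerical spot-check of that bound; it is our own illustration (with arbitrarily chosen sampling ranges), not part of the paper.

    # Spot-check: 1 - (1 - 1/x)^y >= (1 - 1/e) * y / x for x >= y > 0 and x >= 1.
    import math
    import random

    random.seed(0)
    for _ in range(100_000):
        x = random.uniform(1.0, 1e4)
        y = random.uniform(0.0, x)
        lhs = 1.0 - (1.0 - 1.0 / x) ** y
        rhs = (1.0 - 1.0 / math.e) * y / x
        assert lhs >= rhs - 1e-12, (x, y, lhs, rhs)
    print("Lemma 11 bound held on all sampled (x, y) pairs")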

Let \(\mathcal {A}_3\) be an algorithm solving our problem under \(\ell = 0\). Suppose that \(n \ge r\) stream elements have been received. Consider the i-th element \(e_i\) where \(i \in [1, n - r]\). Define a random variable \(E_i\) to be 1 if \(e_i\) is retained by \(\mathcal {A}_3\) at this moment, or 0 otherwise. Motivated by Gemulla and Lehner, we look at the query with parameter \(w = n - i + 1\). As each WR sample of the query picks \(e_i\) with probability \(1/w\), \(e_i\) is picked by at least one of its r samples with probability \(1 - (1 - 1/w)^r\). It thus follows that

$$\begin{aligned} \mathbf {Pr}[E_i = 1]&\ge 1 - \left( 1 - \frac{1}{n - i + 1} \right)^r \\&= \varOmega \left( \frac{r}{n-i}\right) \qquad \text {(by Lemma 11)}. \end{aligned}$$

Hence, the expected space used by \(\mathcal {A}_3\) is at least

$$\begin{aligned} \sum _{i=1}^{n-r} \mathbf {E}[E_i] = \sum _{i=1}^{n-r} \varOmega \left( \frac{r}{n-i}\right) = \varOmega (r \log (n/r)). \end{aligned}$$

The worst-case space of \(\mathcal {A}_3\) cannot be smaller than its expected space, and thus must also be \(\varOmega (r \log (n/r))\).
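
As a rough numerical illustration of this bound (our own, with arbitrarily chosen values of n and r), the sum of the retention probabilities \(1 - (1 - \frac{1}{n-i+1})^r\) indeed grows like \(r \ln (n/r)\):

    # Compare the expected number of retained elements against r * ln(n/r);
    # the two agree up to lower-order terms.
    import math

    def expected_retained(n: int, r: int) -> float:
        return sum(1.0 - (1.0 - 1.0 / (n - i + 1)) ** r for i in range(1, n - r + 1))

    for n, r in [(10_000, 10), (10_000, 100), (100_000, 100)]:
        print(n, r, round(expected_retained(n, r), 1), round(r * math.log(n / r), 1))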

Appendix 2: Proof of Lemma 10

The lemma is trivial if \(Z + 1 > y\); next, we assume \(Z + 1 \le y\).

Consider first \(i = 2\). Let \(b_1, b_2\) be the level-1 buckets covered by \(b\). All the elements in \(b_1, b_2\) are directly retained. The lemma holds on \(b\) because Lemma 2 obtains \(R_b[1]\) with a single random number generated after the entire \(b_2\) has been received, i.e., at or after \(n = y \ge Z + 1\).

Now consider \(i \ge 3\). Redefine \(b_1, b_2\) as the level-(\(i-1\)) buckets covered by \(b\), whose size-1 sample sets are \(R_{b_1}, R_{b_2}\), respectively. Inductively assume that the lemma holds on \(R_{b_1}(Z)\) and \(R_{b_2}(Z)\). Lemma 2 generates a random number, at or after \(n = y\), to decide whether \(R_b[1]\) equals \(R_{b_1}[1]\) or \(R_{b_2}[1]\). Hence, given that number, \(R_b(Z)\) is fully determined by \(R_{b_1}(Z)\) and \(R_{b_2}(Z)\), which completes the inductive step. \(\square \)
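
To make the merging step concrete: because \(b_1\) and \(b_2\) cover equally many elements, a uniform size-1 sample of \(b\) can be obtained from their size-1 samples with a single random number, as in the minimal sketch below (our own illustration; it does not reproduce the exact interface of Lemma 2).

    # Promote one child's size-1 sample with probability 1/2 each; since b1 and
    # b2 have equal size, the result is a uniform size-1 sample of b.
    import random

    def merge_single_samples(sample_b1, sample_b2):
        return sample_b1 if random.random() < 0.5 else sample_b2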

About this article

Cite this article

Tao, Y., Hu, X. & Qiao, M. Stream sampling over windows with worst-case optimality and \(\ell \)-overlap independence. The VLDB Journal 26, 493–510 (2017). https://doi.org/10.1007/s00778-017-0461-x
