Stream sampling over windows with worst-case optimality and $\ell$-overlap independence

Tao, Yufei; Hu, Xiaocheng; Qiao, Miao

doi:10.1007/s00778-017-0461-x

Regular Paper
Published: 03 April 2017

Volume 26, pages 493–510 (2017)

The VLDB Journal

Yufei Tao¹,
Xiaocheng Hu² &
Miao Qiao³

Abstract

Sampling provides fundamental support to numerous applications that cannot afford to materialize all the objects arriving at a rapid speed. Existing stream sampling algorithms guarantee small space and query overhead, but all require worst-case update time proportional to the number of samples. This creates a performance issue when a large sample set is required. In this paper, we propose a new sampling algorithm that is simultaneously optimal in all three aspects: space, query time, and update time. In particular, the algorithm handles an update in O(1) worst-case time with a very small hidden constant. Our algorithm also ensures a strong independence guarantee: the sample sets of all the queries are mutually independent as long as the overlap between two query windows is small.

Notes

Such intervals are sometimes termed “dyadic intervals” and are also used by the algorithms in [3, 10].
There cannot be two materializable buckets sharing the same highest level; otherwise, there would be a materializable bucket at an even higher level.
The value of s can be calculated in O(1) time as the difference between $Z^{\mathit{new}}_{\mathit{anc}}$ and the largest multiple of $2^{i-1} r$ that is at most $Z^{\mathit{new}}_{\mathit{anc}}$.
Salkind noted in the book Statistics for People Who (Think They) Hate Statistics that most researchers suggest that the number of repeats should be no less than 30 before the theorem can be applied.
If the ground truth is $\mathit{act}$, then the absolute relative error is $|\mathit{est} - \mathit{act}|/\mathit{act}$.
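
The two computations described in the third and fifth notes can be sketched in a few lines of Python. This is only an illustration: the function and parameter names (`offset_past_multiple`, `z_new_anc`, `absolute_relative_error`) are hypothetical, and the sketch assumes that the Z value is a non-negative integer, that $i \ge 1$ and $r \ge 1$, and that the ground truth is positive.

```python
def offset_past_multiple(z_new_anc: int, i: int, r: int) -> int:
    """Return s: z_new_anc minus the largest multiple of 2^(i-1) * r
    that is at most z_new_anc (assumes z_new_anc >= 0, i >= 1, r >= 1)."""
    step = (1 << (i - 1)) * r   # 2^(i-1) * r
    return z_new_anc % step     # same as z_new_anc - (z_new_anc // step) * step


def absolute_relative_error(est: float, act: float) -> float:
    """Return |est - act| / act for a positive ground truth act."""
    if act <= 0:
        raise ValueError("this sketch assumes a positive ground truth")
    return abs(est - act) / act
```

For instance, with $i = 3$ and $r = 4$ the multiples are spaced 16 apart, so a value of 37 lies 5 past the largest multiple (32); the remainder is obtained with a constant number of arithmetic operations, in line with the O(1) claim.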

References

1. Arasu, A., Babu, S., Widom, J.: The CQL continuous query language: semantic foundations and query execution. VLDB J. 15(2), 121–142 (2006)
2. Babcock, B., Datar, M., Motwani, R.: Sampling from a moving window over streaming data. In: SODA, pp. 633–634 (2002)
3. Braverman, V., Ostrovsky, R., Zaniolo, C.: Optimal sampling from sliding windows. JCSS 78(1), 260–272 (2012)
4. Chaudhuri, K., Mishra, N.: When random sampling preserves privacy. In: CRYPTO, pp. 198–213 (2006)
5. Chi, Y., Wang, H., Yu, P.S., Muntz, R.R.: Catch the moment: maintaining closed frequent itemsets over a data stream sliding window. Knowl. Inf. Syst. 10(3), 265–294 (2006)
6. Datar, M., Gionis, A., Indyk, P., Motwani, R.: Maintaining stream statistics over sliding windows. SIAM J. Comp. 31(6), 1794–1813 (2002)
7. Frahling, G., Indyk, P., Sohler, C.: Sampling in dynamic data streams and applications. Int. J. Comput. Geometry Appl. 18(1/2), 3–28 (2008)
8. Fuller, W.A.: Sampling Statistics. Wiley, New York (2009)
9. Gemulla, R., Lehner, W.: Deferred maintenance of disk-based random samples. In: EDBT, pp. 423–441 (2006)
10. Gemulla, R., Lehner, W.: Sampling time-based sliding windows in bounded space. In: SIGMOD, pp. 379–392 (2008)
11. Hu, X., Qiao, M., Tao, Y.: External memory stream sampling. In: PODS, pp. 229–239 (2015)
12. Lall, A., Sekar, V., Ogihara, M., Xu, J.J., Zhang, H.: Data streaming algorithms for estimating entropy of network traffic. In: SIGMETRICS, pp. 145–156 (2006)
13. Nath, S., Gibbons, P.B.: Online maintenance of very large random samples on flash storage. VLDB J. 19(1), 67–90 (2010)
14. Pavan, A., Tangwongsan, K., Tirthapura, S., Wu, K.: Counting and sampling triangles from a graph stream. PVLDB 6(14), 1870–1881 (2013)
15. Pol, A., Jermaine, C.M., Arumugam, S.: Maintaining very large random samples using the geometric file. VLDB J. 17(5), 997–1018 (2008)
16. Vitter, J.S.: Random sampling with a reservoir. ACM Trans. Math. Softw. 11(1), 37–57 (1985)

Acknowledgements

We would like to thank the anonymous reviewers for their insightful comments, suggestions for improving the paper, and the very interesting interaction. The review process was one of the best that we have ever experienced.

Author information

Authors and Affiliations

University of Queensland, Brisbane, QLD, Australia
Yufei Tao
Chinese University of Hong Kong, Shatin, Hong Kong
Xiaocheng Hu
Massey University, Palmerston North, New Zealand
Miao Qiao

Corresponding author

Correspondence to Yufei Tao.

Appendices

Appendix 1: Space lower bound for our problem of Sect. 2 under disjoint independence

We will need the following mathematical fact:

Lemma 11

Let x, y be any positive real values satisfying $x \ge y$ and $x \ge 1$. Then $1 - (1 - 1/x)^y = \varOmega (y/x)$.

Proof

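It is elementary to verify that, for any real value $z$, $1 + z \le e^z$, and that, for any real value $z \in [0, 1]$, $e^{-z} \le 1 - (1-1/e) z$. Applying the first inequality with $z = -1/x$ (note that $1 - 1/x \ge 0$ because $x \ge 1$) and the second with $z = y/x$ (which lies in $(0, 1]$ because $x \ge y$), we obtain: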

$$\begin{aligned} 1 - (1 - 1/x)^y\ge & {} 1 - e^{-y/x} \\\ge & {} 1 - \Big (1 - \Big (1 - \frac{1}{e}\Big ) \frac{y}{x}\Big ) = \Big (1 - \frac{1}{e}\Big ) \frac{y}{x}. \end{aligned}$$

$\square $
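
As a sanity check on Lemma 11, the following Python sketch evaluates both sides of the inequality derived in the proof, $1 - (1 - 1/x)^y \ge (1 - 1/e)\, y/x$, over a grid of integer pairs with $x \ge y \ge 1$. The function names and the grid size are arbitrary choices made for illustration; a numerical check of this kind does not replace the proof, and it covers only integer values even though the lemma holds for all positive reals in its range.

```python
import math


def lemma11_gap(x: float, y: float) -> float:
    """Left side minus right side of 1 - (1 - 1/x)^y >= (1 - 1/e) * y / x."""
    lhs = 1.0 - (1.0 - 1.0 / x) ** y
    rhs = (1.0 - 1.0 / math.e) * y / x
    return lhs - rhs


def smallest_gap_on_grid(max_x: int = 200) -> float:
    """Smallest gap over all integer pairs with 1 <= y <= x <= max_x."""
    worst = math.inf
    for x in range(1, max_x + 1):
        for y in range(1, x + 1):
            worst = min(worst, lemma11_gap(x, y))
    return worst
```

The gap is smallest when $y = x$ and $x$ is large, where $(1 - 1/x)^x$ approaches $1/e$ from below, so the constant $1 - 1/e$ cannot be improved by this argument.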

Let $\mathcal {A}_3$ be an algorithm solving our problem under $\ell = 0$. Suppose that $n \ge r$ stream elements have been received. Consider the i-th element $e_i$ where $i \in [1, n - r]$. Define a random variable $E_i$ to be 1 if $e_i$ is retained by $\mathcal {A}_3$ at this moment, or 0 otherwise. Motivated by Gemulla and Lehner, we look at the query with parameter $w = n - i + 1$. As each WR (with-replacement) sample of the query picks $e_i$ with probability $1/w$, $e_i$ is picked by at least one of its r samples with probability $1 - (1 - 1/w)^r$. Since $\mathcal {A}_3$ can return $e_i$ only if it has retained $e_i$, it follows that

$$\begin{aligned} \mathbf {Pr}[E_i = 1]\ge & {} 1 - \left( 1 - \frac{1}{n - i +1} \right) ^r \\ \text {(by Lemma 11)}= & {} \varOmega \left( \frac{r}{n-i}\right) . \end{aligned}$$
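
The probability $1 - (1 - 1/w)^r$ used above can be illustrated with a small Monte Carlo experiment in Python. The sketch below is not part of the paper's algorithm: it simply draws r with-replacement samples from a window of w positions and records how often one fixed position (standing in for $e_i$) is picked at least once. The function names, the seed, and the trial count are illustrative assumptions.

```python
import random


def exact_pick_probability(w: int, r: int) -> float:
    """Probability that a fixed element is hit by at least one of r WR samples."""
    return 1.0 - (1.0 - 1.0 / w) ** r


def simulated_pick_probability(w: int, r: int, trials: int, seed: int = 0) -> float:
    """Estimate the same probability by drawing r uniform positions per trial."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Position 0 plays the role of e_i, the oldest element in the window.
        if any(rng.randrange(w) == 0 for _ in range(r)):
            hits += 1
    return hits / trials
```

With `w = 50` and `r = 10`, the empirical frequency from `simulated_pick_probability(50, 10, 20000)` should land close to `exact_pick_probability(50, 10)`, up to sampling noise.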

Hence, the expected space used by $\mathcal {A}_3$ is at least

$$\begin{aligned} \sum _{i=1}^{n-r} \mathbf {E}[E_i] = \sum _{i=1}^{n-r} \varOmega \left( \frac{r}{n-i}\right) = \varOmega (r \log (n/r)). \end{aligned}$$

The worst-case space of $\mathcal {A}_3$ cannot be smaller, and thus, must also be $\varOmega (r \log (n/r))$.
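
To see the $\varOmega (r \log (n/r))$ growth concretely, the following Python sketch sums the exact per-element probabilities $1 - (1 - 1/(n-i+1))^r$ for $i \in [1, n-r]$ and compares the total with $(1 - 1/e)\, r \ln ((n+1)/(r+1))$. The comparison term is an assumption of this sketch rather than a formula from the paper: it follows from Lemma 11 together with the standard bound $\sum _{w=r+1}^{n} 1/w \ge \ln ((n+1)/(r+1))$. The function names are hypothetical.

```python
import math


def expected_retained_lower_bound(n: int, r: int) -> float:
    """Sum over i in [1, n - r] of 1 - (1 - 1/(n - i + 1))^r."""
    return sum(1.0 - (1.0 - 1.0 / (n - i + 1)) ** r
               for i in range(1, n - r + 1))


def logarithmic_term(n: int, r: int) -> float:
    """(1 - 1/e) * r * ln((n + 1) / (r + 1)), an explicit Omega(r log(n/r)) term."""
    return (1.0 - 1.0 / math.e) * r * math.log((n + 1) / (r + 1))
```

For example, with `n = 10000` and `r = 100`, `expected_retained_lower_bound(n, r)` is at least `logarithmic_term(n, r)`, and the same holds for other choices with $n \ge r \ge 1$.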

Appendix 2: Proof of Lemma 10

The lemma is trivial if $Z + 1 > y$; next, we assume $Z + 1 \le y$.

Consider first $i = 2$. Let $b_1, b_2$ be the level-1 buckets covered by b. All the elements in $b_1, b_2$ are directly retained. The lemma holds on b because Lemma 2 obtains $R_b[1]$ with a single random number generated after the entire $b_2$ has been received, i.e., at or after $n = y \ge Z + 1$.

Consider $i = j \ge 3$. Redefine $b_1, b_2$ as the level-($i-1$) buckets covered by b, whose size-1 sample sets are $R_{b_1}, R_{b_2}$, respectively. Inductively assume that the lemma holds on $R_{b_1}(Z)$ and $R_{b_2}(Z)$. Lemma 2 generates a random number, at or after $n = y$, to decide whether $R_b[1]$ equals $R_{b_1}[1]$ or $R_{b_2}[1]$. Hence, given the number, $R_b(Z)$ is fully determined by $R_{b_1}(Z)$ and $R_{b_2}(Z)$.
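
The inductive structure of this proof can be mirrored by a short Python sketch. Lemma 2 is not reproduced in this section, so the code below assumes the standard way of combining size-1 samples: the sample of a bucket is taken from one of its two equally sized children, chosen by a single random number. The names `merge_size1` and `bucket_sample` are hypothetical, and the sketch builds the sample offline from a complete list rather than incrementally from a stream.

```python
import random


def merge_size1(sample1, size1, sample2, size2, rng):
    """Size-1 uniform sample of the union of two disjoint sets, given a size-1
    uniform sample of each; uses exactly one random number."""
    return sample1 if rng.random() < size1 / (size1 + size2) else sample2


def bucket_sample(elements, r, rng):
    """Size-1 sample of a bucket at level i >= 2 holding 2^(i-1) * r elements."""
    n = len(elements)
    if n < 2 * r or n % (2 * r) != 0:
        raise ValueError("bucket size must be 2^(i-1) * r with i >= 2")
    if n == 2 * r:
        # Level 2: both level-1 children are retained in full, so a single
        # random number selects the sample directly.
        return elements[rng.randrange(n)]
    half = n // 2
    left = bucket_sample(elements[:half], r, rng)
    right = bucket_sample(elements[half:], r, rng)
    # Level i >= 3: one more random number decides which child's sample wins.
    return merge_size1(left, half, right, half, rng)
```

Under these assumptions every element of the bucket is returned with equal probability, and the only randomness added at level i is the final selection between the two child samples, which is the property the induction relies on.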

About this article

Cite this article

Tao, Y., Hu, X. & Qiao, M. Stream sampling over windows with worst-case optimality and $\ell $-overlap independence. The VLDB Journal 26, 493–510 (2017). https://doi.org/10.1007/s00778-017-0461-x

Received: 07 April 2016
Revised: 04 March 2017
Accepted: 21 March 2017
Published: 03 April 2017
Issue Date: August 2017
DOI: https://doi.org/10.1007/s00778-017-0461-x
