Abstract
Sampling provides fundamental support to numerous applications that cannot afford to materialize all the objects arriving at a rapid speed. Existing stream sampling algorithms guarantee small space and query overhead, but all require worst-case update time proportional to the number of samples. This creates a performance issue when a large sample set is required. In this paper, we propose a new sampling algorithm that is simultaneously optimal in all three aspects: space, query time, and update time. In particular, the algorithm handles an update in O(1) worst-case time with a very small hidden constant. Our algorithm also ensures a strong independence guarantee: the sample sets of all the queries are mutually independent as long as the overlap between two query windows is small.
Notes
There cannot be two materializable buckets sharing the same highest level; otherwise, there would be a materializable bucket at an even higher level.
The value of s can be calculated in O(1) time as the difference between \(Z^{new}_{anc}\) and the largest multiple of \(2^{i-1} r\) not exceeding \(Z^{new}_{anc}\).
Salkind noted in the book entitled Statistics for People Who (Think They) Hate Statistics that most researchers suggest that the number of repeats should be no less than 30 before the theorem can be applied.
If the ground truth is \(act\) and the estimate is \(est\), then the absolute relative error is \(|est - act|/act\).
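The two arithmetic notes above can be sketched in a few lines of Python. This is only an illustration: the function and parameter names are hypothetical, and it assumes that \(Z^{new}_{anc}\), i, and r are integers with \(i \ge 1\) and \(r \ge 1\), and that the ground truth is positive.

```python
def offset_s(z_new_anc: int, i: int, r: int) -> int:
    """Difference between z_new_anc and the largest multiple of 2^(i-1) * r
    that does not exceed it. Names are illustrative, not from the paper."""
    span = (2 ** (i - 1)) * r                       # the quantity 2^{i-1} r
    largest_multiple = (z_new_anc // span) * span   # largest multiple <= z_new_anc
    return z_new_anc - largest_multiple             # constant number of operations


def absolute_relative_error(est: float, act: float) -> float:
    """|est - act| / act, assuming a positive ground truth act."""
    if act <= 0:
        raise ValueError("this sketch assumes a positive ground truth")
    return abs(est - act) / act
```

For non-negative integers the first function returns the same value as `z_new_anc % span`; the longer form is kept only to mirror the wording of the note.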
References
Arasu, A., Babu, S., Widom, J.: The CQL continuous query language: semantic foundations and query execution. VLDB J. 15(2), 121–142 (2006)
Babcock, B., Datar, M., Motwani, R.: Sampling from a moving window over streaming data. In: SODA, pp. 633–634 (2002)
Braverman, V., Ostrovsky, R., Zaniolo, C.: Optimal sampling from sliding windows. JCSS 78(1), 260–272 (2012)
Chaudhuri, K., Mishra, N.: When random sampling preserves privacy. In: CRYPTO, pp. 198–213 (2006)
Chi, Y., Wang, H., Yu, P.S., Muntz, R.R.: Catch the moment: maintaining closed frequent itemsets over a data stream sliding window. Knowl. Inf. Syst. 10(3), 265–294 (2006)
Datar, M., Gionis, A., Indyk, P., Motwani, R.: Maintaining stream statistics over sliding windows. SIAM J. Comp. 31(6), 1794–1813 (2002)
Frahling, G., Indyk, P., Sohler, C.: Sampling in dynamic data streams and applications. Int. J. Comput. Geometry Appl. 18(1/2), 3–28 (2008)
Fuller, W.A.: Sampling Statistics. Wiley, New York (2009)
Gemulla, R., Lehner, W.: Deferred maintenance of disk-based random samples. In: EDBT, pp. 423–441 (2006)
Gemulla, R., Lehner, W.: Sampling time-based sliding windows in bounded space. In: SIGMOD, pp. 379–392 (2008)
Hu, X., Qiao, M., Tao, Y.: External memory stream sampling. In: PODS, pp. 229–239 (2015)
Lall, A., Sekar, V., Ogihara, M., Xu, J.J., Zhang, H.: Data streaming algorithms for estimating entropy of network traffic. In: SIGMETRICS, pp. 145–156 (2006)
Nath, S., Gibbons, P.B.: Online maintenance of very large random samples on flash storage. VLDB J. 19(1), 67–90 (2010)
Pavan, A., Tangwongsan, K., Tirthapura, S., Wu, K.: Counting and sampling triangles from a graph stream. PVLDB 6(14), 1870–1881 (2013)
Pol, A., Jermaine, C.M., Arumugam, S.: Maintaining very large random samples using the geometric file. VLDB J. 17(5), 997–1018 (2008)
Vitter, J.S.: Random sampling with a reservoir. ACM Trans. Math. Softw. 11(1), 37–57 (1985)
Acknowledgements
We would like to thank the anonymous reviewers for their insightful comments, suggestions for improving the paper, and the very interesting interaction. The review process was one of the best that we have ever experienced.
Appendices
Appendix 1: Space lower bound for our problem of Sect. 2 under disjoint independence
We will need the following mathematical fact:
Lemma 11
Let x, y be any positive real values satisfying \(x \ge y\) and \(x \ge 1\). Then \(1 - (1 - 1/x)^y = \varOmega (y/x)\).
Proof
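It is straightforward to verify that \(1 + z \le e^z\) for any real value z, and that \(e^{-z} \le 1 - (1-1/e) z\) for any real value \(z \in [0, 1]\). Applying the first inequality with \(z = -1/x\) (note that \(1 - 1/x \ge 0\) because \(x \ge 1\)) and the second with \(z = y/x \in [0, 1]\), we obtain:
\[ 1 - (1 - 1/x)^y \ge 1 - e^{-y/x} \ge (1 - 1/e) \cdot y/x = \varOmega (y/x). \]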
\(\square \)
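As a quick sanity check of Lemma 11, the sketch below evaluates \(1 - (1 - 1/x)^y\) on a small grid and compares it against \((1 - 1/e) \cdot y/x\). The constant \(1 - 1/e\) is the one suggested by the second inequality quoted in the proof; the grid values, function names, and tolerance are arbitrary choices made for illustration.

```python
import math

C = 1 - 1 / math.e  # constant taken from the inequality e^{-z} <= 1 - (1 - 1/e) z


def lemma11_lhs(x: float, y: float) -> float:
    """Left-hand side of Lemma 11: 1 - (1 - 1/x)^y."""
    return 1 - (1 - 1 / x) ** y


def lemma11_holds(x: float, y: float, tol: float = 1e-12) -> bool:
    """Check 1 - (1 - 1/x)^y >= (1 - 1/e) * y / x, up to a float tolerance."""
    return lemma11_lhs(x, y) >= C * y / x - tol


def check_grid(xs, ys) -> bool:
    """Test every pair satisfying the lemma's preconditions x >= y > 0 and x >= 1."""
    return all(
        lemma11_holds(x, y)
        for x in xs
        for y in ys
        if x >= y > 0 and x >= 1
    )
```

A numerical check of this kind does not replace the proof; it only guards against a misread constant or exponent.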
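Let \(\mathcal {A}_3\) be an algorithm solving our problem under \(\ell = 0\). Suppose that \(n \ge r\) stream elements have been received. Consider the i-th element \(e_i\) where \(i \in [1, n - r]\). Define a random variable \(E_i\) to be 1 if \(e_i\) is retained by \(\mathcal {A}_3\) at this moment, or 0 otherwise. Motivated by Gemulla and Lehner, we look at the query with parameter \(w = n - i + 1\). As each WR sample of the query picks \(e_i\) with probability \(1/w\), \(e_i\) is picked by at least one of its r samples with probability \(1 - (1 - 1/w)^r\). Since \(e_i\) must be retained in order to be picked, and \(w \ge r + 1\), it thus follows from Lemma 11 that
\[ \mathbf {E}[E_i] = \Pr [E_i = 1] \ge 1 - (1 - 1/w)^r = \varOmega (r/w). \]
Hence, the expected space used by \(\mathcal {A}_3\) is at least
\[ \sum _{i=1}^{n-r} \mathbf {E}[E_i] = \varOmega \left( \sum _{i=1}^{n-r} \frac{r}{n-i+1} \right) = \varOmega \left( r \sum _{w=r+1}^{n} \frac{1}{w} \right) = \varOmega (r \log (n/r)). \]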
The worst-case space of \(\mathcal {A}_3\) cannot be smaller, and thus, must also be \(\varOmega (r \log (n/r))\).
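The growth rate of this bound can be illustrated numerically. The sketch below sums the per-element probabilities \(1 - (1 - 1/w)^r\) for \(w = n - i + 1\) over \(i \in [1, n - r]\), assuming (by linearity of expectation) that this sum lower-bounds the expected number of retained elements, and compares it with \((1 - 1/e) \cdot r \ln ((n+1)/(r+1))\). The reference expression is an assumption of this sketch: it combines Lemma 11 with the standard integral bound on a harmonic sum. The function names are hypothetical.

```python
import math


def expected_retained_lower_bound(n: int, r: int) -> float:
    """Sum over i in [1, n - r] of 1 - (1 - 1/w)^r, where w = n - i + 1."""
    if not (n >= r >= 1):
        raise ValueError("this sketch assumes n >= r >= 1")
    total = 0.0
    for i in range(1, n - r + 1):
        w = n - i + 1                    # window size of the query that reaches e_i
        total += 1 - (1 - 1 / w) ** r    # probability that e_i is among the r samples
    return total


def reference_bound(n: int, r: int) -> float:
    """(1 - 1/e) * r * ln((n + 1) / (r + 1)), which grows as r log(n/r)."""
    return (1 - 1 / math.e) * r * math.log((n + 1) / (r + 1))
```

For example, comparing `expected_retained_lower_bound(1000, 10)` with `reference_bound(1000, 10)` shows the sum staying above the logarithmic reference, in line with the \(\varOmega (r \log (n/r))\) claim.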
Appendix 2: Proof of Lemma 10
The lemma is trivial if \(Z + 1 > y\); next, we assume \(Z + 1 \le y\).
Consider first \(i = 2\). Let \(b_1, b_2\) be the level-1 buckets covered by b. All the elements in \(b_1, b_2\) are directly retained. The lemma holds on b because Lemma 2 obtains \(R_b[1]\) with a single random number generated after the entire \(b_2\) has been received, i.e., at or after \(n = y \ge Z + 1\).
Consider \(i = j \ge 3\). Redefine \(b_1, b_2\) as the level-(\(i-1\)) buckets covered by b, whose size-1 sample sets are \(R_{b_1}, R_{b_2}\), respectively. Inductively assume that the lemma holds on \(R_{b_1}(Z)\) and \(R_{b_2}(Z)\). Lemma 2 generates a random number, at or after \(n = y\), to decide whether \(R_b[1]\) equals \(R_{b_1}[1]\) or \(R_{b_2}[1]\). Hence, given the number, \(R_b(Z)\) is fully determined by \(R_{b_1}(Z)\) and \(R_{b_2}(Z)\).
About this article
Cite this article
Tao, Y., Hu, X. & Qiao, M. Stream sampling over windows with worst-case optimality and \(\ell \)-overlap independence. The VLDB Journal 26, 493–510 (2017). https://doi.org/10.1007/s00778-017-0461-x
DOI: https://doi.org/10.1007/s00778-017-0461-x