Abstract
In the windowed data stream model, we observe items coming in over time. At any time $t$, we consider the window of the last $N$ observations $a_{t-(N-1)}, a_{t-(N-2)}, \ldots, a_t$, each $a_i \in \{1, \ldots, u\}$; we are required to support queries about the data in the window. A crucial restriction is that we are only allowed $o(N)$ (often polylogarithmic in $N$) storage space, so not all items within the window can be archived.
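To make the model concrete, here is a minimal Python sketch of an exact sliding window. It is a baseline for illustration only, not a method from the paper: it stores all N items, which is exactly what the o(N) space restriction rules out. The class name `ExactWindow` and its methods are hypothetical.

```python
from collections import Counter, deque


class ExactWindow:
    """Exact window over the last n items. Uses O(n) space (illustrative baseline)."""

    def __init__(self, n):
        self.n = n
        self.items = deque()     # the last n observations, oldest first
        self.counts = Counter()  # occurrences of each item inside the window

    def add(self, item):
        """Append a new observation and expire the oldest one if the window is full."""
        self.items.append(item)
        self.counts[item] += 1
        if len(self.items) > self.n:
            old = self.items.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def distinct(self):
        """Number of distinct items currently in the window."""
        return len(self.counts)
```

A small-space algorithm has to answer the same kind of query without keeping `items` at all, which is why expiring old observations becomes the hard part.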
We study two basic problems in the windowed data stream model. The first is estimating the rarity of items in the window. The second is estimating the similarity between two data stream windows using the Jaccard coefficient. Both problems have many applications in mining massive data sets. We present novel, simple algorithms for estimating rarity and similarity on windowed data streams, accurate to within a factor of $1 \pm \epsilon$ using space only logarithmic in the window size.
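The two quantities can be illustrated with a short Python sketch. The abstract does not define rarity, so the code assumes the common definition of α-rarity: the fraction of distinct items in the window that occur exactly α times. The estimator shown is plain min-hashing over fully stored, non-empty windows of integer items, using random linear hash functions as a stand-in for min-wise independent permutations. It is not the small-space sliding-window algorithm the paper presents, and all function names are hypothetical.

```python
import random
from collections import Counter

P = (1 << 61) - 1  # Mersenne prime; items are assumed to be integers in [0, P)


def jaccard(a, b):
    """Exact Jaccard coefficient |A ∩ B| / |A ∪ B| of the distinct items in a and b."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def rarity(window, alpha):
    """Assumed definition: fraction of distinct items occurring exactly alpha times."""
    counts = Counter(window)
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == alpha) / len(counts)


def minhash_signature(items, coeffs):
    """Minimum hash value of the (non-empty) item set under each linear hash."""
    distinct = set(items)
    return [min((a * x + b) % P for x in distinct) for a, b in coeffs]


def minhash_jaccard(a, b, k=256, seed=0):
    """Estimate the Jaccard coefficient as the fraction of k hashes whose minima agree."""
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(k)]
    sig_a = minhash_signature(a, coeffs)
    sig_b = minhash_signature(b, coeffs)
    return sum(x == y for x, y in zip(sig_a, sig_b)) / k
```

For example, `jaccard([1, 2, 2, 3], [2, 3, 4])` returns 0.5, since the windows share 2 of 4 distinct items. Increasing `k` tightens the min-hash estimate at the cost of more stored minima.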
This work was done while the author was a DIMACS visitor.
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Datar, M., Muthukrishnan, S. (2002). Estimating Rarity and Similarity over Data Stream Windows. In: Möhring, R., Raman, R. (eds) Algorithms — ESA 2002. ESA 2002. Lecture Notes in Computer Science, vol 2461. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45749-6_31
DOI: https://doi.org/10.1007/3-540-45749-6_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44180-9
Online ISBN: 978-3-540-45749-7
eBook Packages: Springer Book Archive