Estimating Rarity and Similarity over Data Stream Windows

  • Conference paper
Algorithms — ESA 2002 (ESA 2002)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2461)

Abstract

In the windowed data stream model, we observe items coming in over time. At any time $t$, we consider the window of the last $N$ observations $a_{t-(N-1)}, a_{t-(N-2)}, \ldots, a_t$, each $a_i \in \{1, \ldots, u\}$; we are required to support queries about the data in the window. A crucial restriction is that we are only allowed $o(N)$ (often polylogarithmic in $N$) storage space, so not all items within the window can be archived.
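To make the model concrete, here is a minimal Python sketch of an exact sliding window. It is only a reference baseline: it stores all $N$ items, which is exactly what the $o(N)$ space restriction rules out. The class name `ExactWindow` and its methods are hypothetical, introduced for illustration; they are not taken from the paper.

```python
from collections import deque


class ExactWindow:
    """Exact window over the last n observations (hypothetical baseline).

    Uses O(n) space, so it violates the o(N) restriction of the model;
    it only shows what a query over the window is supposed to return.
    """

    def __init__(self, n):
        # A bounded deque drops the oldest item once n items are stored.
        self.items = deque(maxlen=n)

    def observe(self, item):
        """Record the item arriving at the current time step."""
        self.items.append(item)

    def distinct(self):
        """Example query: number of distinct items currently in the window."""
        return len(set(self.items))


w = ExactWindow(3)
for a in [1, 1, 2, 3, 3]:
    w.observe(a)
print(list(w.items), w.distinct())
```

After the five observations only the last three remain in the window, so the query is answered over those items alone.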

We study two basic problems in the windowed data stream model. The first is the estimation of the rarity of items in the window. The second is the estimation of the similarity between two data stream windows using the Jaccard coefficient. The problems of estimating rarity and similarity have many applications in mining massive data sets. We present novel, simple algorithms for estimating rarity and similarity on windowed data streams, accurate up to a factor of $1 \pm \epsilon$ using space only logarithmic in the window size.
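The two quantities can be illustrated with a short Python sketch. The abstract does not define rarity, so the code assumes the usual reading: the fraction of distinct items in the window that occur exactly `alpha` times. The Jaccard coefficient is the size of the intersection of the two windows' item sets divided by the size of their union. The min-hash estimator shown is the standard technique associated with min-wise independent permutations, not the paper's windowed algorithm: it still reads a fully stored window, and its linear hash functions are only approximately min-wise. All function names, and the assumption that items are non-negative integers below $2^{61} - 1$, are illustrative.

```python
import random
from collections import Counter

_P = (1 << 61) - 1  # Mersenne prime used as the hash modulus (assumption)


def alpha_rarity(window, alpha=1):
    """Assumed definition: share of distinct items occurring exactly alpha times."""
    counts = Counter(window)
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == alpha) / len(counts)


def jaccard(window_a, window_b):
    """Exact Jaccard coefficient of the item sets of two windows."""
    a, b = set(window_a), set(window_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def make_hashes(k, seed=0):
    """Draw k linear hash functions x -> (a*x + b) mod _P, given as (a, b) pairs."""
    rng = random.Random(seed)
    return [(rng.randrange(1, _P), rng.randrange(_P)) for _ in range(k)]


def minhash_signature(window, hashes):
    """Minimum hash value of a non-empty window under each hash function."""
    items = set(window)
    return [min((a * x + b) % _P for x in items) for a, b in hashes]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of hash functions whose minima agree between the two windows."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


hashes = make_hashes(256, seed=1)
wa, wb = [1, 2, 3, 4, 5, 6], [4, 5, 6, 7, 8, 9]
print(alpha_rarity([1, 2, 2, 3, 3, 3]))
print(jaccard(wa, wb))
print(estimate_jaccard(minhash_signature(wa, hashes), minhash_signature(wb, hashes)))
```

Using more hash functions reduces the variance of the estimate. The hard part in the windowed setting, which this sketch does not attempt, is keeping such minima up to date as old items expire without storing the whole window.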

This work was done while the author was a DIMACS visitor.

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Datar, M., Muthukrishnan, S. (2002). Estimating Rarity and Similarity over Data Stream Windows. In: Möhring, R., Raman, R. (eds) Algorithms — ESA 2002. ESA 2002. Lecture Notes in Computer Science, vol 2461. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45749-6_31

  • DOI: https://doi.org/10.1007/3-540-45749-6_31

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-44180-9

  • Online ISBN: 978-3-540-45749-7

  • eBook Packages: Springer Book Archive
