Methods for finding frequent items in data streams

  • Special Issue Paper
The VLDB Journal

Abstract

The frequent items problem is to process a stream of items and find all items occurring more than a given fraction of the time. It is one of the most heavily studied problems in data stream mining, dating back to the 1980s. Many applications rely directly or indirectly on finding the frequent items, and implementations are in use in large-scale industrial systems. However, there has not been much comparison of the different methods under uniform experimental conditions. It is common to find papers touching on this topic in which important related work is mischaracterized, overlooked, or reinvented. In this paper, we aim to present the most important algorithms for this problem in a common framework. We have created baseline implementations of the algorithms and used these to perform a thorough experimental study of their properties. We give empirical evidence that there is considerable variation in the performance of frequent items algorithms. The best methods can be implemented to find frequent items with high accuracy using only tens of kilobytes of memory, at rates of millions of items per second on cheap modern hardware.
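To make the problem statement concrete, here is a minimal sketch of one well-known counter-based approach, usually attributed to Misra and Gries. It is an illustration only and not the baseline implementation described in the paper. The function names `misra_gries` and `frequent_items`, the parameter `k`, and the second verification pass over a stored list are assumptions made for this example.

```python
def misra_gries(stream, k):
    """One pass over the stream, keeping at most k - 1 counters.

    Any item occurring more than n/k times (n = stream length) is
    guaranteed to be among the returned keys. The stored counts are
    underestimates, and some returned keys may not be frequent.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


def frequent_items(stream, k):
    """Exact answer via a second pass (assumes the stream can be replayed)."""
    candidates = misra_gries(stream, k)
    n = len(stream)
    exact = {c: 0 for c in candidates}
    for item in stream:
        if item in exact:
            exact[item] += 1
    # Keep items with count > n/k, using integer arithmetic.
    return {c: cnt for c, cnt in exact.items() if cnt * k > n}


stream = ["a", "b", "a", "c", "a", "a", "d", "a"]
print(frequent_items(stream, 3))  # items occurring more than 8/3 times
```

In a true streaming setting the second pass is often unavailable, in which case the candidate set and its approximate counts from the first pass are reported directly.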

Author information

Corresponding author

Correspondence to Graham Cormode.

About this article

Cite this article

Cormode, G., Hadjieleftheriou, M. Methods for finding frequent items in data streams. The VLDB Journal 19, 3–20 (2010). https://doi.org/10.1007/s00778-009-0172-z

  • DOI: https://doi.org/10.1007/s00778-009-0172-z
