Methods for finding frequent items in data streams

Cormode, Graham; Hadjieleftheriou, Marios

doi:10.1007/s00778-009-0172-z

Special Issue Paper
Published: 01 December 2009

Volume 19, pages 3–20, (2010)
The VLDB Journal

Graham Cormode¹ &
Marios Hadjieleftheriou¹

Abstract

The frequent items problem is to process a stream of items and find all items occurring more than a given fraction of the time. It is one of the most heavily studied problems in data stream mining, dating back to the 1980s. Many applications rely directly or indirectly on finding the frequent items, and implementations are in use in large scale industrial systems. However, there has not been much comparison of the different methods under uniform experimental conditions. It is common to find papers touching on this topic in which important related work is mischaracterized, overlooked, or reinvented. In this paper, we aim to present the most important algorithms for this problem in a common framework. We have created baseline implementations of the algorithms and used these to perform a thorough experimental study of their properties. We give empirical evidence that there is considerable variation in the performance of frequent items algorithms. The best methods can be implemented to find frequent items with high accuracy using only tens of kilobytes of memory, at rates of millions of items per second on cheap modern hardware.
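
To make the problem statement concrete, the following is a minimal sketch of one classical counter-based approach, the Misra-Gries summary, which keeps at most k - 1 counters and is guaranteed to retain every item occurring more than n/k times in a stream of length n. It is offered only as an illustration under stated assumptions and is not the authors' baseline implementation: the function names `misra_gries` and `frequent_items`, the parameter `k`, and the sample stream are all hypothetical, and the exact verification step assumes the stream can be read a second time.

```python
from collections import Counter


def misra_gries(stream, k):
    """One pass with at most k - 1 counters (assumes k >= 2).

    Every item occurring more than n/k times in a stream of length n
    is guaranteed to be among the returned keys; other keys may be
    false positives, and the stored counts are underestimates.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping any that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


def frequent_items(items, k):
    """Illustrative two-pass variant: candidates, then exact verification.

    Assumes `items` is a sequence that can be traversed twice, which a
    true one-pass stream would not allow.
    """
    candidates = misra_gries(items, k)
    exact = Counter(x for x in items if x in candidates)
    n = len(items)
    # Integer comparison count * k > n avoids floating-point thresholds.
    return {x: c for x, c in exact.items() if c * k > n}


if __name__ == "__main__":
    stream = list("abacabadabaa")  # n = 12; 'a' occurs 7 times
    print(frequent_items(stream, 3))  # {'a': 7}
```

With k = 3 the summary holds two counters, so memory depends on the threshold rather than on the stream length. The second pass is there only to remove false positives; a strictly one-pass setting would report the candidates together with their approximate counts.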

Author information

Authors and Affiliations

AT&T Labs–Research, Florham Park, NJ, USA
Graham Cormode & Marios Hadjieleftheriou

Corresponding author

Correspondence to Graham Cormode.

About this article

Cite this article

Cormode, G., Hadjieleftheriou, M. Methods for finding frequent items in data streams. The VLDB Journal 19, 3–20 (2010). https://doi.org/10.1007/s00778-009-0172-z

Received: 15 January 2009
Revised: 14 October 2009
Accepted: 01 November 2009
Published: 01 December 2009
Issue Date: February 2010
DOI: https://doi.org/10.1007/s00778-009-0172-z
