
Finding the frequent items in streams of data

Published: 01 October 2009 in Communications of the ACM, Volume 52, Issue 10

Abstract

Many data generation processes can be modeled as data streams. They produce huge numbers of pieces of data, each of which is simple in isolation, but which taken together lead to a complex whole. For example, the sequence of queries posed to an Internet search engine can be thought of as a stream, as can the collection of transactions across all branches of a supermarket chain. In aggregate, this data can arrive at enormous rates, easily in the realm of hundreds of gigabytes per day or higher. While this data may be archived and indexed within a data warehouse, it is also important to process the data "as it happens," to provide up-to-the-minute analysis and statistics on current trends. Methods to achieve this must be quick to respond to each new piece of information, and use resources which are very small when compared to the total quantity of data.
These applications and others like them have led to the formulation of the so-called "streaming model." In this abstraction, algorithms take only a single pass over their input, and must accurately compute various functions while using resources (space and time per item) that are strictly sublinear in the size of the input---ideally, polynomial in the logarithm of the input size. The output must be produced at the end of the stream, or when queried on the prefix of the stream that has been observed so far. (Other variations ask for the output to be maintained continuously in the presence of updates, or on a "sliding window" of only the most recent updates.) Some problems are simple in this model: for example, given a stream of transactions, finding the mean and standard deviation of the bill totals can be accomplished by retaining a few "sufficient statistics" (sum of all values, sum of squared values, etc.). Others can be shown to require a large amount of information to be stored, such as determining whether a particular search query has already appeared anywhere within a large stream of queries. Determining which problems can be solved effectively within this model remains an active research area.
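To make the "sufficient statistics" idea concrete, here is a minimal one-pass sketch in Python (an illustration, not code from the paper): only the count, the sum, and the sum of squared values are retained, yet the mean and standard deviation of everything seen so far can be reported at any time.

    import math

    class RunningStats:
        """One-pass mean and standard deviation via sufficient statistics:
        the count, the sum of values, and the sum of squared values."""

        def __init__(self):
            self.n = 0
            self.total = 0.0
            self.total_sq = 0.0

        def update(self, x):
            # Constant work per stream item; no individual values are stored.
            self.n += 1
            self.total += x
            self.total_sq += x * x

        def mean(self):
            return self.total / self.n

        def stddev(self):
            # Population standard deviation: sqrt(E[x^2] - E[x]^2).
            m = self.mean()
            return math.sqrt(max(self.total_sq / self.n - m * m, 0.0))

    # Example: a stream of (hypothetical) bill totals.
    stats = RunningStats()
    for bill in [12.50, 7.25, 99.99, 3.10]:
        stats.update(bill)
    print(stats.mean(), stats.stddev())
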
The frequent items problem (also known as the heavy hitters problem) is one of the most heavily studied questions in data streams. The problem is popular due to its simplicity to state, and its intuitive interest and value. It is important both in itself, and as a subroutine within more advanced data stream computations. Informally, given a sequence of items, the problem is simply to find those items which occur most frequently. Typically, this is formalized as finding all items whose frequency exceeds a specified fraction of the total number of items. This is shown in Figure 1. Variations arise when the items are given weights, and further when these weights can also be negative.
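As a minimal illustration of the problem, and of how little working state a solution can need, the following Python sketch follows the classic counter-based Misra-Gries strategy; the parameter k and the example stream are purely illustrative, and this is not the paper's reference implementation.

    def misra_gries(stream, k):
        """Counter-based summary: keep at most k-1 (item, count) pairs.
        Every item whose true frequency exceeds n/k is guaranteed to be
        among the survivors, though its stored count may be an underestimate."""
        counters = {}
        for item in stream:
            if item in counters:
                counters[item] += 1
            elif len(counters) < k - 1:
                counters[item] = 1
            else:
                # Decrement every counter; drop those that reach zero.
                for key in list(counters):
                    counters[key] -= 1
                    if counters[key] == 0:
                        del counters[key]
        return counters

    # Items occurring more than len(stream)/k times are reported; stored
    # counts undercount the true frequencies by at most len(stream)/k.
    print(misra_gries("abacabadabacaba", k=3))
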
This abstract problem captures a wide variety of settings. The items can represent packets on the Internet, with the weights being the sizes of the packets. Then the frequent items represent the most popular destinations, or the heaviest bandwidth users (depending on how the items are extracted from the flow identifiers). This knowledge can help in optimizing routing decisions, in-network caching, and planning where to add new capacity. Or, the items can represent queries made to an Internet search engine, and the frequent items are now the (currently) popular terms. These are not simply hypothetical examples, but genuine cases where algorithms for this problem have been applied by large corporations: AT&T and Google, respectively. Given the size of the data (which is being generated at high speed), it is important to find algorithms which are capable of processing each new update very quickly, without blocking. It also helps if the working space of the algorithm is very small, so that the analysis can happen over many different groups in parallel, and because small structures are likely to have better cache behavior and hence further increase throughput.
Obtaining efficient and scalable solutions to the frequent items problem is also important since many streaming applications need to find frequent items as a subroutine of another, more complex computation. Most directly, mining frequent itemsets inherently builds on finding frequent items as a basic building block. Finding the entropy of a stream requires learning the most frequent items in order to compute their contribution to the entropy directly, and to remove it before approximating the entropy of the residual stream. The HSS (Hierarchical Sampling from Sketches) technique uses hashing to derive multiple substreams, the frequent elements of which are extracted to estimate the frequency moments of the stream. The frequent items problem is also related to the recently popular area of Compressed Sensing.
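The substream idea can be sketched roughly as follows (an illustration only: the hash function, the grouping, and the exact counters stand in for the small-space summaries that HSS actually uses): each item is hashed to one of several groups, and the frequent elements of each group are then extracted.

    from collections import Counter
    import hashlib

    def substream_frequent_items(stream, num_substreams, phi):
        """Hash each item into one of num_substreams groups, then report the
        frequent elements of each group. Exact Counters are used only for
        clarity; a small-space frequent-items summary would replace them."""
        groups = [Counter() for _ in range(num_substreams)]
        for item in stream:
            # A fixed hash (rather than Python's per-run salted hash) keeps
            # the substream assignment stable.
            h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16)
            groups[h % num_substreams][item] += 1
        # Within each substream, keep items above a phi fraction of that
        # substream's total count.
        return [{x: c for x, c in g.items() if c > phi * sum(g.values())}
                for g in groups]

    print(substream_frequent_items("abacabadabacaba", num_substreams=2, phi=0.4))
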
Other work solves generalized versions of the frequent items problem by building on algorithms for the "vanilla" version of the problem. Several techniques for finding the frequent items in a "sliding window" of recent updates (instead of all updates) operate by keeping track of the frequent items in many sub-windows. In the "heavy hitters distinct" problem, with applications to detecting network scanning attacks, the count of an item is the number of distinct pairs containing that item paired with a secondary item. It is typically solved by extending a frequent items algorithm with distinct counting algorithms. Frequent items algorithms have also been applied to models of probabilistic streaming data, and within faster "skipping" techniques.
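A rough sketch of the "heavy hitters distinct" idea follows (illustrative only: exact sets stand in for the approximate distinct counters that a space-bounded solution would use, and the names and threshold are made up): for each primary item, count the distinct secondary items it appears with, and report the primaries whose distinct count is a large fraction of the total number of distinct pairs.

    from collections import defaultdict

    def heavy_hitters_distinct(pairs, phi):
        """For each primary item, count the distinct secondary items paired
        with it, and report primaries whose distinct count exceeds a phi
        fraction of the total number of distinct pairs."""
        partners = defaultdict(set)
        for primary, secondary in pairs:
            partners[primary].add(secondary)
        total_distinct = sum(len(s) for s in partners.values())
        return {p: len(s) for p, s in partners.items()
                if len(s) > phi * total_distinct}

    # E.g., sources paired with the destinations they contact: a source
    # reaching unusually many distinct destinations may indicate scanning.
    stream = [("s1", "d1"), ("s1", "d2"), ("s2", "d1"), ("s1", "d3"), ("s1", "d1")]
    print(heavy_hitters_distinct(stream, phi=0.5))
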
Thus the problem is an important one to understand and study in order to produce efficient streaming implementations. It remains an active area, with many new research contributions produced every year on the core problem and its variations. Due to the amount of work on this problem, it is easy to overlook important references or fail to appreciate the properties of certain algorithms. There are several cases where algorithms first published in the 1980s have been "rediscovered" two decades later; existing work is sometimes claimed to be incapable of providing a certain guarantee, which in truth it can provide with only minor modifications; and experimental evaluations do not always compare against the most suitable methods.
In this paper, we present the main ideas in this area, by describing some of the most significant algorithms for the core problem of finding frequent items using common notation and terminology. In doing so, we also present the historical development of these algorithms. Studying these algorithms is instructive, as they are relatively simple, but can be shown to provide formal guarantees on the quality of their output as a function of an accuracy parameter ε. We also provide baseline implementations of many of these algorithms against which future algorithms can be compared, and on top of which algorithms for different problems can be built. We perform experimental evaluation of the algorithms over a variety of data sets to indicate their performance in practice. From this, we are able to identify clear distinctions among the algorithms that are not apparent from their theoretical analysis alone.

