OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: Timely Reporting of Heavy Hitters Using External Memory

Journal Article · ACM Transactions on Database Systems
DOI: https://doi.org/10.1145/3472392 · OSTI ID: 1830533
  1. Williams College, Williamstown, MA (United States)
  2. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Univ. of California, Berkeley, CA (United States)
  3. Stony Brook Univ., NY (United States)
  4. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
  5. Rutgers Univ., Piscataway, NJ (United States)
  6. VMware Research, Palo Alto, CA (United States)

Given an input stream S of size N, a Φ-heavy hitter is an item that occurs at least ΦN times in S. The problem of finding heavy hitters is extensively studied in the database literature. In this work, we study a real-time heavy-hitters variant in which an element must be reported shortly after we see its T = ΦN-th occurrence (and hence it becomes a heavy hitter). We call this the Timely Event Detection (TED) problem. The TED problem models the needs of many real-world monitoring systems, which demand accurate (i.e., no false negatives) and timely reporting of all events from large, high-speed streams with a low reporting threshold (high sensitivity). Like the classic heavy-hitters problem, solving the TED problem without false positives requires large space (Ω(N) words). Thus, in-RAM heavy-hitters algorithms typically sacrifice accuracy (i.e., allow false positives), sensitivity, or timeliness (i.e., use multiple passes). We show how to adapt heavy-hitters algorithms to external memory to solve the TED problem on large, high-speed streams while guaranteeing accuracy, sensitivity, and timeliness. Our data structures are limited only by I/O bandwidth (not latency) and support a tunable tradeoff between reporting delay and I/O overhead. With a small bounded reporting delay, our algorithms incur only a logarithmic I/O overhead. We implement and validate our data structures empirically using the Firehose streaming benchmark. Multi-threaded versions of our structures can scale to process 11M observations per second before becoming CPU-bound. In comparison, a naive adaptation of the standard heavy-hitters algorithm to external memory would be limited by the storage device's random I/O throughput, i.e., ≈100K observations per second.
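For illustration, the TED reporting semantics can be sketched with exact in-RAM counting: every item's count is maintained, and an item is reported the moment its T = ⌈ΦN⌉-th occurrence arrives. This toy version has no false positives or negatives and zero reporting delay, but it uses Ω(N) words of space in the worst case, which is exactly the cost the paper's external-memory data structures are designed to pay on disk rather than in RAM. This is a minimal sketch, not the paper's algorithm; the function name `ted_report` is hypothetical.

```python
import math
from collections import defaultdict

def ted_report(stream, n, phi):
    """Report each item exactly once, at its T = ceil(phi * n)-th
    occurrence -- the moment it becomes a phi-heavy hitter.

    Exact counting: accurate (no false positives), fully sensitive
    (no false negatives), and timely (zero reporting delay), at the
    price of Omega(N) words of space in the worst case.
    """
    threshold = math.ceil(phi * n)
    counts = defaultdict(int)
    reports = []
    for t, item in enumerate(stream):
        counts[item] += 1
        if counts[item] == threshold:  # the T-th occurrence: report now
            reports.append((t, item))
    return reports

# Example: in "abracadabra" (N = 11) with phi = 0.3, the threshold is
# ceil(3.3) = 4, so only 'a' is reported, at its 4th occurrence (index 7).
print(ted_report(list("abracadabra"), 11, 0.3))  # [(7, 'a')]
```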

Research Organization:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Sandia National Lab. (SNL-CA), Livermore, CA (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR); USDOE National Nuclear Security Administration (NNSA); National Science Foundation (NSF)
Grant/Contract Number:
NA0003525; CCF 1947789; CCF 1725543; CSR 1763680; CCF 1716252; CCF 1617618; CNS 1938709; CCF 2106827; CCF 1715777; 1637458; AC02-05CH11231
OSTI ID:
1830533
Report Number(s):
SAND-2021-14465J; 701551
Journal Information:
ACM Transactions on Database Systems, Vol. 46, Issue 4; ISSN 0362-5915
Publisher:
Association for Computing Machinery (ACM)
Country of Publication:
United States
Language:
English

References (21)

An improved data stream summary: the count-min sketch and its applications journal April 2005
The log-structured merge-tree (LSM-tree) journal June 1996
What's hot and what's not: tracking most frequent items dynamically journal March 2005
A General-Purpose Counting Filter: Making Every Bit Count conference May 2017
  • Pandey, Prashant; Bender, Michael A.; Johnson, Rob
  • SIGMOD'17: Proceedings of the 2017 ACM International Conference on Management of Data https://doi.org/10.1145/3035918.3035963
Identifying heavy hitters in high-speed network monitoring journal March 2010
The input/output complexity of sorting and related problems journal August 1988
BGPmon: A Real-Time, Scalable, Extensible Monitoring System conference March 2009
  • Yan, He; Oliveira, Ricardo; Burnett, Kevin
  • 2009 Cybersecurity Applications & Technology Conference for Homeland Security (CATCH) https://doi.org/10.1109/CATCH.2009.28
SVELTE: Real-time intrusion detection in the Internet of Things journal November 2013
BPTree: An ℓ2 Heavy Hitters Algorithm Using Constant Memory conference May 2017
  • Braverman, Vladimir; Chestnut, Stephen R.; Ivkin, Nikita
  • PODS'17: Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems https://doi.org/10.1145/3034786.3034798
Power-Law Distributions in Empirical Data journal November 2009
Continuous queries over data streams journal September 2001
HeavyKeeper: An Accurate Algorithm for Finding Top-k Elephant Flows journal October 2019
The Characterization of Continuous Queries journal December 1999
Methods for finding frequent items in data streams journal December 2009
A simple algorithm for finding frequent elements in streams and bags journal March 2003
Real-time Stability in Power Systems: Techniques for Early Detection of the Risk of Blackout [Book Review] journal May 2006
Finding repeated elements journal November 1982
Probabilistic lossy counting: an efficient algorithm for finding heavy hitters journal January 2008
Don't thrash: how to cache your hash on flash journal July 2012
Conditional heavy hitters: detecting interesting correlations in data streams journal February 2015
Power laws, Pareto distributions and Zipf's law journal September 2005