research-article

Probabilistic lossy counting: an efficient algorithm for finding heavy hitters

Authors:
Xenofontas Dimitropoulos, IBM Zurich Research Laboratory
Paul Hurley, IBM Zurich Research Laboratory
Andreas Kind, IBM Zurich Research Laboratory

ACM SIGCOMM Computer Communication Review, Volume 38, Issue 1, January 2008, pp 5, https://doi.org/10.1145/1341431.1341433

Published: 30 January 2008

Abstract

Knowledge of the largest traffic flows in a network is important for many network management applications. The problem of finding these flows is known as the heavy-hitter problem and has been the subject of many studies in recent years. One of the most efficient and well-known algorithms for finding heavy hitters is lossy counting [29].

In this work we introduce probabilistic lossy counting (PLC), which enhances lossy counting for computing network traffic heavy hitters. PLC uses a tighter error bound on the estimated sizes of traffic flows and provides probabilistic rather than deterministic guarantees on its accuracy. The probabilistic error bound substantially reduces the memory consumption of the algorithm. In addition, PLC reduces the rate of false positives of lossy counting and achieves a low estimation error, although slightly higher than that of lossy counting.
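For context on the deterministic error bound mentioned above, the following is a minimal sketch of the baseline lossy counting algorithm of Manku and Motwani [29], not of PLC itself. The abstract gives no pseudocode, so the class name `LossyCounter`, its method names, and the parameter values are illustrative assumptions. Each table entry keeps a count and a deterministic per-entry error term `delta`; this is presumably the kind of bound that PLC replaces with a tighter probabilistic one, though the abstract does not give the details.

```python
from math import ceil


class LossyCounter:
    """Baseline lossy counting (Manku and Motwani); a sketch, not PLC."""

    def __init__(self, epsilon):
        self.epsilon = epsilon            # maximum relative undercount
        self.width = ceil(1 / epsilon)    # window (bucket) width
        self.n = 0                        # items seen so far
        self.table = {}                   # item -> [count, delta]

    def add(self, item):
        self.n += 1
        bucket = ceil(self.n / self.width)  # id of the current window
        entry = self.table.get(item)
        if entry is not None:
            entry[0] += 1
        else:
            # Deterministic bound: the item may have been seen (and
            # evicted) at most bucket - 1 times before this insertion.
            self.table[item] = [1, bucket - 1]
        # At each window boundary, evict entries that cannot be frequent.
        if self.n % self.width == 0:
            self.table = {
                key: value
                for key, value in self.table.items()
                if value[0] + value[1] > bucket
            }

    def heavy_hitters(self, support):
        """Return items whose stored count is at least (support - epsilon) * n."""
        threshold = (support - self.epsilon) * self.n
        return {
            key: value[0]
            for key, value in self.table.items()
            if value[0] >= threshold
        }


if __name__ == "__main__":
    counter = LossyCounter(epsilon=0.1)
    # Hypothetical stream: one large flow "a" interleaved with one-off flows.
    for i in range(100):
        counter.add("a" if i % 2 == 0 else f"x{i}")
    print(counter.heavy_hitters(support=0.3))
```

In this baseline, a stored count never exceeds the true count and undercounts it by at most `epsilon * n`, so every item with true frequency at least `support * n` is reported. The price is false positives and table entries kept only because of the conservative `delta`, which is where the abstract's claims about memory and false-positive reduction apply.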

We compare PLC with state-of-the-art algorithms for finding heavy hitters. Our experiments using real traffic traces find that PLC has 1) between 34.4% and 74% lower memory consumption, 2) between 37.9% and 40.5% fewer false positives than lossy counting, and 3) a small estimation error.

References

[1] L. A. Adamic. Zipf, Power-laws, and Pareto - a ranking tutorial. http://www.hpl.hp.com/research/idl/papers/ranking/ranking.html.
[2] N. Alon, P. Gibbons, Y. Matias, and M. Szegedy. Tracking join and self-join sizes in limited storage. In Proceedings of the Eighteenth ACM Symposium on Principles of Database Systems, 1999.
[3] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. In Proceedings of the Twenty-Eighth Annual ACM Symposium on the Theory of Computing, 1996.
[4] B. Babcock and C. Olston. Distributed top-k monitoring. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 28--39, New York, NY, USA, 2003. ACM Press.
[5] C. Barakat, G. Iannaccone, and C. Diot. Ranking flows from sampled traffic. In Proceedings of CoNext, 2005.
[6] M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. In Proceedings of International Colloquium on Automata, Languages and Programming (ICALP), 2002.
[7] G. Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. J. Algorithms, 55(1):58--75, 2005.
[8] C. Cranor, T. Johnson, O. Spatscheck, and V. Shkapenyuk. The gigascope stream database. IEEE Data Engineering Bulletin, 26(1):27--32, 2003.
[9] E. Demaine, A. Lopez-Ortiz, and J. Munro. Frequency estimation of Internet packet streams with limited space. In Proceedings of the 10th ESA Annual European Symposium on Algorithms, pages 348--360, 2002.
[10] N. Duffield, C. Lund, and M. Thorup. Charging from sampled network usage. In SIGCOMM Internet Measurement Workshop, Nov 2001.
[11] N. Duffield, C. Lund, and M. Thorup. Flow sampling under hard resource constraints. In SIGMETRICS'04/Performance '04: Proceedings of the joint international conference on Measurement and Modeling of Computer Systems, pages 85--96, New York, NY, USA, 2004. ACM Press.
[12] N. Duffield, C. Lund, and M. Thorup. Learn more, sample less: Control of volume and variance in network measurement. IEEE Transactions on Information Theory, 51:1756--1775, 2005.
[13] N. Duffield, C. Lund, and M. Thorup. Sampling to estimate arbitrary subset sums, 2005.
[14] C. Estan and G. Varghese. New directions in traffic measurement and accounting. In Proceedings of ACM SIGCOMM, 2002.
[15] C. Estan and G. Varghese. New directions in traffic measurement and accounting. Technical Report 699, UCSD CSE, 2002.
[16] M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. Ullman. Computing iceberg queries efficiently. In Proceedings of the 24th International Conference on Very Large Databases, 1998.
[17] P. B. Gibbons and Y. Matias. New sampling-based summary statistics for improving approximate query answers. pages 331--342, 1998.
[18] A. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In Proceedings of the 27th International Conference on Very Large Databases, 2001.
[19] A. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. How to summarize the universe: Dynamic maintenance of quantiles. In Proceedings of the 28th International Conference on Very Large Data Bases, 2002.
[20] N. Kamiyama and T. Mori. Simple and accurate identification of high-rate flows by packet sampling. In Proceedings of IEEE INFOCOM, 2006.
[21] R. M. Karp, S. Shenker, and C. H. Papadimitriou. A simple algorithm for finding frequent elements in streams and bags. ACM Transactions on Database Systems, 28(1):51--55, 2003.
[22] K. Keys, D. Moore, and C. Estan. A robust system for accurate real time summaries of Internet traffic. ACM SIGMETRICS Performance Evaluation Review, 33(1), 2005.
[23] K. Keys, D. Moore, R. Koga, E. Lagache, M. Tesch, and k. claffy. The architecture of CoralReef: an Internet traffic monitoring software suite. In Workshop (PAM' 01), 2001.
[24] B. Krishnamurthy, S. Sen, Y. Zhang, and Y. Chen. Sketch-based change detection: Methods, evaluation, and applications. In Proceedings of the 3rd ACM SIGCOMM Internet Measurement Conference, 2003.
[25] A. Kumar, M. Sung, J. Xu, and J. Wang. Data streaming algorithms for efficient and accurate estimation of flow size distributions. In Proceedings of ACM SIGMETRICS, 2004.
[26] F. Li, C. Chang, G. Kollios, and A. Bestavros. Characterizing and exploiting reference locality in data stream applications. In Proceedings of the 22nd IEEE International Conference on Data Engineering (ICDE), 2006.
[27] X. Li, F. Bian, M. Crovella, C. Diot, R. Govindan, G. Iannaccone, and A. Lakhina. Detection and identification of network anomalies using sketch subspaces. In Proceedings of the 6th ACM SIGCOMM Internet Measurement Conference, pages 147--152, New York, NY, USA, 2006. ACM Press.
[28] J. D. C. Little. A proof for the queueing formula: L = λW. Operations Research, 9(3):383--387, 1961.
[29] G. S. Manku and R. Motwani. Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB), 2002.
[30] A. Metwally, D. Agrawal, and A. E. Abbadi. Efficient computation of frequent and top-k elements in data streams. In Proceedings of the 10th International Conference on Database Theory (ICDT), pages 398--412, 2005.
[31] K. Papagiannaki, N. Taft, and C. Diot. Impact of flow dynamics on traffic engineering design principles. In Proceedings of IEEE INFOCOM, 2004.
[32] R. Schweller, A. Gupta, E. Parsons, and Y. Chen. Reversible sketches for efficient and accurate change detection over network data streams. In Proceedings of ACM SIGCOMM Internet Measurement Conference, 2004.
[33] S. Venkataraman, D. Song, P. Gibbons, and A. Blum. New streaming algorithms for fast detection of superspreaders. In Proceedings of Internet Society Network and Distributed System Security (NDSS) Symposium, February 2006.
[34] Y. Zhang, S. Singh, S. Sen, N. Duffield, and C. Lund. Online identification of hierarchical heavy hitters: Algorithms, evaluation, and applications. In Proceedings of the 4th ACM SIGCOMM Internet Measurement Conference, pages 101--114, New York, NY, USA, 2004. ACM Press.
[35] Q. Zhao, A. Kumar, and J. Xu. Joint data streaming and sampling techniques for detection of super sources and destinations. In Proceedings of ACM SIGCOMM Internet Measurement Conference, October 2005.

Index Terms

Probabilistic lossy counting: an efficient algorithm for finding heavy hitters
1. Mathematics of computing
  1. Discrete mathematics
    1. Combinatorics
      1. Enumeration
2. Networks
  1. Network services
    1. Network monitoring

Published in

ACM SIGCOMM Computer Communication Review Volume 38, Issue 1
January 2008
54 pages
ISSN:0146-4833
DOI:10.1145/1341431

Copyright © 2008 Copyright is held by the owner/author(s)
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data streams
heavy hitters

