Abstract
Knowledge of the largest traffic flows in a network is important for many network management applications. The problem of finding these flows is known as the heavy-hitter problem and has been the subject of many studies in recent years. One of the most efficient and well-known algorithms for finding heavy hitters is lossy counting [29].
In this work we introduce probabilistic lossy counting (PLC), which enhances lossy counting for computing network traffic heavy hitters. PLC uses a tighter error bound on the estimated sizes of traffic flows and provides probabilistic rather than deterministic guarantees on its accuracy. The probabilistic error bound substantially reduces the memory consumption of the algorithm. In addition, PLC reduces the false-positive rate of lossy counting and achieves a low estimation error, although slightly higher than that of lossy counting.
We compare PLC with state-of-the-art algorithms for finding heavy hitters. Our experiments using real traffic traces find that PLC has 1) between 34.4% and 74% lower memory consumption, 2) between 37.9% and 40.5% fewer false positives than lossy counting, and 3) a small estimation error.
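For orientation, the baseline that PLC builds on can be sketched as follows. This is a minimal illustration of classic lossy counting as commonly described for [29], not the PLC algorithm itself. The function names `lossy_counting` and `heavy_hitters` and the dictionary layout are assumptions made for this example. Each table entry stores a count and a deterministic error bound `delta`; PLC's tighter probabilistic bound is not reproduced here.

```python
import math


def lossy_counting(stream, epsilon):
    """Classic lossy counting (baseline only, not PLC).

    Returns (entries, n), where entries maps item -> [count, delta]
    and n is the number of items seen. Each count underestimates the
    true frequency by at most epsilon * n.
    """
    width = math.ceil(1 / epsilon)   # bucket width
    entries = {}                     # item -> [count, delta]
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)        # current bucket id
        if item in entries:
            entries[item][0] += 1
        else:
            # delta is the deterministic bound on occurrences missed so far
            entries[item] = [1, bucket - 1]
        if n % width == 0:
            # bucket boundary: evict entries that cannot be frequent
            entries = {k: v for k, v in entries.items()
                       if v[0] + v[1] > bucket}
    return entries, n


def heavy_hitters(entries, n, support, epsilon):
    """Report items whose stored count is at least (support - epsilon) * n."""
    threshold = (support - epsilon) * n
    return {k: c for k, (c, _) in entries.items() if c >= threshold}


# Hypothetical toy stream: two large "flows" and ten one-off items.
stream = ["a"] * 60 + ["b"] * 30 + [f"x{i}" for i in range(10)]
entries, n = lossy_counting(stream, epsilon=0.1)
print(heavy_hitters(entries, n, support=0.5, epsilon=0.1))
```

In this toy run the one-off items are evicted at a bucket boundary, so only the two large flows stay in the table. The `delta` stored per entry is the quantity for which the abstract says PLC substitutes a tighter, probabilistic bound.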
References
- L. A. Adamic. Zipf, Power-laws, and Pareto - a ranking tutorial. http://www.hpl.hp.com/research/idl/papers/ranking/ranking.html.
- N. Alon, P. Gibbons, Y. Matias, and M. Szegedy. Tracking join and self-join sizes in limited storage. In Proceedings of the Eighteenth ACM Symposium on Principles of Database Systems, 1999.
- N. Alon, T. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. In Proceedings of the Twenty-Eighth Annual ACM Symposium on the Theory of Computing, 1999.
- B. Babcock and C. Olston. Distributed top-k monitoring. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 28--39, New York, NY, USA, 2003. ACM Press.
- C. Barakat, G. Iannaccone, and C. Diot. Ranking flows from sampled traffic. In Proceedings of CoNext, 2005.
- M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. In Proceedings of International Colloquium on Automata, Languages and Programming (ICALP), 2002.
- G. Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. J. Algorithms, 55(1):58--75, 2005.
- C. Cranor, T. Johnson, O. Spatscheck, and V. Shkapenyuk. The gigascope stream database. IEEE Data Engineering Bulletin, 26(1):27--32, 2003.
- E. Demaine, A. Lopez-Ortiz, and J. Munro. Frequency estimation of Internet packet streams with limited space. In Proceedings of the 10th ESA Annual European Symposium on Algorithms, pages 348--360, 2002.
- N. Duffield, C. Lund, and M. Thorup. Charging from sampled network usage. In SIGCOMM Internet Measurement Workshop, Nov 2001.
- N. Duffield, C. Lund, and M. Thorup. Flow sampling under hard resource constraints. In SIGMETRICS'04/Performance '04: Proceedings of the joint international conference on Measurement and Modeling of Computer Systems, pages 85--96, New York, NY, USA, 2004. ACM Press.
- N. Duffield, C. Lund, and M. Thorup. Learn more, sample less: Control of volume and variance in network measurement. IEEE Transactions of Information Theory, 51:1756--1775, 2005.
- N. Duffield, C. Lund, and M. Thorup. Sampling to estimate arbitrary subset sums, 2005.
- C. Estan and G. Varghese. New directions in traffic measurement and accounting. In Proceedings of ACM SIGCOMM, 2002.
- C. Estan and G. Varghese. New directions in traffic measurement and accounting. Technical Report 699, UCSD CSE, 2002.
- M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and K. Ullman. Computing iceberg queries efficiently. In Proceedings of the 24th International Conference on Very Large Databases, 1998.
- P. B. Gibbons and Y. Matias. New sampling-based summary statistics for improving approximate query answers. pages 331--342, 1998.
- A. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In Proceedings of the 27th International Conference on Very Large Databases, 2001.
- A. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. How to summarize the universe: Dynamic maintenance of quantiles. In Proceedings of the 28th International Conference on Very Large Data Bases, 2002.
- N. Kamiyama and T. Mori. Simple and accurate identification of high-rate flows by packet sampling. In Proceedings of IEEE INFOCOM, 2006.
- R. M. Karp, S. Shenker, and C. H. Papadimitriou. A simple algorithm for finding frequent elements in streams and bags. ACM Transactions on Database Systems, 28(1):51--55, 2003.
- K. Keys, D. Moore, and C. Estan. A robust system for accurate real time summaries of Internet traffic. ACM SIGMETRICS Performance Evaluation Review, 33(1), 2005.
- K. Keys, D. Moore, R. Koga, E. Lagache, M. Tesch, and k. claffy. The architecture of CoralReef: an Internet traffic monitoring software suite. In Workshop (PAM' 01), 2001.
- B. Krishnamurthy, S. Sen, Y. Zhang, and Y. Chen. Sketch-based change detection: Methods, evaluation, and applications. In Proceedings of the 3rd ACM SIGCOMM Internet Measurement Conference, 2003.
- A. Kumar, M. Sung, J. Xu, and J. Wang. Data streaming algorithms for efficient and accurate estimation of flow size distributions. In Proceedings of ACM SIGMETRICS, 2004.
- F. Li, C. Chang, G. Kollios, and A. Bestavros. Characterizing and exploiting reference locality in data stream applications. In Proceedings of the 22nd IEEE International Conference on Data Engineering (ICDE), 2006.
- X. Li, F. Bian, M. Crovella, C. Diot, R. Govindan, G. Iannaccone, and A. Lakhina. Detection and identification of network anomalies using sketch subspaces. In Proceedings of the 6th ACM SIGCOMM Internet Measurement Conference, pages 147--152, New York, NY, USA, 2006. ACM Press.
- J. D. C. Little. A proof for the queueing formula: L = λW. Operations Research, 9(3):383--387, 1961.
- G. S. Manku and R. Motwani. Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB), 2002.
- A. Metwally, D. Agrawal, and A. E. Abbadi. Efficient computation of frequent and top-k elements in data streams. In Proceedings of the 10th International Conference on Database Theory (ICDT), pages 398--412, 2005.
- K. Papagiannaki, N. Taft, and C. Diot. Impact of flow dynamics on traffic engineering design principles. In Proceedings of IEEE INFOCOM, 2004.
- R. Schweller, A. Gupta, E. Parsons, and Y. Chen. Reversible sketches for efficient and accurate change detection over network data streams. In Proceedings of ACM SIGCOMM Internet Measurement Conference, 2004.
- S. Venkataraman, D. Song, P. Gibbons, and A. Blum. New streaming algorithms for fast detection of superspreaders. In Proceedings of Internet Society Network and Distributed System Security (NDSS) Symposium, February 2006.
- Y. Zhang, S. Singh, S. Sen, N. Duffield, and C. Lund. Online identification of hierarchical heavy hitters: Algorithms, evaluation, and applications. In Proceedings of the 4th ACM SIGCOMM Internet Measurement Conference, pages 101--114, New York, NY, USA, 2004. ACM Press.
- Q. Zhao, A. Kumar, and J. Xu. Joint data streaming and sampling techniques for detection of super sources and destinations. In Proceedings of ACM SIGCOMM Internet Measurement Conference, October 2005.
Index Terms
- Probabilistic lossy counting: an efficient algorithm for finding heavy hitters