Finding heavy distinct hitters in data streams

Research article, DOI: 10.1145/1989493.1989541

Published: 04 June 2011

Abstract

A simple indicator of an anomaly in a network is a rapid increase in the total number of distinct network connections. While it is fairly easy to maintain an accurate estimate of the current total number of distinct connections using streaming algorithms with low space and computational complexity, efficiently identifying the network entities that are involved in the largest number of distinct connections is considerably harder.
In this paper, we study the problem of finding all entities whose number of distinct (outgoing or incoming) network connections is at least a specific fraction of the total number of distinct connections. These entities are referred to as heavy distinct hitters. Since this problem is hard in general, we focus on randomized approximation techniques and propose a sampling-based and a sketch-based streaming algorithm. Both algorithms output a list of the potential heavy distinct hitters including the estimated counts of the corresponding number of distinct connections. We prove that, depending on the required level of accuracy of the output list, the space complexities of the presented algorithms are asymptotically optimal up to small logarithmic factors. Additionally, the algorithms are evaluated and compared using real network data in order to determine their usefulness in practice.
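
To make the problem definition concrete, the following is a minimal sketch of an exact, non-streaming baseline. It is not one of the paper's algorithms: it stores every distinct connection, which is precisely the memory cost that the sampling-based and sketch-based approaches aim to avoid. The function name `heavy_distinct_hitters`, the parameter `phi` (the required fraction), and the representation of the stream as `(source, destination)` pairs are assumptions made for illustration only.

```python
from collections import defaultdict


def heavy_distinct_hitters(stream, phi):
    """Exact baseline (illustrative, not the paper's method).

    stream: iterable of (source, destination) pairs, possibly with repeats.
    phi:    fraction in (0, 1]; a source is reported if its number of
            distinct destinations is at least phi times the total number
            of distinct (source, destination) pairs.
    Returns a dict mapping each reported source to its distinct count.
    """
    # One set of destinations per source; repeated pairs collapse automatically.
    destinations = defaultdict(set)
    for src, dst in stream:
        destinations[src].add(dst)

    # Total number of distinct connections over all sources.
    total_distinct = sum(len(d) for d in destinations.values())
    threshold = phi * total_distinct

    return {src: len(d) for src, d in destinations.items() if len(d) >= threshold}


# Hypothetical toy stream: "a" contacts three distinct destinations,
# "b" and "c" contact one each; ("a", "x") appears twice.
stream = [("a", "x"), ("a", "y"), ("a", "z"), ("a", "x"), ("b", "x"), ("c", "y")]
result = heavy_distinct_hitters(stream, 0.5)
print(result)
```

With five distinct connections in total and `phi = 0.5`, only source `a` (three distinct destinations) reaches the threshold of 2.5. The memory used grows linearly with the number of distinct connections, which is why approximation is needed on real network streams.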

    Published In

    SPAA '11: Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures
    June 2011
    404 pages
    ISBN:9781450307437
    DOI:10.1145/1989493
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    In-Cooperation

    • EATCS: European Association for Theoretical Computer Science

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. anomaly detection
    2. heavy distinct hitter
    3. network monitoring
    4. space complexity
    5. streaming algorithms

    Qualifiers

    • Research-article

    Conference

    SPAA '11

    Acceptance Rates

    Overall Acceptance Rate 447 of 1,461 submissions, 31%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (last 12 months): 16
    • Downloads (last 6 weeks): 2
    Reflects downloads up to 05 Mar 2025

    Citations

    Cited By

    • (2024) OctoSketch. Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, pp. 1621-1639. DOI: 10.5555/3691825.3691914. Online publication date: 16-Apr-2024
    • (2020) IoT or NoT: Identifying IoT Devices in a Short Time Scale. NOMS 2020 - 2020 IEEE/IFIP Network Operations and Management Symposium, pp. 1-9. DOI: 10.1109/NOMS47738.2020.9110451. Online publication date: 20-Apr-2020
    • (2018) Fast Detection of Heavy Hitters in Software Defined Networking Using an Adaptive and Learning Method. Cloud Computing and Security, pp. 44-55. DOI: 10.1007/978-3-030-00012-7_5. Online publication date: 13-Sep-2018
    • (2017) Mitigating DNS random subdomain DDoS attacks by distinct heavy hitters sketches. Proceedings of the fifth ACM/IEEE Workshop on Hot Topics in Web Systems and Technologies, pp. 1-6. DOI: 10.1145/3132465.3132474. Online publication date: 14-Oct-2017
    • (2016) Identifying High-Cardinality Hosts from Network-Wide Traffic Measurements. IEEE Transactions on Dependable and Secure Computing 13:5, pp. 547-558. DOI: 10.1109/TDSC.2015.2423675. Online publication date: 1-Sep-2016
    • (2013) Identifying high-cardinality hosts from network-wide traffic measurements. 2013 IEEE Conference on Communications and Network Security (CNS), pp. 287-295. DOI: 10.1109/CNS.2013.6682718. Online publication date: Oct-2013
