Modeling skew in data streams

Published: 27 June 2006

Abstract

Data stream applications have made use of statistical summaries to reason about the data using nonparametric tools such as histograms, heavy hitters, and join sizes. However, relatively little attention has been paid to modeling stream data parametrically, despite the potential this approach has for mining the data. The challenges of model fitting at streaming speeds are both technical -- how to continually find fast and reliable parameter estimates on high-speed streams of skewed data using small space -- and conceptual -- how to validate the goodness-of-fit and stability of the model online.

In this paper, we show how to fit hierarchical (binomial multifractal) and non-hierarchical (Pareto) power-law models on a data stream. We address the technical challenges using an approach that maintains a sketch of the data stream and fits least-squares straight lines; it yields algorithms that are fast and space-efficient, and that provide approximations of parameter estimates with a priori quality guarantees relative to those obtained offline. We address the conceptual challenge by designing fast methods for online goodness-of-fit measurements on a data stream; we adapt the statistical testing technique of examining the quantile-quantile (q-q) plot to perform online model validation at streaming speeds.

As a concrete application of our techniques, we focus on network traffic data, which has been shown to exhibit skewed distributions. We complement our analytic and algorithmic results with experiments on IP traffic streams in AT&T's Gigascope® data stream management system, to demonstrate the practicality of our methods at line speeds. We measured the stability and robustness of these models over weeks of operational packet data in an IP network. In addition, we study an intrusion detection application, and demonstrate the potential of online parametric modeling.
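To make the "least-squares straight lines" idea concrete, the following is a minimal offline sketch in Python, not the paper's sketch-based streaming algorithm. It assumes a plain Pareto model with P(X > x) = (x_min / x)^alpha, so that log P(X > x) is linear in log x with slope -alpha, and it checks the fit with a simple q-q slope. The function names (`fit_line`, `pareto_alpha_from_ccdf`, `qq_slope`), the number of quantile points, and the synthetic data are all illustrative assumptions and do not come from the paper.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pareto_alpha_from_ccdf(values, num_points=20):
    """Estimate the Pareto shape alpha as minus the log-log CCDF slope."""
    data = sorted(values)
    n = len(data)
    log_x, log_ccdf = [], []
    for k in range(1, num_points + 1):
        # Evenly spaced quantiles; the extreme tail is skipped (assumption).
        idx = int(n * k / (num_points + 2))
        log_x.append(math.log(data[idx]))
        log_ccdf.append(math.log((n - idx) / n))
    slope, _ = fit_line(log_x, log_ccdf)
    return -slope


def qq_slope(values, alpha, x_min=1.0, num_points=20):
    """Slope of log empirical quantiles against log model quantiles.

    A slope close to 1 suggests the q-q plot is close to the diagonal.
    """
    data = sorted(values)
    n = len(data)
    log_model, log_empirical = [], []
    for k in range(1, num_points + 1):
        p = k / (num_points + 2)
        # Pareto quantile function: x_min * (1 - p)^(-1 / alpha).
        log_model.append(math.log(x_min * (1 - p) ** (-1 / alpha)))
        log_empirical.append(math.log(data[int(n * p)]))
    slope, _ = fit_line(log_model, log_empirical)
    return slope


# Synthetic Pareto data with shape 1.5 and x_min = 1 (illustrative only).
rng = random.Random(0)
sample = [rng.paretovariate(1.5) for _ in range(20000)]
alpha_hat = pareto_alpha_from_ccdf(sample)
print("estimated alpha:", round(alpha_hat, 2))
print("q-q slope at estimate:", round(qq_slope(sample, alpha_hat), 2))
```

In a streaming setting, the sorted array used here would presumably be replaced by approximate quantiles read from a small-space summary, which is where the approximation guarantees described in the abstract would come into play.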

Published In

SIGMOD '06: Proceedings of the 2006 ACM SIGMOD international conference on Management of data
June 2006
830 pages
ISBN:1595934340
DOI:10.1145/1142473
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. estimation
  2. modeling
  3. skew
  4. streaming algorithms

Qualifiers

  • Article

Conference

SIGMOD/PODS06
Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
