DOI: 10.1145/1065167.1065201
Article

Space efficient mining of multigraph streams

Published: 13 June 2005

Abstract

The challenge of monitoring massive amounts of data generated by communication networks has led to interest in data stream processing. We study streams of edges in massive communication multigraphs, defined by (source, destination) pairs. The goal is to compute properties of the underlying graph while using small space (much smaller than the number of communicants), and to avoid the bias introduced because some edges may appear many times while others are seen only once. We give results for three fundamental problems on multigraph degree sequences: estimating frequency moments of degrees, finding the heavy hitter degrees, and computing range sums of degree values. In all cases we show space bounds for our summarizing algorithms that are significantly smaller than storing complete information. We use a variety of data stream methods (sketches, sampling, hashing and distinct counting), but a common feature is that we use cascaded summaries: nesting multiple estimation techniques within one another. In our experimental study, we see that such summaries are highly effective: massive multigraph streams can be summarized to answer queries of interest with high accuracy using only a small amount of space.
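
To make the idea of a cascaded summary concrete, the following is a minimal illustrative sketch, not the algorithm from the paper. It assumes one particular nesting: a Count-Min-style array of cells indexed by hashing the source, where each cell holds a small distinct counter (here a k-minimum-values estimator) over the edges that land in it. The names `KMV`, `CascadedSketch`, `h64` and the parameters `rows`, `width` and `k` are hypothetical choices made for this example. Because each cell counts distinct edges, repeated (source, destination) pairs do not inflate the degree estimate.

```python
import hashlib


def h64(*parts):
    """Stable 64-bit hash of the given parts (hypothetical helper)."""
    data = "\x00".join(map(str, parts)).encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


class KMV:
    """k-minimum-values distinct counter: keeps the k smallest hashes seen."""

    def __init__(self, k):
        assert k >= 2
        self.k = k
        self.mins = set()

    def add(self, h):
        if h in self.mins:
            return  # duplicate item: no effect on the summary
        if len(self.mins) < self.k:
            self.mins.add(h)
        else:
            largest = max(self.mins)
            if h < largest:
                self.mins.remove(largest)
                self.mins.add(h)

    def estimate(self):
        if len(self.mins) < self.k:
            return float(len(self.mins))  # fewer than k distinct hashes: exact
        # Standard KMV estimate from the k-th smallest normalized hash.
        return (self.k - 1) * 2**64 / max(self.mins)


class CascadedSketch:
    """Count-Min-style grid whose cells are distinct counters (assumed design)."""

    def __init__(self, rows=4, width=64, k=32):
        self.rows, self.width = rows, width
        self.cells = [[KMV(k) for _ in range(width)] for _ in range(rows)]

    def _col(self, row, src):
        return h64("row", row, src) % self.width

    def update(self, src, dst):
        edge = h64("edge", src, dst)
        for r in range(self.rows):
            self.cells[r][self._col(r, src)].add(edge)

    def degree(self, src):
        # Colliding sources can only add edges to a cell, so take the minimum.
        return min(self.cells[r][self._col(r, src)].estimate()
                   for r in range(self.rows))


sketch = CascadedSketch()
for dst in ["x", "y", "z", "x", "x"]:  # the edge to "x" is repeated
    sketch.update("10.0.0.1", dst)
print(sketch.degree("10.0.0.1"))  # prints 3.0
```

In this sketch the space is rows × width × k hash values regardless of the stream length. When several sources share a cell, the estimate for each can only grow, which is why the query takes the minimum over rows.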


Published In

PODS '05: Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
June 2005
388 pages
ISBN:1595930620
DOI:10.1145/1065167
  • General Chair: Georg Gottlob
  • Program Chair: Foto Afrati
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Article

Conference

SIGMOD/PODS05

Acceptance Rates

Overall Acceptance Rate 642 of 2,707 submissions, 24%

Article Metrics

  • Downloads (last 12 months): 19
  • Downloads (last 6 weeks): 3

Reflects downloads up to 07 Mar 2025.

Cited By

  • (2024) An Accurate and Invertible Sketch for Super Spread Detection. Electronics 13:1 (222). DOI: 10.3390/electronics13010222. Online publication date: 3-Jan-2024
  • (2024) From CountMin to Super kJoin Sketches for Flow Spread Estimation. IEEE Transactions on Network Science and Engineering 11:3 (2353-2370). DOI: 10.1109/TNSE.2023.3279665. Online publication date: May-2024
  • (2023) Real-time Spread Burst Detection in Data Streaming. Proceedings of the ACM on Measurement and Analysis of Computing Systems 7:2 (1-31). DOI: 10.1145/3589979. Online publication date: 22-May-2023
  • (2023) A High-Performance Invertible Sketch for Network-Wide Superspreader Detection. IEEE/ACM Transactions on Networking 31:2 (724-737). DOI: 10.1109/TNET.2022.3198738. Online publication date: Apr-2023
  • (2023) Randomized Error Removal for Online Spread Estimation in High-Speed Networks. IEEE/ACM Transactions on Networking 31:2 (558-573). DOI: 10.1109/TNET.2022.3197968. Online publication date: Apr-2023
  • (2023) Persistent graph stream summarization for real-time graph analytics. World Wide Web 26:5 (2647-2667). DOI: 10.1007/s11280-023-01165-z. Online publication date: 5-May-2023
  • (2022) Multi-relation Graph Summarization. ACM Transactions on Knowledge Discovery from Data 16:5 (1-30). DOI: 10.1145/3494561. Online publication date: 9-Mar-2022
  • (2022) Virtual Filter for Non-Duplicate Sampling With Network Applications. IEEE/ACM Transactions on Networking 30:6 (2818-2833). DOI: 10.1109/TNET.2022.3182694. Online publication date: 22-Jun-2022
  • (2022) Super Spreader Identification Using Geometric-Min Filter. IEEE/ACM Transactions on Networking 30:1 (299-312). DOI: 10.1109/TNET.2021.3108033. Online publication date: Feb-2022
  • (2022) Erasable Virtual HyperLogLog for Approximating Cumulative Distribution over Data Streams. IEEE Transactions on Knowledge and Data Engineering 34:11 (5336-5350). DOI: 10.1109/TKDE.2021.3052938. Online publication date: 1-Nov-2022
