DOI: 10.1145/3093742.3095108
tutorial

Data Streaming and its Application to Stream Processing: Tutorial

Published: 08 June 2017

Abstract

This tutorial paper presents recent research results on data streaming applied to stream processing systems. We first introduce the data streaming model and detail the main algorithmic results in this research field. We then describe how these algorithms can be applied to modern distributed stream processing systems to improve their efficiency. Finally, we outline several open research directions in this field.


Cited By

  • (2024) From CountMin to Super kJoin Sketches for Flow Spread Estimation. IEEE Transactions on Network Science and Engineering, 11:3 (2353-2370). DOI: 10.1109/TNSE.2023.3279665. Online publication date: May-2024
  • (2023) A Systemic Mapping of Methods and Tools for Performance Analysis of Data Streaming with Containerized Microservices Architecture. 2023 18th Iberian Conference on Information Systems and Technologies (CISTI) (1-6). DOI: 10.23919/CISTI58278.2023.10211834. Online publication date: 20-Jun-2023
  • (2019) A Load-Shedding Technique Based on the Measurement Project Definition. Recent Trends in Intelligent Computing, Communication and Devices (1027-1033). DOI: 10.1007/978-981-13-9406-5_122. Online publication date: 2-Oct-2019


Published In

DEBS '17: Proceedings of the 11th ACM International Conference on Distributed and Event-based Systems
June 2017
393 pages
ISBN:9781450350655
DOI:10.1145/3093742
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Data Streaming
  2. Load Balancing
  3. Load Shedding
  4. Stream Processing

Qualifiers

  • Tutorial
  • Research
  • Refereed limited

Conference

DEBS '17

Acceptance Rates

DEBS '17 Paper Acceptance Rate 22 of 60 submissions, 37%;
Overall Acceptance Rate 145 of 583 submissions, 25%



