Moira: A Goal-Oriented Incremental Machine Learning Approach to Dynamic Resource Cost Estimation in Distributed Stream Processing Systems

Authors: Daniele Foroni, Cristian Axenie, Stefano Bortoli, Mohamad Al Hajj Hassan, Yannis Velegrakis

BIRTE '18: Proceedings of the International Workshop on Real-Time Business Intelligence and Analytics

Article No.: 2, Pages 1 - 10

https://doi.org/10.1145/3242153.3242160

Published: 27 August 2018

Abstract

The demand for real-time analysis keeps growing, and so does the number of available streaming sources. The recent literature offers many works on Data Stream Processing (DSP). In a streaming environment, the incoming data rate varies over time, so the challenge is how to deploy these applications efficiently in a cluster. Several works have addressed reducing system latency or minimizing the resources allocated per application over time. However, to the best of our knowledge, none of the existing works takes into account the user's needs for a specific application, which differ from one user to another. In this paper, we propose Moira, a goal-oriented framework, built on top of Apache Flink, for dynamically optimizing resource allocation.

The system takes actions based on the user application and on the characteristics of the incoming data (i.e., input rate and window size). Starting from an initial estimate of the resources needed for the user query, at each iteration we refine our cost function with metrics collected from the monitored system about the incoming data, in order to fulfill the user's needs. We present a series of experiments that show in which cases our dynamic estimation outperforms both baseline Apache Flink and the rule-of-thumb estimation performed only once, at application deployment.
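
As an illustration only, the iterative refinement described above can be sketched as an online estimator that starts from a rough guess and corrects itself after every monitoring round. The abstract does not specify Moira's actual cost function or learning method, so everything in this sketch is an assumption: the linear model, the gradient-step update, the names IncrementalCostModel, observed_cost and simulate, and the made-up cost values standing in for metrics collected from the monitored system. Input rate and window size are assumed to be normalized to the range 0 to 1.

```python
from dataclasses import dataclass, field
from itertools import cycle, islice, product


@dataclass
class IncrementalCostModel:
    """Hypothetical online linear estimator: cost ~ w0 + w1*rate + w2*window."""
    # All-zero weights play the role of a (poor) initial estimate.
    weights: list = field(default_factory=lambda: [0.0, 0.0, 0.0])
    learning_rate: float = 0.1

    def predict(self, input_rate: float, window_size: float) -> float:
        w0, w1, w2 = self.weights
        return w0 + w1 * input_rate + w2 * window_size

    def update(self, input_rate: float, window_size: float,
               observed: float) -> float:
        """Take one gradient step on the squared error; return the pre-update error."""
        error = self.predict(input_rate, window_size) - observed
        features = (1.0, input_rate, window_size)
        self.weights = [w - self.learning_rate * error * x
                        for w, x in zip(self.weights, features)]
        return error


def observed_cost(input_rate: float, window_size: float) -> float:
    """Stand-in for metrics collected from the monitored system (made-up values)."""
    return 1.0 + 2.0 * input_rate + 0.5 * window_size


def simulate(steps: int = 5000):
    """Feed the model a repeating grid of normalized (rate, window) observations."""
    model = IncrementalCostModel()
    grid = list(product((0.2, 0.5, 0.8), repeat=2))
    errors = [model.update(r, w, observed_cost(r, w))
              for r, w in islice(cycle(grid), steps)]
    return model, abs(errors[0]), abs(errors[-1])
```

Calling simulate() returns the fitted model together with the absolute estimation error at the first and last iteration, which makes it easy to check that the estimate improves as more observations arrive. A linear model is used here purely for brevity; any incremental learner with the same predict/update interface could be substituted.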

Index Terms

1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Stream management
  2. Information systems applications
    1. Data mining
      1. Data stream mining
    2. Process control systems

Published In

BIRTE '18: Proceedings of the International Workshop on Real-Time Business Intelligence and Analytics

August 2018

59 pages

ISBN:9781450366076

DOI:10.1145/3242153

Editors:
Damianos Chatziantoniou (Athens University of Economics and Business, Greece),
Malu Castellanos (Teradata Aster, USA),
Panos K. Chrysanthis (University of Pittsburgh, USA)

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

In-Cooperation

NSF: National Science Foundation
Google Inc.
Microsoft: Microsoft

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

BIRTE '18: International Workshop on Real-Time Business Intelligence and Analytics

August 27, 2018

Rio de Janeiro, Brazil

Acceptance Rates

Overall Acceptance Rate 12 of 21 submissions, 57%

Article Metrics

Total Citations: 4
Total Downloads: 123
Downloads (Last 12 months): 11
Downloads (Last 6 weeks): 1

Reflects downloads up to 19 Feb 2025

Cited By

Barry M, Montiel J, Bifet A, Wadkar S, Manchev N, Halford M, Chiky R, Jaouhari S, Shakman K, Fehaily J, Le Deit F, Tran V, Guerizec E (2023). StreamMLOps: Operationalizing Online Learning for Big Data Streaming & Real-Time Applications. 2023 IEEE 39th International Conference on Data Engineering (ICDE), pages 3508-3521. Online publication date: Apr-2023.
https://doi.org/10.1109/ICDE55515.2023.00272
Li W, Zhang Z, Shu Y, Liu H, Liu T (2022). Toward optimal operator parallelism for stream processing topology with limited buffers. The Journal of Supercomputing, 78:11, pages 13276-13297. Online publication date: 16-Mar-2022.
https://doi.org/10.1007/s11227-022-04376-9
Morisawa Y, Suzuki M, Kitahara T (2020). Flexible Executor Allocation without Latency Increase for Stream Processing in Apache Spark. 2020 IEEE International Conference on Big Data (Big Data), pages 2198-2206. Online publication date: 10-Dec-2020.
https://doi.org/10.1109/BigData50022.2020.9377967
da Silva Veith A, de Souza F, de Assunção M, Lefèvre L, dos Anjos J (2019). Multi-Objective Reinforcement Learning for Reconfiguring Data Stream Analytics on Edge Computing. Proceedings of the 48th International Conference on Parallel Processing, pages 1-10. Online publication date: 5-Aug-2019.
https://dl.acm.org/doi/10.1145/3337821.3337894
