DOI: 10.1145/3242153.3242160
research-article

Moira: A Goal-Oriented Incremental Machine Learning Approach to Dynamic Resource Cost Estimation in Distributed Stream Processing Systems

Published: 27 August 2018

Abstract

The need for real-time analysis continues to grow, and the number of available streaming sources is increasing. The recent literature contains many works on Data Stream Processing (DSP). In a streaming environment, the incoming data rate varies over time, and the challenge is how to deploy DSP applications efficiently in a cluster. Several works have focused on reducing system latency or on minimizing the resources allocated to each application over time. However, to the best of our knowledge, none of the existing works takes into consideration the user's needs for a specific application, which differ from one user to another. In this paper, we propose Moira, a goal-oriented framework, built on top of Apache Flink, for dynamically optimizing resource allocation.
The system takes actions based on the user application and on the characteristics of the incoming data (i.e., input rate and window size). Starting from an initial estimate of the resources needed for the user query, at each iteration we refine our cost function using metrics collected from the monitored system about the incoming data, in order to fulfill the user's needs. We present a series of experiments that show in which cases our dynamic estimation outperforms both baseline Apache Flink and a rule-of-thumb estimation performed only once, at application deployment.
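
The abstract does not specify the form of the cost function, so the following is only a minimal sketch of the general idea of incremental refinement, not Moira's actual model. It assumes a hypothetical linear cost model over the two data characteristics named above (input rate and window size), seeded with a rule-of-thumb guess and corrected by one gradient step per monitoring observation. The class name, the units, and the simulated "observed cost" are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class IncrementalCostModel:
    """Hypothetical linear cost model: cost = w_rate*rate + w_window*window + bias.

    The default weights stand in for a rule-of-thumb initial estimate
    (assumption: one resource unit per 1000 events/s, plus one unit of overhead).
    """
    w_rate: float = 1.0
    w_window: float = 0.0
    bias: float = 1.0
    learning_rate: float = 0.01

    def predict(self, rate_keps: float, window_min: float) -> float:
        # rate_keps: input rate in thousands of events/s; window_min: window size in minutes.
        return self.w_rate * rate_keps + self.w_window * window_min + self.bias

    def update(self, rate_keps: float, window_min: float, observed_cost: float) -> float:
        # One least-mean-squares step: move the weights against the prediction error.
        error = self.predict(rate_keps, window_min) - observed_cost
        step = self.learning_rate * error
        self.w_rate -= step * rate_keps
        self.w_window -= step * window_min
        self.bias -= step
        return error


def simulated_cost(rate_keps: float, window_min: float) -> float:
    # Stand-in for metrics collected from a monitored system (purely synthetic).
    return 2.0 * rate_keps + 0.5 * window_min + 1.0


def run_demo(epochs: int = 3000) -> tuple[float, float]:
    model = IncrementalCostModel()
    probe = (3.0, 3.0)  # a workload that never appears in the observations below
    initial_error = abs(model.predict(*probe) - simulated_cost(*probe))
    observations = [(1.0, 1.0), (2.0, 1.0), (3.0, 2.0),
                    (4.0, 1.0), (5.0, 2.0), (2.0, 3.0)]
    for _ in range(epochs):
        for rate, window in observations:
            model.update(rate, window, simulated_cost(rate, window))
    final_error = abs(model.predict(*probe) - simulated_cost(*probe))
    return initial_error, final_error


if __name__ == "__main__":
    before, after = run_demo()
    print(f"error of rule-of-thumb estimate: {before:.3f}")
    print(f"error after incremental updates: {after:.6f}")
```

In this sketch the static rule of thumb corresponds to using the initial weights forever, while the dynamic variant keeps calling `update` as new observations arrive. A real deployment would replace `simulated_cost` with measurements from the running system and would need to account for the user's goal, which this example leaves out.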


        Published In

        BIRTE '18: Proceedings of the International Workshop on Real-Time Business Intelligence and Analytics
        August 2018
        59 pages
        ISBN: 9781450366076
        DOI: 10.1145/3242153

        In-Cooperation

        • NSF: National Science Foundation
        • Google Inc.
        • Microsoft

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Qualifiers

        • Research-article
        • Research
        • Refereed limited

        Conference

        BIRTE '18

        Acceptance Rates

        Overall Acceptance Rate 12 of 21 submissions, 57%

        Article Metrics

        • Downloads (last 12 months): 11
        • Downloads (last 6 weeks): 1
        Reflects downloads up to 19 Feb 2025.

        Cited By

        • (2023) StreamMLOps: Operationalizing Online Learning for Big Data Streaming & Real-Time Applications. 2023 IEEE 39th International Conference on Data Engineering (ICDE), pages 3508-3521. DOI: 10.1109/ICDE55515.2023.00272. Online publication date: Apr-2023.
        • (2022) Toward optimal operator parallelism for stream processing topology with limited buffers. The Journal of Supercomputing, 78:11, pages 13276-13297. DOI: 10.1007/s11227-022-04376-9. Online publication date: 16-Mar-2022.
        • (2020) Flexible Executor Allocation without Latency Increase for Stream Processing in Apache Spark. 2020 IEEE International Conference on Big Data (Big Data), pages 2198-2206. DOI: 10.1109/BigData50022.2020.9377967. Online publication date: 10-Dec-2020.
        • (2019) Multi-Objective Reinforcement Learning for Reconfiguring Data Stream Analytics on Edge Computing. Proceedings of the 48th International Conference on Parallel Processing, pages 1-10. DOI: 10.1145/3337821.3337894. Online publication date: 5-Aug-2019.
