
Future Generation Computer Systems

Volume 136, November 2022, Pages 252-269

An energy efficient and runtime-aware framework for distributed stream computing systems

https://doi.org/10.1016/j.future.2022.06.007

Highlights

  • Modeling of stream applications, resources and energy consumption for stream computing environments.

  • Formalization of the scheduling problem and optimization of stream applications at the initial and runtime stages.

  • Reliable operator-based migration to ensure data stream sustainability.

  • Reduction of system energy consumption based on the Karush–Kuhn–Tucker mathematical model.

  • Development and integration of the proposed Er-Stream into Apache Storm.

Abstract

Task scheduling in distributed stream computing systems is an NP-complete problem. Current scheduling schemes usually involve a pause or slow-start process when the input data stream fluctuates, which undermines performance stability, especially the goals of high throughput and low latency. In addition, compute nodes left idle at runtime may incur considerable idle-load energy consumption. To address these problems, we propose an energy efficient and runtime-aware framework (Er-Stream). This paper thoroughly discusses the framework from the following aspects: (1) The communication between real-time data streaming tasks is investigated, and stream applications, resources and energy consumption are modeled to formalize the scheduling problem. (2) After an initial topology is submitted to the cluster, task pairs with high communication cost are placed on the same compute node through a lightweight task partitioning strategy, minimizing inter-node communication cost and avoiding frequent triggering of runtime scheduling. (3) At runtime, reliable task migration is performed based on node communication and resource usage, which in turn enables dynamic adjustment of node energy consumption. (4) Metrics including latency, throughput, resource load and energy consumption are evaluated in a real distributed stream computing environment. Under a comprehensive evaluation of variable-rate input scenarios, the proposed Er-Stream system delivers promising improvements in throughput, latency and energy consumption compared to Storm's existing scheduling strategies.

Introduction

Data-intensive services, such as social networking, stock trading and weather monitoring, are becoming increasingly common. They generate massive amounts of data every second. At the same time, an increasing number of applications emphasize real-time response and accuracy, placing higher demands on stream processing. For example, security faces great challenges in large-scale computing environments [1], [2], where timeliness is crucial for security checks, and real-time computing allows data to be analyzed and processed quickly to obtain useful information. Real-time computing is also used in areas such as education [3], road traffic [4], and environmental inspection [5].

To meet this demand, a variety of stream processing frameworks have emerged, including Spark [6], Heron [7], Samza [8] and Storm [9]. Built on batch processing [10], Spark divides the incoming data stream into short batches; Heron is a real-time, fault-tolerant distributed stream processing system developed by Twitter [11] as an open-source project; Samza started out as a stream processing solution at LinkedIn [12], and its most distinctive feature is that it is built heavily on the log-based Kafka [13]; Storm is one of the most popular open-source big data stream computing systems and has been widely used by many well-known companies and organizations, such as Twitter and Alibaba [14].

A stream computing system has multiple compute nodes that collaborate to process tasks, where high data transfer latency between nodes may have a negative impact on system performance. Communication time and data transfer latency can be effectively reduced by restricting data transfer to the same node or to nearby nodes. In addition, capability differences between nodes lead to different task execution and data transfer performance. Task placement for streaming applications can be mapped to an NP-complete problem [15]. Since the computational resources of a node are limited, data loss may occur when they cannot meet the computational demand. These factors make the scheduling of streaming applications challenging.
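
To make the co-location idea concrete, the following sketch shows a greedy heuristic that places the most communication-heavy task pairs on the same node when capacity allows. It is only an illustration of the intuition, not the paper's actual algorithm (that is given in Section 5); the TaskPair and Node types, the CPU-demand map and the first-fit fallback are assumptions introduced here.

```java
import java.util.*;

// Illustrative only: hypothetical types, not the Er-Stream data model.
// Node ids are assumed to equal their index in the `nodes` list.
record TaskPair(int taskA, int taskB, double commCost) {}

class Node {
    final int id;
    final double cpuCapacity;
    double cpuUsed = 0.0;
    Node(int id, double cpuCapacity) { this.id = id; this.cpuCapacity = cpuCapacity; }
    boolean canHost(double demand) { return cpuUsed + demand <= cpuCapacity; }
    void place(double demand) { cpuUsed += demand; }
}

class GreedyCoLocation {
    /** Greedily co-locate communication-heavy task pairs; returns task -> node id. */
    static Map<Integer, Integer> place(List<TaskPair> pairs, List<Node> nodes,
                                       Map<Integer, Double> cpuDemand) {
        // Heaviest communication edges first, so they are most likely to share a node.
        List<TaskPair> sorted = new ArrayList<>(pairs);
        sorted.sort(Comparator.comparingDouble(TaskPair::commCost).reversed());
        Map<Integer, Integer> assignment = new HashMap<>();
        for (TaskPair p : sorted) {
            for (int task : new int[] {p.taskA(), p.taskB()}) {
                if (assignment.containsKey(task)) continue;
                double demand = cpuDemand.get(task);
                int partner = (task == p.taskA()) ? p.taskB() : p.taskA();
                Integer partnerNode = assignment.get(partner);
                Node target = null;
                // Prefer the node that already hosts the partner task, if it has room.
                if (partnerNode != null && nodes.get(partnerNode).canHost(demand)) {
                    target = nodes.get(partnerNode);
                } else {
                    // Otherwise fall back to first-fit on any node with spare capacity.
                    for (Node n : nodes) { if (n.canHost(demand)) { target = n; break; } }
                }
                if (target != null) { target.place(demand); assignment.put(task, target.id); }
            }
        }
        return assignment;
    }
}
```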

Triggering rescheduling [16], [17] at runtime to reallocate tasks and resources for streaming applications is one way to address this challenge. However, runtime scheduling raises its own problems, such as how to keep stream processing uninterrupted, how to effectively reduce the network delay of processing data tuples, and how to balance the computational resources of nodes. Fig. 1 shows the throughput variation of an example application over time in the Storm system. To create a resource utilization bottleneck on some nodes, fewer compute nodes than required are purposely used to run a streaming application. As can be seen, when the topology is running in [0s, 15s], its throughput stays at a relatively low level. The main reason for the low throughput may be that tasks with high communication load are on different nodes and/or some nodes are short of computing resources. To improve the throughput of the system, we could restart the whole topology to place the tasks with high communication load on the same node and move part of the tasks from the nodes with limited computing resources to those with idle computing resources or to new nodes. However, this rescheduling process seriously lowers the throughput during the interval [18s, 27s] and causes a slow pickup during [27s, 33s], which obviously affects the user's experience. This is just a simple case, but it demonstrates the necessity of a runtime-aware mechanism that can monitor node resource consumption and communication among tasks, dynamically balance the node resource load, and deploy tasks based on their communication load at runtime.
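
As a toy illustration of what such runtime awareness might check before triggering a partial migration (the thresholds and metric names below are hypothetical, not Er-Stream's actual conditions):

```java
// Hypothetical runtime check; thresholds are illustrative, not the paper's values.
class RuntimeTrigger {
    static final double CPU_HIGH_WATERMARK = 0.85;   // assumed node CPU utilization limit
    static final double REMOTE_TRAFFIC_RATIO = 0.60; // assumed share of tuples crossing nodes

    /** Decide whether to migrate a few tasks rather than restart the whole topology. */
    static boolean shouldAdjust(double nodeCpuUtilization, double remoteTrafficShare) {
        return nodeCpuUtilization > CPU_HIGH_WATERMARK
            || remoteTrafficShare > REMOTE_TRAFFIC_RATIO;
    }
}
```

The point of such a check is that only the tasks involved in the offending hotspot move, so the rest of the data stream keeps flowing during the adjustment.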

In addition, after a streaming application is mapped to a directed acyclic graph (DAG), the critical path of the DAG reflects the response time of the system. When none of the tasks running on a node is on a critical path, the node does not have to run at full capacity, as doing so inevitably results in high energy consumption. With a good method to dynamically adjust the working state of compute nodes, the energy efficiency of the system can be improved.
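
For reference, the critical path of a DAG can be obtained as the longest weighted path over a topological order of its operators; the sketch below uses hypothetical per-operator time weights and an adjacency-list representation (the paper's own models follow in Section 3).

```java
import java.util.*;

// Longest (critical) path length in a DAG, using Kahn's topological order.
// Vertex weights approximate per-operator processing times; edges are operator links.
class CriticalPath {
    static double length(int n, double[] weight, List<List<Integer>> adj) {
        int[] indeg = new int[n];
        for (List<Integer> successors : adj) for (int v : successors) indeg[v]++;
        Deque<Integer> queue = new ArrayDeque<>();
        double[] dist = new double[n];
        for (int v = 0; v < n; v++) {
            dist[v] = weight[v];               // path consisting of the vertex alone
            if (indeg[v] == 0) queue.add(v);   // sources start the topological order
        }
        double best = 0.0;
        while (!queue.isEmpty()) {
            int u = queue.poll();
            best = Math.max(best, dist[u]);
            for (int v : adj.get(u)) {
                dist[v] = Math.max(dist[v], dist[u] + weight[v]);
                if (--indeg[v] == 0) queue.add(v);
            }
        }
        return best; // approximate end-to-end response time along the critical path
    }
}
```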

Based on the above observations, this paper proposes an energy efficient and runtime-aware framework (Er-Stream). It addresses three questions: (1) when and how to reschedule an application topology based on the fluctuation of the data stream, (2) when to perform reliable task migration based on node resource consumption, and (3) how to dynamically adjust nodes' CPU frequencies based on their resource loads.

As discussed, Er-Stream is proposed to improve the throughput and reduce the latency of a distributed stream computing system. Our contributions are summarized as follows:

  (1) Investigate task placement, resource constraints and energy consumption under fluctuating data streams, and formalize the scheduling problem by modeling stream applications, resource constraints and energy consumption;

  (2) Propose a stream application scheduling algorithm that deploys tasks with potential communication load on the same node in the DAG initialization phase, and evaluate the resource allocation scheme at runtime to determine the necessity of making partial task adjustments;

  (3) Propose a runtime-aware scheduling algorithm to avoid excessive consumption of node resources by determining the necessity of task migration, and dynamically adjust nodes' CPU frequencies based on resource usage information to lower energy consumption;

  (4) Evaluate the system throughput, response time and energy consumption of the proposed scheduling framework.

Experiments are conducted on real data and the results demonstrate the effectiveness of the Er-Stream framework.

The rest of this paper is organized as follows. Section 2 describes the background knowledge; Section 3 introduces the system models, including the DAG model, the resource model and the energy consumption model; Section 4 formalizes the scheduling problem and provides optimization schemes; Section 5 introduces the Er-Stream framework and its main algorithms; Section 6 evaluates the performance of Er-Stream; Section 7 reviews related work; and Section 8 concludes our work.

Section snippets

Background

Scheduling strategies in a stream computing system determine the allocation of stream applications to compute nodes. When creating a topology for a streaming application, users can define the parallelism of components and the amount of resources to be used by the topology. Storm, as one of the most popular distributed stream computing systems, provides four built-in scheduling strategies [9]: EvenScheduler, IsolationScheduler, MultitenantScheduler and ResourceAwareScheduler.
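
For example, in Storm the parallelism of each component and the number of worker processes are declared when the topology is built and submitted. The sketch below is a standard WordCount-style declaration using Storm's public API; SentenceSpout, SplitSentenceBolt and WordCountBolt are placeholder user classes, not code from this paper.

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // Parallelism hints set the number of executors per component.
        builder.setSpout("sentences", new SentenceSpout(), 2);   // placeholder spout class
        builder.setBolt("split", new SplitSentenceBolt(), 4)     // placeholder bolt class
               .shuffleGrouping("sentences");
        builder.setBolt("count", new WordCountBolt(), 4)         // placeholder bolt class
               .fieldsGrouping("split", new Fields("word"));

        Config conf = new Config();
        conf.setNumWorkers(3); // worker processes spread across the cluster
        StormSubmitter.submitTopology("word-count", conf, builder.createTopology());
    }
}
```

How these executors are then mapped to worker slots and compute nodes is exactly what the built-in schedulers listed above decide.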

System models

Before formalizing the scheduling problem and introducing our solution, we first model stream applications, resources and energy consumption in stream computing environments.
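
The formal definitions appear in the full text; as a rough orientation, energy models in DVFS-based studies commonly combine a static term with a frequency-dependent dynamic term, for example (an assumed generic form, not necessarily the paper's exact model):

P_i(f_i) = P_i^{\mathrm{static}} + \alpha_i f_i^{\gamma}, \qquad E_i = \int_{t_0}^{t_1} P_i\bigl(f_i(t)\bigr)\,\mathrm{d}t,

where f_i denotes the CPU frequency of node v_i, \alpha_i is a hardware-dependent constant, and \gamma (typically close to 3 for CMOS processors) captures the superlinear growth of dynamic power with frequency.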

Problem statement and optimization

In this section, we formalize the scheduling problem in stream computing systems and present our optimization schemes for DAG initialization and runtime scheduling, and strategies for energy saving.
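
Since the highlights mention a Karush–Kuhn–Tucker (KKT) based reduction of energy consumption, a generic sketch of how such a formulation is usually posed may help orient the reader; the symbols below (w_i for per-node workload, D for a latency bound) are assumptions for illustration, and the paper's exact program is given in this section of the full text:

\min_{f_1,\dots,f_n} \sum_{i=1}^{n} \bigl(P_i^{\mathrm{static}} + \alpha_i f_i^{\gamma}\bigr)
\quad \text{s.t.} \quad \frac{w_i}{f_i} \le D, \quad f_i^{\min} \le f_i \le f_i^{\max}, \quad i = 1,\dots,n.

Forming the Lagrangian L = \sum_i (P_i^{\mathrm{static}} + \alpha_i f_i^{\gamma}) + \sum_i \lambda_i (w_i/f_i - D) + \sum_i \mu_i (f_i^{\min} - f_i) + \sum_i \nu_i (f_i - f_i^{\max}), the KKT conditions require stationarity \gamma \alpha_i f_i^{\gamma-1} - \lambda_i w_i / f_i^{2} - \mu_i + \nu_i = 0, primal and dual feasibility, and complementary slackness \lambda_i (w_i/f_i - D) = 0. In particular, when the latency constraint on a node is inactive (\lambda_i = 0), minimizing power drives that node toward its lowest admissible frequency, which matches the intuition of slowing down nodes that are not on the critical path.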

Er-Stream: Framework and algorithms

Based on the above formal modeling and analysis, we propose and implement Er-Stream, an energy efficient and runtime-aware framework for stream computing systems. To better describe the proposal, this section discusses its overall framework and key algorithms, including the DAG initialization algorithm, the DAG runtime partial adjustment algorithm and the energy saving algorithm.

Performance evaluation

Experiments are conducted to evaluate whether the proposed scheduling algorithm can improve the system performance. In this section, the experimental environment and parameter settings are first discussed, followed by the result analysis of two stream applications, Top_N and WordCount.

Related work

In this section, we review three major categories of related work: stream application deployment optimization, resource constraints, and reliable scheduling and energy consumption in stream computing. A summary of the comparison between our work and other closely related works is given in Table 5.

Conclusions and future work

In a fluctuating data stream environment, minimizing network communication and energy consumption while keeping node loads within their constraints are the goals of a system implementation. Achieving these goals relies on a system that can intelligently monitor data stream size and node resource utilization to adaptively adjust instance deployment, and sense data center energy consumption to tailor the CPU frequency of each node accordingly. In this work, we attempt to optimize the system performance

CRediT authorship contribution statement

Dawei Sun: Conceptualization, Methodology, Validation, Writing – original draft, Funding acquisition. Yijing Cui: Validation, Investigation, Writing – review & editing. Minghui Wu: Investigation, Data curation, Writing – review & editing. Shang Gao: Formal analysis, Investigation, Writing – review & editing. Rajkumar Buyya: Methodology, Writing – review & editing, Supervision, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grant No. 61972364; the Fundamental Research Funds for the Central Universities under Grant No. 265QZ2021001; and the Melbourne-Chindia Cloud Computing (MC3) Research Network.

References (41)

  • Apache, Samza, ...
  • Apache, Storm, ...
  • J. Zhu et al., Batch process modeling and monitoring with local outlier factor, IEEE Trans. Control Syst. Technol. (2019)
  • Twitter, Twitter, ...
  • LinkedIn, LinkedIn, ...
  • H. Wu, Research Proposal: Reliability Evaluation of the Apache Kafka Streaming System, in: 2019 IEEE International...
  • Alibaba, Aliyun, ...
  • Y. Hou, X. Zhao, Q. Li, J. Chen, Y. Li, Z. Zheng, Solving Large-Scale NP-Complete Problem with an Optical Solver Driven...
  • D. Sun et al., Rethinking elastic online scheduling of big data streaming applications over high-velocity continuous data streams, J. Supercomput. (2018)
  • J. Huang et al., Dynamic DAG scheduling on multiprocessor systems: Reliability, energy, and makespan, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. (2020)
    Dawei Sun is an Associate Professor in the School of Information Engineering, China University of Geosciences, Beijing, P.R. China. He received his Ph.D. degree in computer science from Northeastern University, China in 2012, and conducted postdoctoral research in the Department of Computer Science and Technology at Tsinghua University, China in 2015. His current research interests include big data computing, cloud computing and distributed systems. In these areas, he has authored over 70 journal and conference papers.

    Yijing Cui is a postgraduate student at the School of Information Engineering, China University of Geosciences, Beijing, China. She received her Bachelor Degree in Network Engineering from Zhengzhou University of Aeronautics, Zhengzhou, China in 2020. Her research interests include big data stream computing, data analytics and distributed systems.

    Minghui Wu is a postgraduate student at the School of Information Engineering, China University of Geosciences, Beijing, China. He received his Bachelor Degree in Network Engineering from Zhengzhou University of Aeronautics, Zhengzhou, China in 2020. His research interests include big data stream computing, distributed systems and blockchain.

    Shang Gao received her Ph.D. degree in computer science from Northeastern University, China in 2000. She is currently a Senior Lecturer in the School of Information Technology, Deakin University, Geelong, Australia. Her current research interests include distributed systems, cloud computing and cyber security.

    Rajkumar Buyya is a Redmond Barry Distinguished Professor and Director of the Cloud Computing and Distributed Systems (CLOUDS) Laboratory at the University of Melbourne, Australia. He is also serving as the founding CEO of Manjrasoft, a spin-off company of the University, commercializing its innovations in Cloud Computing. He has authored over 750 publications and four text books. He is one of the highly cited authors in computer science and software engineering worldwide (h-index 154 with 125,000+ citations). He served as the founding Editor-in-Chief (EiC) of IEEE Transactions on Cloud Computing and now serving as EiC of Journal of Software: Practice and Experience.
