An energy efficient and runtime-aware framework for distributed stream computing systems
Introduction
Data-intensive services, such as social networking, stock trading and weather monitoring, are becoming increasingly common and generate massive amounts of data every second. At the same time, more and more applications emphasize real-time performance and accuracy, placing higher demands on stream processing. For example, security faces great challenges in large-scale computing environments [1], [2], where timeliness is crucial in security checks, and real-time computing allows data to be analyzed and processed quickly to extract useful information. Real-time computing is also used in fields such as education [3], road traffic [4], and environmental monitoring [5].
To meet this demand, a variety of stream processing frameworks have emerged, including Spark [6], Heron [7], Samza [8] and Storm [9]. Built on batch processing [10], Spark divides an incoming data stream into short batches. Heron is a real-time, fault-tolerant distributed stream processing system developed by Twitter [11] as an open-source project. Samza started out as a stream processing solution for LinkedIn [12]; its most distinctive feature is that its construction relies heavily on the log-based Kafka [13]. Storm is one of the most popular open-source big data stream computing systems and has been widely used by many well-known companies and organizations, such as Twitter and Alibaba [14].
A stream computing system has multiple compute nodes that collaborate to process tasks, and high data transfer latency between nodes can degrade system performance. Communication time and data transfer latency can be reduced effectively by restricting data transfer to the same node or between nearby nodes. In addition, capability differences between nodes lead to differences in task execution and data transfer performance. Task placement for streaming applications can be mapped to an NP-complete problem [15]. Since the computational resources of a node are limited, data loss may occur when node resources cannot meet the computational demand. These factors make the scheduling of streaming applications challenging.
Triggering rescheduling [16], [17] at runtime to reallocate tasks and resources for streaming applications is one way to address this challenge. However, runtime scheduling has its own problems, such as how to keep stream processing uninterrupted, how to reduce the network delay of processing data tuples, and how to balance the computational load across nodes. Fig. 1 shows the throughput of an example application over time in a Storm system. To create a resource utilization bottleneck on some nodes, fewer compute nodes than required are purposely used to run the streaming application. As can be seen, while the topology runs during [0s, 15s], its throughput stays at a relatively low level. The main reasons for the low throughput may be that tasks with high communication load are placed on different nodes and/or some nodes are short of computing resources. To improve the throughput, we can restart the whole topology so that tasks with high communication load are placed on the same node, and move part of the tasks from nodes with limited computing resources to nodes with idle computing resources or to new nodes. However, this rescheduling process causes a severe throughput drop during [18s, 27s] and a slow recovery during [27s, 33s], which obviously affects the user experience. This is just a simple case, but it demonstrates the necessity of a runtime-aware mechanism that monitors node resource consumption and inter-task communication, dynamically balances node resource load, and deploys tasks based on their communication load at runtime.
In addition, after a streaming application is mapped to a directed acyclic graph (DAG), a critical path of the DAG reflects the response time of the system. When none of the tasks running on a node is on a critical path, the node does not need to run at full capacity, as doing so inevitably results in high energy consumption. With a good method to dynamically adjust the working state of compute nodes, the energy efficiency of the system can be improved.
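The critical-path observation above can be made concrete with a minimal sketch (ours, not the paper's model; Section 3 gives the actual formalization). The longest execution-time path through the task DAG lower-bounds response time, so tasks off this path have slack that can later be traded for lower CPU frequency:

```python
from collections import defaultdict, deque

def critical_path_length(tasks, edges):
    """Longest execution-time path through a task DAG.

    tasks: dict mapping task name -> execution time (hypothetical units)
    edges: list of (upstream, downstream) task pairs
    """
    succ = defaultdict(list)
    indeg = {t: 0 for t in tasks}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    # dist[t] = longest path ending at t; filled in topological order (Kahn)
    dist = {t: tasks[t] for t in tasks}
    q = deque(t for t in tasks if indeg[t] == 0)
    while q:
        u = q.popleft()
        for v in succ[u]:
            dist[v] = max(dist[v], dist[u] + tasks[v])
            indeg[v] -= 1
            if indeg[v] == 0:
                q.append(v)
    return max(dist.values())

# Hypothetical diamond-shaped topology: src fans out to a and b, both feed sink.
tasks = {'src': 2, 'a': 3, 'b': 1, 'sink': 2}
edges = [('src', 'a'), ('src', 'b'), ('a', 'sink'), ('b', 'sink')]
print(critical_path_length(tasks, edges))  # src -> a -> sink: 2 + 3 + 2 = 7
```

Here the path through `b` takes only 5 time units against a critical path of 7, so the node hosting `b` has slack without lengthening the response time.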
Based on the above observations, this paper proposes an energy-efficient and runtime-aware framework (Er-Stream). It tries to resolve: (1) when and how to reschedule an application topology based on the fluctuation of the data stream, (2) when to perform reliable task migration based on node resource consumption, and (3) how to dynamically adjust each node's CPU frequency based on its resource load.
As discussed, Er-Stream is proposed to improve the throughput and reduce the latency of a distributed stream computing system. Our contributions are summarized as follows:
- (1) Investigate task placement, resource constraint and energy consumption of fluctuating data streams, and formalize the scheduling problem by modeling stream application, resource constraint and energy consumption;
- (2) Propose a stream application scheduling algorithm that deploys tasks with potential communication load on the same node in the DAG initialization phase and evaluates the resource allocation scheme at runtime to determine the necessity of making partial task adjustments;
- (3) Propose a runtime-aware scheduling algorithm to avoid excessive consumption of node resources by determining the necessity of task migration, and adjust each node's CPU frequency dynamically based on resource usage information to lower the energy consumption;
- (4) Evaluate the system throughput, response time and energy consumption of the proposed scheduling framework.
Experiments are conducted on real data and the results demonstrate the effectiveness of the Er-Stream framework.
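The frequency adjustment in contribution (3) rests on a simple trade-off: dynamic CPU power grows roughly cubically with frequency, so a node whose tasks have slack can run slower without delaying the critical path. A hedged sketch of such a selection rule (our own illustration, assuming a discrete set of available frequencies, not the paper's actual algorithm):

```python
def choose_frequency(freqs, work, slack_deadline):
    """Pick the lowest available CPU frequency (cycles/s) that still
    finishes `work` cycles within `slack_deadline` seconds.

    Since dynamic power scales roughly with f^3, running non-critical
    tasks at the lowest feasible frequency saves energy without
    lengthening the critical path.
    """
    for f in sorted(freqs):
        if work / f <= slack_deadline:
            return f
    return max(freqs)  # deadline infeasible; run flat out

# Hypothetical node with three P-states and 1.5 GHz-seconds of work due in 1 s.
print(choose_frequency([1e9, 2e9, 3e9], work=1.5e9, slack_deadline=1.0))  # 2e9
```

At 2 GHz the work completes in 0.75 s, within the slack, while drawing far less power than the 3 GHz setting would.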
The rest of this paper is organized as follows. Section 2 describes the background knowledge; Section 3 introduces the system models, including the DAG model, the resource model and the energy consumption model; Section 4 formalizes the scheduling problem and provides optimization schemes; Section 5 introduces the Er-Stream framework and its main algorithms; Section 6 evaluates the performance of the Er-Stream; Section 7 presents related work and Section 8 concludes our work.
Background
Scheduling strategies in a stream computing system determine the allocation of stream applications to compute nodes. When creating a topology for a streaming application, users can define the parallelism of components and the amount of resources to be used by the topology. Storm, as one of the most popular distributed stream computing systems, provides four built-in scheduling strategies [9]: EvenScheduler, IsolationScheduler, MultitenantScheduler and ResourceAwareScheduler.
System models
Before formalizing the scheduling problem and introducing our solution, we first model the stream application, resource and energy consumption in stream computing environments.
Problem statement and optimization
In this section, we formalize the scheduling problem in stream computing systems and present our optimization schemes for DAG initialization and runtime scheduling, and strategies for energy saving.
Er-Stream: Framework and algorithms
Based on the above formal modeling and analysis, we propose and implement Er-Stream, an energy efficient and runtime-aware framework for stream computing systems. To provide a better description of the proposal, this section discusses its overall framework and key algorithms, including DAG initialization algorithm, DAG runtime partial adjustment algorithm and energy saving algorithm.
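The DAG initialization idea, co-locating tasks with heavy mutual communication subject to node capacity, can be approximated greedily. The sketch below is our own simplified illustration under assumed inputs (per-task CPU demand, per-edge communication load, uniform node capacity), not Er-Stream's exact algorithm:

```python
def greedy_placement(tasks, edges, capacity):
    """Greedy communication-aware placement: visit task pairs in
    decreasing order of communication load and co-locate them when
    node capacity allows, opening new nodes otherwise.

    tasks: dict task -> CPU demand
    edges: dict (upstream, downstream) -> communication load
    capacity: uniform per-node CPU capacity
    """
    node_of = {}   # task -> node index
    load = []      # load[i] = CPU demand placed on node i
    def place(t, i):
        node_of[t] = i
        load[i] += tasks[t]
    for (u, v), _ in sorted(edges.items(), key=lambda e: -e[1]):
        for t in (u, v):
            if t not in node_of:
                partner = v if t == u else u
                i = node_of.get(partner)
                if i is not None and load[i] + tasks[t] <= capacity:
                    place(t, i)          # join the partner's node
                else:
                    load.append(0)       # open a fresh node
                    place(t, len(load) - 1)
    for t in tasks:                      # isolated tasks get their own nodes
        if t not in node_of:
            load.append(0)
            place(t, len(load) - 1)
    return node_of

# Hypothetical 3-task chain: a->b is the hot edge, b->c is light.
print(greedy_placement({'a': 1, 'b': 1, 'c': 1},
                       {('a', 'b'): 10, ('b', 'c'): 1}, capacity=2))
```

With capacity 2, `a` and `b` share a node so the heavy edge stays local, while `c` spills to a second node; the runtime partial adjustment algorithm then refines such a placement as loads shift.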
Performance evaluation
Experiments are conducted to evaluate whether the proposed scheduling algorithm can improve the system performance. In this section, the experimental environment and parameter settings are first discussed, followed by the result analysis of two stream applications, Top_N and WordCount.
Related work
In this section, we review three major categories of related work: stream application deployment optimization, resource constraints, and reliable scheduling & energy consumption in stream computing. The summary of the comparison between our work and other closely related works is given in Table 5.
Conclusions and future work
In a fluctuating data stream environment, minimizing network communication and energy consumption, and keeping nodes meeting load constraints are the goals of a system implementation. Achieving these goals relies on the system that can intelligently monitor data stream size and node resource utilization to adaptively adjust instance deployment, and sense data center energy consumption to tailor the CPU frequency of node accordingly. In this work, we attempt to optimize the system performance
CRediT authorship contribution statement
Dawei Sun: Conceptualization, Methodology, Validation, Writing – original draft, Funding acquisition. Yijing Cui: Validation, Investigation, Writing – review & editing. Minghui Wu: Investigation, Data curation, Writing – review & editing. Shang Gao: Formal analysis, Investigation, Writing – review & editing. Rajkumar Buyya: Methodology, Writing – review & editing, Supervision, Funding acquisition.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work is supported by the National Natural Science Foundation of China under Grant No. 61972364; the Fundamental Research Funds for the Central Universities under Grant No. 265QZ2021001; and the Melbourne-Chindia Cloud Computing (MC3) Research Network.
References (41)
- et al., Throughput optimization for storm-based processing of stream data on clouds, Future Gener. Comput. Syst. (2020)
- et al., T3-scheduler: A topology and traffic aware two-level scheduler for stream processing systems in a heterogeneous cluster, Future Gener. Comput. Syst. (2018)
- et al., I-Scheduler: Iterative scheduling for distributed stream processing systems, Future Gener. Comput. Syst. (2021)
- Y. Chen, X. Li, Research on big data application in intelligent safety supervision, in: 2017 IEEE 2nd International...
- J. Zhang, J. Chen, Y. Zheng, H. Yuan, Applications of big data in public safety emergency services, in: 2017 IEEE 2nd...
- F. Matsebula, E. Mnkandla, A big data architecture for learning analytics in higher education, in: 2017 IEEE AFRICON,...
- A. Imawan, J. Kwon, A timeline visualization system for road traffic big data, in: 2015 IEEE International Conference...
- A. Juneja, N.N. Das, Big Data Quality Framework: Pre-Processing Data in Weather Monitoring Application, in: 2019...
- S. Sameti, M. Wang, D. Krishnamurthy, Stride: Distributed Video Transcoding in Spark, in: 2018 IEEE 37th International...
- Twitter, Heron, ...
- Batch process modeling and monitoring with local outlier factor, IEEE Trans. Control Syst. Technol.
- Rethinking elastic online scheduling of big data streaming applications over high-velocity continuous data streams, J. Supercomput.
- Dynamic DAG scheduling on multiprocessor systems: Reliability, energy, and makespan, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst.
Dawei Sun is an Associate Professor in the School of Information Engineering, China University of Geosciences, Beijing, P.R. China. He received his Ph.D. degree in computer science from Northeastern University, China in 2012, and conducted the Postdoctoral research in the department of computer science and technology at Tsinghua University, China in 2015. His current research interests include big data computing, cloud computing and distributed systems. In these areas, he has authored over 70 journal and conference papers.
Yijing Cui is a postgraduate student at the School of Information Engineering, China University of Geosciences, Beijing, China. She received her Bachelor Degree in Network Engineering from Zhengzhou University of Aeronautics, Zhengzhou, China in 2020. Her research interests include big data stream computing, data analytics and distributed systems.
Minghui Wu is a postgraduate student at the School of Information Engineering, China University of Geosciences, Beijing, China. He received his Bachelor Degree in Network Engineering from Zhengzhou University of Aeronautics, Zhengzhou, China in 2020. His research interests include big data stream computing, distributed systems and blockchain.
Shang Gao received her Ph.D. degree in computer science from Northeastern University, China in 2000. She is currently a Senior Lecturer in the School of Information Technology, Deakin University, Geelong, Australia. Her current research interests include distributed system, cloud computing and cyber security.
Rajkumar Buyya is a Redmond Barry Distinguished Professor and Director of the Cloud Computing and Distributed Systems (CLOUDS) Laboratory at the University of Melbourne, Australia. He is also serving as the founding CEO of Manjrasoft, a spin-off company of the University, commercializing its innovations in Cloud Computing. He has authored over 750 publications and four text books. He is one of the highly cited authors in computer science and software engineering worldwide (h-index 154 with 125,000+ citations). He served as the founding Editor-in-Chief (EiC) of IEEE Transactions on Cloud Computing and now serving as EiC of Journal of Software: Practice and Experience.