I-Scheduler: Iterative scheduling for distributed stream processing systems

https://doi.org/10.1016/j.future.2020.11.011

Highlights

  • I-Scheduler is an iterative K-way graph partitioning-based algorithm.

  • It reduces the task graph size so that it can be solved by optimisation software.

  • An iterative heuristic fallback algorithm is used to solve large problem sizes.

Abstract

Task allocation in Data Stream Processing Systems (DSPSs) has a significant impact on performance metrics such as data processing latency and system throughput. An application processed by DSPSs can be represented as a Directed Acyclic Graph (DAG), where each vertex represents a task and the edges show the dataflow between the tasks. Task allocation can be defined as the assignment of the vertices in the DAG to the physical compute nodes such that the data movement between the nodes is minimised. Finding an optimal task placement for DSPSs is NP-hard. Thus, approximate scheduling approaches are required to improve the performance of DSPSs. In this paper, we propose a heuristic scheduling algorithm which reliably and efficiently finds highly communicating tasks by exploiting graph partitioning algorithms and a mathematical optimisation software package. We evaluate the communication cost of our method using three micro-benchmarks, showing that we can achieve results that are close to optimal. We further compare our scheduler with two popular existing schedulers, R-Storm and Aniello et al.’s ‘Online scheduler’ using two real-world applications. Our experimental results show that our proposed scheduler outperforms R-Storm, increasing throughput by up to 30%, and improves on the Online scheduler by 20%–86% as a result of finding a more efficient schedule.1

Introduction

In the era of big data, with streaming applications such as social media, surveillance monitoring and real-time search generating large volumes of data, efficient Data Stream Processing Systems (DSPSs) have become essential. More than 30,000 GB of data is generated every second and the rate is accelerating [2]. According to IBM,2 90% of the data that existed in 2012 had been created in the two years prior. These data sources need to be analysed to gain insights and find trends, such as determining the most frequent events in a continuous dataflow over a given period of time. DSPSs are designed to process such dataflows by operating on data streams: continuous, unbounded sequences of data items, each carrying a number of attributes, processed in the order in which they arrive. In comparison, batch processing systems store the data before performing ad hoc queries, which makes them unsuitable for real-time analysis.

A general-purpose DSPS faces a number of competing challenges, including task allocation, scalability, fault tolerance, QoS, degree of parallelism and state management. It is not possible to optimise for all of these challenges at the same time, as there are tradeoffs between them; the requirements of the specific streaming application determine which challenges need to be addressed. While each of these challenges is an active area of research, we focus on scheduling in this paper, as low-latency response times are a consistent priority across many streaming applications. The scheduling policy determines how tasks are distributed in the DSPS, which can have a significant impact on performance metrics such as tuple latency (the time taken to process a tuple) and system throughput (the number of tuples processed in a given time) [3]. A scheduling policy needs to strike a balance between system performance, the use of system resources and run-time overhead.

Finding an optimal placement for DSPSs is NP-hard [4], [5], [6]. Thus, approximate approaches are required to improve the performance of DSPSs [7]. An efficient task scheduler will adapt to changes in the communication pattern of a streaming application, ensuring that the communication between compute nodes, referred to as inter-node communication, is minimised. Specifically, by placing highly communicating tasks on the same node, communication between compute nodes can be reduced [8], [9], [10], [11]. The term “highly communicating tasks”, which is used throughout this paper, refers to a pair or group of tasks which exchange a larger amount of data than other neighbouring tasks. Additionally, prioritising the use of higher-capacity compute nodes allows more highly communicating tasks to be co-located within a compute node, requiring fewer nodes to be used, which helps to further reduce inter-node communication. To achieve this, the scheduler monitors the run-time communication of a streaming application, logging the communication rates between tasks and the load of each task, which are then used when rescheduling.
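To make the objective concrete, the short sketch below (ours, not the paper's; the names `comm_rate` and `assignment` are hypothetical) computes the inter-node traffic induced by a given task-to-node assignment from monitored tuple rates. A scheduler that co-locates heavily communicating task pairs reduces exactly this quantity.

```python
# Hypothetical sketch: inter-node communication cost of a task-to-node assignment.
# comm_rate maps a pair of tasks to the monitored rate (e.g. tuples/s) exchanged between them;
# assignment maps each task to the compute node it is placed on.

def inter_node_cost(comm_rate, assignment):
    """Sum of communication rates over task pairs placed on different nodes."""
    return sum(rate
               for (src, dst), rate in comm_rate.items()
               if assignment[src] != assignment[dst])

# The edge (a, b) is "hot": co-locating a and b drops the cost from 1000 to 50.
rates = {("a", "b"): 1000, ("b", "c"): 50}
print(inter_node_cost(rates, {"a": "n1", "b": "n2", "c": "n2"}))  # 1000
print(inter_node_cost(rates, {"a": "n1", "b": "n1", "c": "n2"}))  # 50
```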

Existing work on task schedulers that aim to minimise inter-node communication has a number of limitations. Firstly, the compute nodes might be underutilised, which results in using more compute nodes than required. Secondly, many of the schedulers are not designed for heterogeneous clusters, which is an important requirement for many deployments, since clusters evolve over time as new hardware is added. Further, a multi-user homogeneous cluster in which not all of the system resources are available can be viewed as a heterogeneous system. Finally, offline schedulers are incapable of adapting to run-time changes in the traffic patterns of streaming applications. In this paper, we aim to address these limitations and propose I-Scheduler, for DAG-based DSPSs, which partitions the DAG so as to minimise the communication between the parts, such that inter-node communication is minimised when each part is assigned to a compute node. The contributions of this paper are summarised as follows:

  • We propose I-Scheduler, for homogeneous and heterogeneous DSPSs, which reduces the size of the task graph by fusing highly communicating tasks, allowing mathematical optimisation software to be used to find an efficient task assignment. A fallback heuristic is also proposed for cases where the optimisation software cannot be used; it iteratively partitions the application graph based on the capacities of the heterogeneous nodes and assigns each partition to a node of matching relative capacity (an illustrative sketch of this iterative structure is given after this list).

  • We evaluate the communication cost of I-Scheduler by comparing it to a theoretically optimal scheduler, implemented in CPLEX, when run on three micro-benchmarks, each representing a different communication pattern. The results show that I-Scheduler can achieve results that are close to optimal in a number of different cluster configurations.

  • We implement the proposed scheduler in Apache Storm 1.1.1 and show through experimental results that it can outperform the state-of-the-art R-Storm [9] and Aniello et al.’s ‘Online scheduler’ [8] (for brevity, referred to as OLS in this paper). The results show that I-Scheduler outperforms OLS, increasing throughput by 20%–86%, and outperforms R-Storm by 3%–30% for the real-world applications.
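As referenced in the first contribution above, the fallback heuristic iteratively partitions the graph in proportion to node capacities. The sketch below is our illustration of that iterative structure under stated assumptions, not the paper's algorithm: it visits nodes from largest to smallest capacity, splits off a part whose total task load matches that node's share of the remaining capacity, and assigns the part to the node. A real implementation would perform each split with a graph partitioner (such as METIS) so that the edge cut is also minimised; the greedy split here only keeps the sketch self-contained and runnable.

```python
# Illustrative capacity-proportional iterative partitioning (not the paper's algorithm).
# A real fallback would split with a graph partitioner (e.g. METIS) to also minimise
# the edge cut; the greedy split_off below only keeps this sketch self-contained.

def split_off(tasks, load, target):
    """Greedily pick tasks, heaviest first, until their summed load reaches `target`."""
    part, acc = [], 0.0
    for t in sorted(tasks, key=load.get, reverse=True):
        if acc >= target:
            break
        part.append(t)
        acc += load[t]
    return part

def iterative_fallback(tasks, load, capacities):
    """capacities: {node: capacity}. Returns {node: [tasks]} sized per relative capacity."""
    remaining = list(tasks)
    placement = {}
    for node, cap in sorted(capacities.items(), key=lambda kv: -kv[1]):
        total_cap = sum(c for n, c in capacities.items() if n not in placement)
        total_load = sum(load[t] for t in remaining)
        target = total_load * cap / total_cap      # this node's share of the remaining load
        part = split_off(remaining, load, target)
        placement[node] = part
        remaining = [t for t in remaining if t not in part]
    return placement

# Example: the "big" node receives roughly its 60% share of the load, the "small" node the rest.
print(iterative_fallback(["t1", "t2", "t3", "t4"],
                         {"t1": 4, "t2": 3, "t3": 2, "t4": 1},
                         {"big": 6, "small": 4}))
# {'big': ['t1', 't2'], 'small': ['t3', 't4']}
```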

The rest of this paper is organised as follows. In Section 2, we discuss key related work. We then present the problem definition in Section 3. Section 4 describes our proposed scheduler, which is then followed by a comparison with an optimal scheduler in Section 5. Section 6 evaluates our proposed scheduler with two real-world applications. Finally, Section 7 concludes the paper.

Section snippets

Related work

There is a broad range of research on task placement in DSPSs with different optimisation goals. Generally, there are three main approaches to tackle each optimisation problem [12]: mathematical programming, graph-based approximation and heuristics. In the following, we discuss a number of scheduling algorithms that use different approaches to meet specific optimisation goals.

In the mathematical programming approach, the scheduling problem is formulated as an optimisation problem and solved

System model and problem definition

Data stream processing systems, such as Apache S4 [32], Flink,4 Storm5 and Twitter Heron [33], process large volumes of data, on the fly, as it flows through the system, without first being stored. This makes them ideal for providing real-time analysis of fast flowing information, with a common example being Twitter’s top trending topics, where real-time streams of tweets are analysed to determine the most popular topics at a given time.
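The placement problem sketched in the abstract (assign the DAG vertices to compute nodes so that data movement between nodes is minimised) is commonly written as a binary assignment program. The formulation below is a generic sketch in our own notation, not necessarily the exact model used in the paper: $w_{uv}$ is the monitored tuple rate on edge $(u,v)$, $\ell_u$ is the load of task $u$, $C_n$ is the capacity of node $n$, and $x_{u,n}=1$ if task $u$ is placed on node $n$.

```latex
% Generic sketch of the task-allocation objective (our notation, not necessarily the paper's).
\begin{align}
  \min_{x}\quad & \sum_{(u,v)\in E} w_{uv}\Bigl(1-\sum_{n\in N} x_{u,n}\,x_{v,n}\Bigr)
      && \text{(traffic that crosses node boundaries)}\\
  \text{s.t.}\quad & \sum_{n\in N} x_{u,n} = 1, \quad \forall u\in V
      && \text{(each task is placed on exactly one node)}\\
  & \sum_{u\in V} \ell_u\, x_{u,n} \le C_n, \quad \forall n\in N
      && \text{(node capacity is respected)}\\
  & x_{u,n}\in\{0,1\}, \quad \forall u\in V,\ n\in N
\end{align}
```

The quadratic objective is what makes the exact problem expensive for larger task graphs, which motivates shrinking the graph before handing it to a solver such as CPLEX.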

I-Scheduler algorithm

As discussed in Section 3, using optimisation software to find an optimal solution to the scheduling problem is often not practical. One approach is to reduce the size of the task graph by fusing each pair of tasks into a single task [12]. However, this has a number of drawbacks. Firstly, fusing task pairs takes a localised view of the task communications and may be unreliable at placing communicating tasks within the same node. Secondly, this approach may not be able to scale to larger problem
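The snippet above is cut off, but the idea it builds on, shrinking the task graph by fusing tasks, can be illustrated with a generic heavy-edge coarsening pass. The sketch below is our illustration, not I-Scheduler's exact procedure: it repeatedly merges the two tasks joined by the heaviest remaining edge, accumulating weights between the fused groups, until at most `target_size` groups remain and the reduced graph can be handed to an exact solver. Because each step looks at the whole remaining graph rather than isolated task pairs, heavily communicating groups of tasks, not just pairs, end up inside a single fused task.

```python
# Illustrative heavy-edge coarsening (not I-Scheduler's exact procedure): repeatedly fuse
# the two tasks joined by the heaviest edge until at most `target_size` groups remain.

def coarsen(edges, num_tasks, target_size):
    """edges: {(u, v): weight} with u < v. Returns (edges between groups, task -> group)."""
    group = {t: t for t in range(num_tasks)}        # each task starts in its own group

    def find(t):                                    # follow merges to the group representative
        while group[t] != t:
            group[t] = group[group[t]]
            t = group[t]
        return t

    current, groups = dict(edges), num_tasks
    while groups > target_size and current:
        (u, v), _ = max(current.items(), key=lambda kv: kv[1])   # heaviest remaining edge
        group[find(v)] = find(u)                                 # fuse v's group into u's
        groups -= 1
        merged = {}                                              # rebuild edges between groups
        for (a, b), w in current.items():
            ra, rb = find(a), find(b)
            if ra != rb:
                key = (min(ra, rb), max(ra, rb))
                merged[key] = merged.get(key, 0) + w
        current = merged
    return current, {t: find(t) for t in range(num_tasks)}

# Linear chain 0-1-2-3 with a hot edge between tasks 1 and 2: after one fusion the hot
# pair shares a group and only the light edges remain between groups.
fused_edges, groups = coarsen({(0, 1): 10, (1, 2): 500, (2, 3): 10}, num_tasks=4, target_size=3)
print(fused_edges)   # {(0, 1): 10, (1, 3): 10}
print(groups)        # {0: 0, 1: 1, 2: 1, 3: 3}
```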

I-Scheduler versus optimal scheduler

In this section, we compare the communication cost and resolution time of I-Scheduler with a theoretical optimal scheduler when run on three micro-benchmarks. Each of these micro-benchmarks, shown in Fig. 3, is named after the shape of its topology and communication pattern (linear, diamond and star), and is based on the implementation originally presented in [9]. These are common communication patterns that can be found in real-world applications, which consist of either just one of these
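For reference, the three shapes can be written down as small weighted DAGs. The edge lists below are illustrative only; the operator counts, the rates and the exact star layout are placeholders, not the benchmark parameters from [9] or from the paper's Fig. 3.

```python
# Illustrative edge lists for the three micro-benchmark shapes; each entry is
# (upstream task, downstream task, rate). Counts and rates are placeholders only.

linear = [("src", "op1", 1.0), ("op1", "op2", 1.0), ("op2", "sink", 1.0)]

diamond = [("src", "op1", 0.5), ("src", "op2", 0.5),        # fan-out from the source
           ("op1", "sink", 0.5), ("op2", "sink", 0.5)]      # fan-in at the sink

star = ([(f"src{i}", "hub", 0.5) for i in (1, 2)] +         # several sources feed a hub,
        [("hub", f"op{i}", 0.5) for i in (1, 2)])           # which fans out again
```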

Experimental evaluation

In this section, we evaluate I-Scheduler on a homogeneous and heterogeneous cluster, using three micro-benchmarks and two real-world applications. These results are then compared with Aniello et al.’s ‘Online scheduler’ (referred to as ‘OLS’) [8] and R-Storm [9], previously discussed in Section 2, which were chosen for two main reasons. Firstly, OLS and R-Storm have the same optimisation criteria as I-Scheduler, which is to reduce the inter-node communication. Secondly, the source code for both

Conclusions and future work

In this paper, we have presented I-Scheduler, which iteratively uses graph partitioning to efficiently fuse groups of highly communicating tasks into a single task, in order to reduce the size of the application task graph. Once the task graph is sufficiently small, optimisation software is able to determine the task assignment within a user-defined time tolerance. In cases where the task graph is too large to be solved by the optimisation software or the user cannot tolerate any delays in

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (44)

  • Chakravarthy, S., et al., Stream Data Processing: A Quality of Service Perspective: Modeling, Scheduling, Load Shedding, and Complex Event Processing, Vol. 36 (2009).
  • Garey, M.R., et al., Computers and intractability: A guide to the theory of NP-completeness, J. Symbolic Logic (1983).
  • Srivastava, U., et al., Operator placement for in-network stream query processing.
  • Eidenbenz, R., et al., Task allocation for distributed stream processing.
  • Lakshmanan, G.T., et al., Placement strategies for internet-scale data stream systems, IEEE Internet Comput. (2008).
  • Aniello, L., Baldoni, R., Querzoni, L., Adaptive online scheduling in Storm, in: Proceedings of the 7th ACM International...
  • Peng, B., et al., R-Storm: Resource-aware scheduling in Storm.
  • Xu, J., et al., T-Storm: Traffic-aware online scheduling in Storm.
  • Chatzistergiou, A., et al., Fast heuristics for near-optimal task allocation in data stream processing over clusters.
  • Chu, W.W., et al., Task allocation in distributed data processing, IEEE Comput. (1980).
  • Cardellini, V., et al., Optimal operator placement for distributed stream processing applications.
  • Cardellini, V., et al., Optimal operator replication and placement for distributed stream processing systems.

    Leila Eskandari received her Ph.D. in Computer Science from the University of Otago, New Zealand. She previously completed an M.Sc. degree from Sharif University of Technology in Network Engineering and a B.Sc. degree from Ferdowsi University of Mashhad in Computer Engineering, Iran. Her research interests include big data processing, distributed and cloud computing and computer networks.

    Jason Mair received his Ph.D. in Computer Science in 2015 and a B.Sc. Honors degree in 2010 from the University of Otago, New Zealand. His research interests include multicore architectures, green computing, big data processing, distributed and cloud computing.

    Zhiyi Huang received the B.Sc. degree in 1986 and the Ph.D. degree in 1992 in computer science from the National University of Defense Technology (NUDT) in China. He is an Associate Professor at the Department of Computer Science, University of Otago, New Zealand. He was a visiting professor at EPFL (Swiss Federal Institute of Technology Lausanne) and Tsinghua University in 2005, and a visiting scientist at MIT CSAIL in 2009. His research fields include parallel/distributed computing, multicore architectures, operating systems, green computing, cluster/grid/cloud computing, high-performance computing, and computer networks.

    David Eyers received his Ph.D. in Computer Science in 2006 from the University of Cambridge, UK, having previously attained his BE in Computer Engineering from the University of New South Wales in Sydney, Australia. He is an Associate Professor at the Department of Computer Science at the University of Otago in New Zealand, and a visiting research fellow at the University of Cambridge Computer Laboratory. He has broad research interests including green computing, information flow control, network security, and distributed and cloud computing.

1 This work is an extension of I-Scheduler (Eskandari et al., 2018 [1]).
