A novel multi-agent reinforcement learning approach for job scheduling in Grid computing

https://doi.org/10.1016/j.future.2010.10.009

Abstract

Grid computing utilizes distributed heterogeneous resources to support large-scale or complicated computing tasks, and an appropriate resource scheduling algorithm is fundamentally important for the success of Grid applications. Due to the complex and dynamic properties of Grid environments, traditional model-based methods may deliver poor scheduling performance in practice. Scalability and adaptability are among the key objectives of Grid job scheduling. In this paper, a novel multi-agent reinforcement learning method, called the ordinal sharing learning (OSL) method, is proposed for job scheduling problems, especially for realizing load balancing in Grids. The approach circumvents the scalability problem by using an ordinal distributed learning strategy, and realizes multi-agent coordination based on an information-sharing mechanism with limited communication. Simulation results show that the OSL method achieves the goal of load balancing effectively, and that its performance is comparable to that of a centralized scheduling algorithm in most cases. The convergence and adaptability of the proposed method are also illustrated.

Research highlights

  • We propose a novel multi-agent reinforcement learning method for job scheduling in Grid computing.
  • The proposed approach circumvents the scalability problem by using an ordinal distributed learning strategy.
  • We realize multi-agent coordination based on an information-sharing mechanism with limited communication.
  • Simulation results show that the OSL method can achieve the goal of load balancing effectively.

Introduction

Multi-agent resource allocation is the process of distributing a number of items amongst a number of agents, and it is a central concern in both computer science and economics [1]. It is relevant to a wide range of application domains, such as network routing [2], public transportation [3], and Grid computing [4], [5], [6], of which Grid computing is one of the most important applications of resource allocation and scheduling [7].

Grid computing enables the sharing, selection, and aggregation of geographically distributed heterogeneous resources, and it has become an important paradigm for supporting complicated computing problems. However, some technical challenges remain for Grids [5]. For a majority of Grid systems, the real and specific problem underlying Grid computing is coordinated resource scheduling and problem solving in dynamic, multi-institutional virtual organizations, where an effective and efficient scheduling algorithm is fundamentally important [8], [9]. Only with a feasible scheduling policy can a Grid speed up task processing and provide non-trivial services to users [10]. In the following, we study the job scheduling problem, whose key issue is to balance the load of the entire system while completing all jobs at hand as soon as possible (see Fig. 1).

In the past decade, there have been many advances in Grid job scheduling techniques. Various scheduling approaches, model-based or model-free, using either centralized or decentralized mechanisms, have been developed for Grids. On the one hand, many algorithms have been studied for job scheduling in traditional parallel and distributed systems, such as FPLTF (Fastest Processor to Largest Task First), WQR (Work Queue with Replication) and FCFS (First Come First Serve) [11]. On the other hand, extensive research has also been devoted to Grid scheduling problems. In traditional resource scheduling systems, such as Condor [12], PBS [13] and SGE [14], centralized schedulers work effectively because accurate, global information can be obtained. However, centralized or hierarchical resource allocation methods may suffer from a lack of scalability and fault tolerance, as well as a single point of failure [15]. To overcome the scalability problem, some decentralized scheduling algorithms have been proposed. However, most existing decentralized schedulers, for example in Condor-G [16] and AppLeS [17], perform individual scheduling policies regardless of the other schedulers’ decisions, which may lead to serious synchronization problems in resource management; herd behavior then arises because schedulers run without central oversight or communication [18], [19]. Conversely, if job scheduling is carried out under the assumption of coordination, as in Legion Federation [20] and Condor Flock P2P [21], the strong dependency on negotiation among schedulers and resources may lead to high communication overhead. Therefore, how to coordinate decentralized schedulers at a moderate communication cost is an important and open problem. Recent work addressing this problem appears in [22], where a collaborative model based on Random Early Detection (RED) strategies via gossiping achieves good scheduling performance.

Moreover, scheduling must adapt to the heterogeneity of resources, the variations in resource performance, and the diversity of applications, so an adaptive scheduling method is needed. Recently, a promising approach based on reinforcement learning (RL) has been studied for job scheduling and resource allocation in Grids [23]. As an important class of machine learning methods, RL solves uncertain decision-making problems by interacting with the environment, so that near-optimal or suboptimal policies can be obtained in a data-driven way [24]. RL therefore provides a model-free methodology and is very promising for overcoming the difficulties of Grid resource scheduling. According to their learning mechanisms, existing RL approaches to resource scheduling fall mainly into two types: one based on policy gradient learning algorithms [6], [25], [26] and the other on value-function-based learning algorithms [5], [23], [27]. However, the learning efficiency and scalability of existing RL methods for Grid resource allocation still need to be improved for large-scale Grid computing applications.
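As a rough illustration of the value-function idea behind the second class of methods, consider a scheduling agent that keeps one utility estimate per resource and updates it from observed job feedback. The Python sketch below uses a generic bandit-style running update and greedy selection; it is a simplification under stated assumptions, not the specific algorithm of [5], [23] or [27].

    # Minimal sketch of a value-function-based scheduling agent.
    # The reward signal and the running update are generic placeholders.
    class ValueSchedulingAgent:
        def __init__(self, num_resources, alpha=0.1):
            self.alpha = alpha                    # learning rate
            self.values = [0.0] * num_resources  # estimated utility per resource

        def select_resource(self):
            # Greedy selection: the resource with the highest estimated utility.
            return max(range(len(self.values)), key=self.values.__getitem__)

        def update(self, resource, reward):
            # Running update from observed feedback, e.g.
            # reward = -response_time of the job that just completed.
            self.values[resource] += self.alpha * (reward - self.values[resource])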

In this paper, to realize learning-based coordination and generalization in large-scale Grid environments, a novel multi-agent reinforcement learning method, called the ordinal sharing learning (OSL) method, is proposed to solve the job scheduling problem in Grid computing. In the OSL method, a fast distributed learning algorithm is designed based on an ordinal information-sharing mechanism. Compared with previous multi-agent RL (MARL) methods for job scheduling, the OSL method offers two innovations. First, it simplifies the modeling of optimal decision-making in job scheduling: only a utility table is learned online to estimate the efficiency of resources, instead of building a complex Grid Information System (GIS). Second, it circumvents the scalability and coordination problems through an efficient information-sharing mechanism with limited communication, in which an ordinal sharing strategy lets all agents share their utility tables and make decisions in turn. The proposed approach was evaluated in a simulated large-scale Grid computing environment, and the results show its validity and feasibility.
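To make the ordinal-sharing idea concrete, the following Python sketch shows one way such a mechanism could be organized: each scheduler receives the utility table from its predecessor, assigns its job to the apparently most efficient resource, charges the job's load against that entry, and passes the table on. All names here (Job, Scheduler, dispatch_round) are hypothetical; the actual OSL utility update is defined in Section 3 of the paper.

    from collections import namedtuple

    Job = namedtuple("Job", ["job_id", "load"])

    class Scheduler:
        def __init__(self, agent_id, num_resources):
            self.agent_id = agent_id
            self.utility = [0.0] * num_resources  # local utility table

        def schedule(self, job, shared_utility):
            # Adopt the predecessor's table, send the job to the resource
            # that currently looks most efficient, and charge the job's
            # load against that entry so the successor sees the assignment.
            self.utility = list(shared_utility)
            target = min(range(len(self.utility)), key=self.utility.__getitem__)
            self.utility[target] += job.load
            return target

    def dispatch_round(schedulers, jobs):
        # Agents act in a fixed (ordinal) order; the only communication is
        # the hand-off of one utility table from each agent to the next.
        shared = schedulers[0].utility
        assignments = []
        for scheduler, job in zip(schedulers, jobs):
            target = scheduler.schedule(job, shared)
            shared = scheduler.utility
            assignments.append((scheduler.agent_id, job.job_id, target))
        return assignments

Note that the communication cost per scheduling step is a single table hand-off, independent of the number of agents, which is what limits the overhead as the system scales.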

The remainder of this paper is organized as follows. Section 2 introduces a general model for job scheduling in Grid computing and discusses the performance measures. Section 3 discusses the basic ideas of multi-agent reinforcement learning and presents the OSL method for Grid job scheduling. Section 4 evaluates and compares different job scheduling methods in a simulated Grid computing environment; the results illustrate the effectiveness of the proposed method. Section 5 gives a further overview of related work. Finally, conclusions are drawn in Section 6.

Section snippets

A general job scheduling model in Grids

It is well known that the complexity of a general centralized scheduling problem is NP-Complete [28]. Due to this NP-Complete nature and the difficulty of proving the optimality of scheduling algorithms in Grid scenarios, current research generally seeks suboptimal solutions. Moreover, in this paper, to address the scalability problem, we consider a strategy in which decentralized schedulers, rather than a centralized scheduler, take charge of job scheduling simultaneously. To describe the
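Although the snippet above breaks off, a generic formulation consistent with such a model can be stated as follows; the notation is illustrative rather than the paper's own. Given jobs $J = \{j_1, \dots, j_m\}$ with computational lengths $l_i$ and resources $R = \{r_1, \dots, r_n\}$ with processing speeds $s_k$, an assignment $a : J \to R$ induces per-resource completion times and a makespan

$$T_k = \frac{1}{s_k} \sum_{i:\, a(j_i) = r_k} l_i, \qquad \mathrm{makespan}(a) = \max_{1 \le k \le n} T_k,$$

and load balancing amounts to choosing $a$ so that the $T_k$ are as nearly equal as possible, which also tends to minimize the makespan.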

The OSL method for adaptive job scheduling

As mentioned above, in practical large-scale Grid applications, even with the help of the GIS, the information about resources held by the schedulers is time-delayed and potentially inaccurate. It is therefore reasonable to develop a robust scheduling algorithm that does not depend on an accurate model. To satisfy the requirements of adaptive job scheduling, a coordinated multi-agent reinforcement learning method may be an appropriate solution. In the following, after an analysis of different MARL

Performance evaluation and discussions

In this section, the performance of the OSL-based Selection (OSLS) rule for job scheduling is evaluated and analyzed in simulations. In addition, the proposed OSLS method is compared with four other resource scheduling or selection rules: Decentralized Min–Min Selection (DMMS) [38], Random Selection (RS), Least Load Selection (LLS), and Simple Learning Selection (SLS) [5]. The Min–Min algorithm is a heuristic scheduling method that has become a benchmark scheduling algorithm for
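For orientation, the two simplest of these baselines can be sketched in a few lines of Python; both functions are illustrative reconstructions under the obvious assumptions, not code from the paper.

    import random

    def random_selection(resources):
        # RS: pick a resource uniformly at random, ignoring load information.
        return random.choice(resources)

    def least_load_selection(resources, load_estimate):
        # LLS: pick the resource with the smallest (possibly stale) load
        # estimate; load_estimate maps each resource to its reported load.
        return min(resources, key=lambda r: load_estimate[r])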

Related work

Several other works relate to the RL-based job scheduling problem in Grids. In [5], the SLS method was adopted for Grid job scheduling. However, the experimental results above show that the SLS method performs well only in some special cases, when the number of users is much larger than the number of resources. Moreover, its performance still needs to be improved.

In [6], the authors introduced a new gradient ascent learning algorithm named Weighted Policy Learner (WPL) for the

Conclusions

One of the key concerns of Grid computing is to develop autonomic computing systems that have the abilities of self-configuration and self-optimization in dynamic environments. In this paper, the OSL method based on multi-agent reinforcement learning is proposed to solve the job scheduling problem in Grids. This approach circumvents the scalability problem by using a distributed learning strategy, and achieves multi-agent coordination based on an ordinal information-sharing mechanism. Finally,

Acknowledgements

This work is supported in part by the National Natural Science Foundation of China under Grants 60774076 and 61075072, the Fok Ying Tung Youth Teacher Foundation under Grant 114005, and the Natural Science Foundation of Hunan Province under Grant 07JJ3122. We also thank the anonymous reviewers for their comments and recommendations, which have been crucial to improving the quality of this work.

References (40)

  • E. Cantillon et al., Auctioning bus routes: the London experience.
  • P. Gradwell et al., Markets vs. auctions: approaches to distributed combinatorial resource scheduling, Multiagent and Grid Systems (2005).
  • A. Galstyan et al., Resource allocation in the grid with learning agents, Journal of Grid Computing (2005).
  • S. Abdallah, V. Lesser, Learning the task allocation game, in: Proceedings of the Fifth AAMAS, Japan, May 8–12, 2006, ...
  • F.P. Dong, S.G. Akl, Scheduling algorithms for grid computing: state of the art and open problems, Technical Report No. ...
  • D. Thain et al., Distributed computing in practice: the Condor experience, Concurrency and Computation: Practice and Experience (2005).
  • Portable Batch System, 2009. ...
  • Sun Grid Engine, 2009. ...
  • K. Krauter et al., A taxonomy and survey of grid resource management systems for distributed computing, Software: Practice and Experience (2002).

Jun Wu received the B.Sc. and M.Sc. degrees in electrical engineering from the National University of Defense Technology, Changsha, China, in 2002 and 2005, respectively. He is currently working toward the Ph.D. degree at the Institute of Automation, National University of Defense Technology, China. His current research interests include reinforcement learning, autonomous agents and multi-agent systems, especially resource allocation and multi-robot control.

Xin Xu received the B.S. degree in control engineering from the Department of Automatic Control, National University of Defense Technology (NUDT), Changsha, PR China, in 1996, and the Ph.D. degree in electrical engineering from the College of Mechatronics and Automation (CMEA), NUDT. From 2003 to 2004, he was a Postdoctoral Fellow at the School of Computer, NUDT. In August 2006 and from September to October 2007, he was a visiting scholar for cooperative research at the Hong Kong Polytechnic University, Hong Kong, China, and the University of Strathclyde, UK, respectively. Currently, he is an Associate Professor at the Institute of Automation, CMEA, NUDT.

He has coauthored four books and published more than 50 papers in international journals and conferences, including IEEE Transactions on Neural Networks and the Journal of AI Research. His research interests include reinforcement learning, data mining, learning control, robotics, autonomic computing, and computer security.

Dr. Xu received the Excellent Ph.D. Dissertation Award from Hunan Province, PR China, in 2004 and the Fok Ying Tung Youth Teacher Fund of China in 2008. He has served as a PC member or session chair at many international conferences, and he is currently a reviewer for several journals, including several IEEE Transactions. He has been a grant reviewer for the National Natural Science Foundation of China since 2005.
