Modelling and developing conflict-aware scheduling on large-scale data centres
Introduction
Data centres nowadays process a massive number of jobs every day. On the one hand, resource demand, together with the commercial success of Cloud computing, has become the major driving force for cloud providers to increase the size of their clusters. On the other hand, in order to handle jobs efficiently, Cloud giants such as Google, Microsoft and Amazon have developed various cluster management frameworks for their production clusters [1]. One conventional approach is to deploy a centralized scheduler that manages all the diverse types of job submitted to the cluster. However, because of the massive number of jobs and the complexity of making scheduling decisions for some types of job [2], a centralized scheduler becomes the performance bottleneck in delivering resources and processing jobs in a timely manner. A recent trend is therefore to deploy multiple independent schedulers in a cluster. Different schedulers make scheduling decisions simultaneously for different types of job, aiming to improve throughput and cluster utilization. These independently working schedulers are termed distributed schedulers in the literature [3], [4].
In data centres, a job typically runs in a resource container, examples of which are Docker containers on Linux and virtual machines [5], [6]. A resource container consumes certain amounts of various types of resource, such as CPU, memory, storage and bandwidth. A scheduling decision determines which physical machine a resource container should run on. Since distributed schedulers make their scheduling decisions independently, different schedulers may decide to place their resource containers on the same physical machine, and the total resource demand of these containers may exceed the capacity of that machine. This situation is called a scheduling (or resource) conflict between distributed schedulers. It has been shown that scheduling conflicts account for a crucial part of the performance penalty of distributed schedulers, since the schedulers may spend a long time rescheduling jobs because of conflicts rather than putting the jobs into execution.
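The conflict condition described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `Placement` type, the `find_conflicts` function, the scheduler and machine names, and the capacity figures are all hypothetical and chosen only for illustration. A machine is reported as conflicted when the summed demand of the containers placed on it by more than one scheduler exceeds its capacity in at least one resource type.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Placement:
    scheduler: str  # hypothetical scheduler identifier
    machine: str    # hypothetical machine identifier
    demand: dict    # resource type -> requested amount


def find_conflicts(placements, capacity):
    """Map each over-committed machine to the schedulers that placed containers on it."""
    total = defaultdict(lambda: defaultdict(float))
    schedulers = defaultdict(set)
    for p in placements:
        schedulers[p.machine].add(p.scheduler)
        for res, amount in p.demand.items():
            total[p.machine][res] += amount

    conflicts = {}
    for machine, used in total.items():
        # Over-committed if the summed demand exceeds capacity in any resource.
        over = any(used[res] > capacity[machine].get(res, 0.0) for res in used)
        # A conflict between schedulers needs at least two of them on the machine.
        if over and len(schedulers[machine]) > 1:
            conflicts[machine] = sorted(schedulers[machine])
    return conflicts


# Illustrative numbers only (assumed, not taken from the paper).
capacity = {"m1": {"cpu": 8, "memory": 32}, "m2": {"cpu": 8, "memory": 32}}
placements = [
    Placement("batch", "m1", {"cpu": 6, "memory": 8}),
    Placement("service", "m1", {"cpu": 4, "memory": 8}),
    Placement("service", "m2", {"cpu": 4, "memory": 16}),
]
print(find_conflicts(placements, capacity))  # {'m1': ['batch', 'service']}
```

In this example the two schedulers together request 10 CPU units on `m1`, which offers only 8, so one of the two placements would have to be rescheduled.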
The scheduling conflict problem has been recognized, and measures have been taken in the literature to resolve conflicts. For example, Omega [3] accepts a job, which consists of a set of tasks, if any one of its tasks has been accepted, but the scheduler keeps rescheduling those tasks that conflict with other schedulers. Apollo [4] implements a waiting queue on each machine so that conflicting tasks do not have to be rejected immediately and returned to the schedulers for rescheduling. If tasks are held in the waiting queue for too long, the scheduler duplicates them and schedules the duplicates to other machines. Both systems focus on resolving conflicts once they occur, not on reducing them. As a result, straggler tasks may arise and increase the makespan of the whole job. Moreover, holding tasks in the waiting queues of machines and duplicating tasks cost additional resources. Further, because the scheduler releases resources only after all tasks of a job have completed, the tasks of the job that complete early still hold their resources and reduce resource utilization. Therefore, although prior works have demonstrated the effectiveness of resolving conflicts, these approaches neither provide an explicit method to reduce conflicts nor develop conflict-aware scheduling policies and resource management mechanisms.
In this work, we investigate the performance penalty incurred by scheduling conflicts and quantify the relation between conflicts and the number of requested resources. We then develop a game-theoretical solution for cluster schedulers to improve job performance. Finally, we evaluate the proposed methodology with both simulation experiments and real experiments on a cluster built on Amazon Web Services (AWS).
The remainder of this paper is organized as follows. Section 2 discusses the background and motivation of this work. Section 3 presents our conflict-aware scheduling strategies based on game-theoretical modelling. Section 4 evaluates the performance and effectiveness of the proposed modelling approach and scheduling strategies. Section 5 discusses related work. Finally, Section 6 concludes the paper.
Section snippets
Motivation: scheduling and workload characteristics in data centres
In this section, we look into the details of distributed cluster scheduling and investigate the possible trade-off between job performance and conflict cost in the shared cluster environment. We also discuss the opportunities and challenges in improving job performance in such a shared environment.
Conflict-aware scheduling strategies
A shared cluster contains multiple autonomous schedulers, which compete for shared but limited resources and have an incentive to request more resource containers to increase their QoS, as discussed in Section 2.3. Based on this assumption, the scheduling scenario in a shared cluster can be modelled as a non-cooperative game. This is because we assume that the distributed schedulers (players) are not willing to sacrifice their own benefits when playing their strategies. If it is modelled as the
Performance evaluation
In this section, we compare our NE strategy with other strategies. The performance evaluation was conducted with both simulations and a real cluster deployed on AWS.
Trace-driven simulator. To evaluate the proposed framework with a wide range of parameters, we built a trace-driven simulator, which can perform scheduling at the scale of the Google production cluster. We use the publicly available Google trace [17], [20] in the experiments. The trace collects the detailed job information
Related work
This paper considers distributed schedulers in data centres, a relatively new research issue faced by cloud platforms of increasingly large scale. Traditional and existing schedulers also consider similar scheduling conflicts, but they do so in a different context [37]. For example, when different tasks are scheduled to co-run on different CPU cores in a multicore computer, they may interfere with each other since they share (and compete for) the common
Conclusion
In data centres, distributed schedulers schedule different types of job independently. This paper presents a game-theoretical framework for distributed schedulers that is aware of performance targets and resource competition. The proposed scheduling strategies are derived from game theory and strategically adjust the scheduling policies for incoming jobs according to the performance target and the behaviour of other competitors. We formalize the expected number of
Acknowledgements
This work is partially supported by the EU Horizon 2020 - Marie Sklodowska-Curie Actions through the project entitled Computer Vision Enabled Multimedia Forensics and People Identification (Project No. 690907, Acronym: IDENTITY), Guangzhou City Science Foundation (Project No. 201510010275), PAPD and CICAEET.
References (51)
- et al., Developing resource consolidation frameworks for moldable virtual machines in clouds, Future Gener. Comput. Syst. (2014)
- et al., A rapid learning algorithm for vehicle classification, Inf. Sci. (2015)
- et al., Incremental learning for ν-support vector regression, Neural Netw. (2015)
- et al., Developing the cloud-integrated data replication framework in decentralized online social networks, J. Comput. Syst. Sci. (2016)
- et al., Asymptotic scheduling for many task computing in big data platforms, Inf. Sci. (2015)
- et al., Resource-aware hybrid scheduling algorithm in heterogeneous distributed computing, Future Gener. Comput. Syst. (2015)
- et al., Noncooperative load balancing in distributed systems, J. Parallel Distrib. Comput. (2005)
- et al., Large-scale cluster management at Google with Borg
- et al., A hybrid chemical reaction optimization scheme for task scheduling on heterogeneous computing systems, IEEE Trans. Parallel Distrib. Syst. (2015)
- et al., Omega: flexible, scalable schedulers for large compute clusters
- Apollo: scalable and coordinated scheduling for cloud-scale computing
- Incremental support vector learning for ordinal regression, IEEE Trans. Neural Netw. Learn. Syst.
- Image segmentation by generalized hierarchical Fuzzy C-means algorithm, J. Intell. Fuzzy Syst.
- Fast motion estimation based on content property for low-complexity H.265/HEVC encoder, IEEE Trans. Broadcast
- Efficient motion and disparity estimation optimization for low complexity multiview video coding, IEEE Trans. Broadcast
- Achieving efficient cloud search services: multi-keyword ranked search over encrypted cloud data supporting parallel computing, IEICE Transactions on Communications
- Social network and tag sources based augmenting collaborative recommender system, IEICE Trans. Inf. & Syst.
- Fuxi: a fault-tolerant resource management and job scheduling system at internet scale, Proceedings of the VLDB Endowment
- Towards understanding heterogeneous clouds at scale: Google trace analysis, Intel Science and Technology Center for Cloud Computing, Tech. Rep.
- A fund-constrained investment scheme for profit maximization in cloud computing, IEEE Transactions on Services Computing
- Spark: cluster computing with working sets
- Jockey: guaranteed job latency in data parallel clusters
- Reoptimizing data parallel computing
Bin Wang is a Master's student in the School of Computer Science and Electronic Engineering, Hunan University, China. His research areas are Cloud computing and mobile computing.
Chao Chen received his Ph.D. degree from the Department of Computer Science, University of Warwick, UK. He is now a research fellow in the Warwick Manufacturing Group at the University of Warwick. His research area is Cloud computing.
Ligang He received the Ph.D. degree in Computer Science from the University of Warwick, United Kingdom, and worked as a post-doctoral researcher at the University of Cambridge, UK. From 2006, he worked in the Department of Computer Science at the University of Warwick as Assistant Professor and then Associate Professor. His research interests focus on parallel and distributed processing, and Cluster, Grid and Cloud computing. He has published more than 90 papers in international conferences and journals, such as IEEE Transactions on Parallel and Distributed Systems, IPDPS, CCGrid and MASCOTS. He has been a co-chair or a member of the program committee for a number of international conferences, and has served as a reviewer for many international journals, including IEEE Transactions on Parallel and Distributed Systems and IEEE Transactions on Computers. He is a member of the IEEE.
Bo Gao received the Ph.D. degree in Computer Science at the University of Warwick. He is now a post-doctoral researcher in the Systems Biology Centre at Warwick University. His research area is mobile computing and Cloud computing.
Jiadong Ren received the Ph.D. degree from the School of Computer Science and Technology at the Harbin Institute of Technology, China, and worked as a post-doctoral researcher at the Beijing Institute of Technology, China. From 1989, he worked in the Department of Computer Science at Yanshan University as Associate Professor. In 2004, he worked as a visiting scholar at the University of Zurich. He is now a full professor and Dean of the School of Computer Science and Engineering at Yanshan University. His research interests are in the areas of Cloud computing and Data mining. He has published more than 80 papers in international conferences and journals. He is a senior member of the China Computer Federation (CCF).
Zhangjie Fu received his Ph.D. in computer science from the College of Computer, Hunan University, China, in 2012. He is currently an Associate Professor at the College of Computer and Software, Nanjing University of Information Science and Technology, China. His research interests include Cloud & Outsourcing Security, Digital Forensics, Network and Information Security. His research has been supported by NSFC, PAPD, and GYHY. Zhangjie is a member of IEEE, and a member of ACM.
Songling Fu received the BS degree from the Department of Electronic Science and Technology, Harbin Institute of Technology, Harbin, China, in 2001, and the MS and Ph.D. degrees in computer science and technology from the National University of Defense Technology, Changsha, China, in 2003 and 2014, respectively. In 2014, he joined the Department of Electronic Information Engineering at Hunan Normal University as an Assistant Professor. His research interests include parallel and distributed computing, big data and robot operating systems. He is a member of the IEEE.
Yongjian Hu received the Ph.D. degree in communication and information systems from South China University of Technology in 2002. Now he is a full professor in the School of Electronic and Information Engineering, South China University of Technology. From 2011 to 2013, he was a Marie Curie Fellow in the Department of Computer Science, University of Warwick, UK. From 2006 to 2008, he worked as a research professor in the Department of Computer Science, Korea Advanced Institute of Science and Technology (KAIST), South Korea. From 2005 to 2006, he worked as a research professor in the School of Information and Communication Engineering, SungKyunKwan University, South Korea. Between 2000 and 2004, he visited the Department of Computer Science, City University of Hong Kong four times as a research assistant, senior research associate, and research fellow, respectively. Dr. Hu has been a senior member of IEEE since 2009. He is also a senior member of Chinese Institute of Electronics (CIE) and a senior member of China Computer Federation (CCF). He has published more than 60 peer reviewed papers since 2000. His research interests include information hiding, multimedia security and machine learning.
Chang-Tsun Li received the BEng degree in electrical engineering from National Defence University (NDU), Taiwan, in 1987, the MSc degree in computer science from U.S. Naval Postgraduate School, USA, in 1992, and the Ph.D. degree in computer science from the University of Warwick, UK, in 1998. He was an associate professor in the Department of Electrical Engineering at NDU during 1998–2002 and a visiting professor in the Department of Computer Science at U.S. Naval Postgraduate School in the second half of 2001. He was a professor in the Department of Computer Science at the University of Warwick, UK, until Dec 2016. He is currently a professor in the School of Computing and Mathematics, Charles Sturt University, Australia, leading the Data Science Research Unit. His research interests include multimedia forensics and security, biometrics, data mining, machine learning, data analytics, computer vision, image processing, pattern recognition, bioinformatics, and content-based image retrieval. The outcomes of his multimedia forensics and machine learning research have been translated into award-winning commercial products protected by a series of international patents and have been used by a number of police forces and courts of law around the world. He is currently Associate Editor of the EURASIP Journal of Image and Video Processing (JIVP) and Associate Editor of IET Biometrics. He has been involved in the organisation of many international conferences and workshops and has also served as a member of the international program committees for several international conferences. He also actively contributes keynote speeches and talks at various international events.