Multi-stage resource-aware scheduling for data centers with heterogeneous servers

Tran, Tony T.; Padmanabhan, Meghana; Zhang, Peter Yun; Li, Heyse; Down, Douglas G.; Beck, J. Christopher

doi:10.1007/s10951-017-0537-x

Published: 20 July 2017

Volume 21, pages 251–267 (2018)

Journal of Scheduling

Tony T. Tran¹,
Meghana Padmanabhan¹,
Peter Yun Zhang²,
Heyse Li¹,
Douglas G. Down³ &
J. Christopher Beck¹

Abstract

This paper presents a three-stage algorithm for resource-aware scheduling of computational jobs in a large-scale heterogeneous data center. The algorithm aims to allocate job classes to machine configurations to attain an efficient mapping between job resource request profiles and machine resource capacity profiles. The first stage uses a queueing model that treats the system in an aggregated manner with pooled machines and jobs represented as a fluid flow. The latter two stages use combinatorial optimization techniques to solve a shorter-term, more accurate representation of the problem using the first-stage, long-term solution for heuristic guidance. In the second stage, jobs and machines are discretized. A linear programming model is used to obtain a solution to the discrete problem that maximizes the system capacity given a restriction on the job class and machine configuration pairings based on the solution of the first stage. The final stage is a scheduling policy that uses the solution from the second stage to guide the dispatching of arriving jobs to machines. We present experimental results of our algorithm on both Google workload trace data and generated data and show that it outperforms existing schedulers. These results illustrate the importance of considering heterogeneity of both job and machine configuration profiles in making effective scheduling decisions.
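To make the third stage more concrete, the following is a minimal sketch of a dispatching rule guided by job class and machine configuration pairings. It is not the authors' policy: the names `Machine`, `fits`, `dispatch`, and `allowed`, the two-resource profiles, and the first-fit rule are all assumptions made for illustration. It only shows how a restriction on pairings, such as the one passed down from the earlier stages, can narrow the set of machines considered for an arriving job.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    """A machine with a configuration label and remaining capacity per resource."""
    config: str
    free: dict  # resource name -> remaining capacity


def fits(request: dict, machine: Machine) -> bool:
    """True if every requested resource amount is still available on the machine."""
    return all(machine.free.get(res, 0.0) >= amount for res, amount in request.items())


def dispatch(job_class, request, machines, allowed):
    """First-fit dispatch restricted to the configurations allowed for job_class.

    Returns the index of the chosen machine, or None if the job has to wait.
    """
    permitted = allowed.get(job_class, set())
    for index, machine in enumerate(machines):
        if machine.config in permitted and fits(request, machine):
            for res, amount in request.items():
                machine.free[res] -= amount  # reserve the requested resources
            return index
    return None


# Hypothetical data: two machine configurations and two job classes.
machines = [
    Machine("cpu_heavy", {"cpu": 8.0, "mem": 4.0}),
    Machine("mem_heavy", {"cpu": 2.0, "mem": 16.0}),
]
allowed = {"compute": {"cpu_heavy"}, "cache": {"mem_heavy"}}

first = dispatch("compute", {"cpu": 6.0, "mem": 1.0}, machines, allowed)
second = dispatch("compute", {"cpu": 6.0, "mem": 1.0}, machines, allowed)
third = dispatch("cache", {"cpu": 1.0, "mem": 8.0}, machines, allowed)
print(first, second, third)  # prints: 0 None 1
```

In this toy run the second `compute` job is not placed: the only permitted configuration no longer has enough CPU, and the `mem_heavy` machine is excluded by the pairing even though it is idle. Whether such a job should wait or fall back to another configuration is a design choice that this sketch does not settle.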

Notes

Earlier work on our algorithm, which appeared at the Multidisciplinary International Scheduling Conference: Theory and Applications (MISTA) 2015, presented a comparison only to the Greedy policy. We have extended that paper by improving our algorithm, adding a comparison to the Tetris scheduler, and significantly expanding the experimentation.
It may be beneficial to consider the dominant resource classification of Dominant Resource Fairness when creating such an ordering (Ghodsi et al. 2011).
The data can be found at https://code.google.com/p/googleclusterdata/.
We examine the impact of processing time variation in subsequent experiments (see Sect. 5.4.3).
Note that \(\lambda ^*\) represents an upper bound on the system load that can be handled. The bound may not be tight depending on the fragmentation of resources on a machine and/or the inefficiencies in the scheduling model used.
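The upper-bound character of \(\lambda ^*\) can be illustrated with a cruder calculation than the paper's model. The sketch below assumes that all machines are pooled into a single capacity per resource, and that jobs of each class arrive as a fixed share `alpha` of the total rate, hold a fixed amount of each resource, and are served at rate `mu`. The function name `pooled_load_bound`, these parameters, and the numbers are assumptions for illustration, not values or notation taken from the article. Because pooling ignores how resources fragment across individual machines, the resulting rate can only overestimate what an actual scheduler sustains.

```python
def pooled_load_bound(capacity, classes):
    """Largest total arrival rate a fully pooled fluid system could sustain.

    capacity: resource name -> total capacity summed over all machines.
    classes:  list of (alpha, mu, request) tuples, where alpha is the share of
              arrivals, mu the service rate, and request maps a resource name
              to the amount a job of that class holds while in service.
    """
    bound = float("inf")
    for res, total in capacity.items():
        # Resource-time consumed per unit of total arrival rate.
        demand = sum(alpha * req.get(res, 0.0) / mu for alpha, mu, req in classes)
        if demand > 0:
            bound = min(bound, total / demand)
    return bound


# Hypothetical system: pooled capacities and two job classes.
capacity = {"cpu": 100.0, "mem": 200.0}
classes = [
    (0.5, 1.0, {"cpu": 2.0, "mem": 1.0}),
    (0.5, 0.5, {"cpu": 1.0, "mem": 4.0}),
]
bound = pooled_load_bound(capacity, classes)
print(round(bound, 2))  # prints: 44.44
```

Here memory is the binding resource: the CPU constraint alone would allow a rate of 50, while memory limits it to 200 / 4.5.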

References

Al-Azzoni, I., & Down, D. G. (2008). Linear programming-based affinity scheduling of independent tasks on heterogeneous computing systems. IEEE Transactions on Parallel and Distributed Systems, 19(12), 1671–1682.
Andradóttir, S., Ayhan, H., & Down, D. G. (2003). Dynamic server allocation for queueing networks with flexible servers. Operations Research, 51(6), 952–968.
Berral, J. L., Goiri, Í., Nou, R., Julià, F., Guitart, J., Gavaldà, R., & Torres, J. (2010). Towards energy-aware scheduling in data centers using machine learning. In Proceedings of the 1st international conference on energy-efficient computing and networking (pp. 215–224). ACM.
Dai, J. G., & Meyn, S. P. (1995). Stability and convergence of moments for multiclass queueing networks via fluid limit models. IEEE Transactions on Automatic Control, 40(11), 1889–1904.
Gandhi, A., Harchol-Balter, M., & Kozuch, M. A. (2012). Are sleep states effective in data centers? In International green computing conference (IGCC) (pp. 1–10). IEEE.
Ghodsi, A., Zaharia, M., Hindman, B., Konwinski, A., Shenker, S., & Stoica, I. (2011). Dominant resource fairness: Fair allocation of multiple resource types. In Proceedings of the 8th USENIX conference on networked systems design and implementation (Vol. 11, pp. 323–336).
Grandl, R., Ananthanarayanan, G., Kandula, S., Rao, S., & Akella, A. (2014). Multi-resource packing for cluster schedulers. In Proceedings of the 2014 ACM conference on SIGCOMM (pp. 455–466). ACM.
Guazzone, M., Anglano, C., & Canonico, M. (2012). Exploiting vm migration for the automated power and performance management of green cloud computing systems. In Energy efficient data centers (Vol. 7396, pp. 81–92). Springer.
Guenter, B., Jain, N., & Williams, C. (2011). Managing cost, performance, and reliability tradeoffs for energy-aware server provisioning. In INFOCOM, 2011 proceedings IEEE (pp. 1332–1340). IEEE.
He, Y.-T., & Down, D. G. (2008). Limited choice and locality considerations for load balancing. Performance Evaluation, 65(9), 670–687.
Isard, M., Prabhakaran, V., Currey, J., Wieder, U., Talwar, K., & Goldberg, A. (2009). Quincy: Fair scheduling for distributed computing clusters. In Proceedings of the ACM SIGOPS 22nd symposium on operating systems principles (pp. 261–276). ACM.
Jain, R., Chiu, D.-M., & Hawe, W. (1984). A quantitative measure of fairness and discrimination for resource allocation in shared computer systems. In Digital equipment corporation research technical report TR-301 (pp. 1–37).
Kim, J.-K., Shivle, S., Siegel, H. J., Maciejewski, A. A., Braun, T. D., Schneider, M., et al. (2007). Dynamically mapping tasks with priorities and multiple deadlines in a heterogeneous environment. Journal of Parallel and Distributed Computing, 67(2), 154–169.
Le, K., Bianchini, R., Zhang, J., Jaluria, Y., Meng, J., & Nguyen, T. D. (2011). Reducing electricity cost through virtual machine placement in high performance computing clouds. In Proceedings of the international conference for high performance computing, networking, storage and analysis (p. 22). ACM.
Liu, Z., Lin, M., Wierman, A., Low, S. H., & Andrew, L. L. H. (2011). Greening geographical load balancing. In Proceedings of the ACM SIGMETRICS joint international conference on measurement and modeling of computer systems (pp. 233–244). ACM.
Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129–137.
Maguluri, S. T., Srikant, R., & Ying, L. (2012a). Heavy traffic optimal resource allocation algorithms for cloud computing clusters. In Proceedings of the 24th international teletraffic congress (p. 25). International Teletraffic Congress.
Maguluri, S. T., Srikant, R., & Ying, L. (2012b). Stochastic models of load balancing and scheduling in cloud computing clusters. In Proceedings IEEE INFOCOM (pp. 702–710). IEEE.
Mann, Z. Á. (2015). Allocation of virtual machines in cloud data centers–A survey of problem models and optimization algorithms. ACM Computing Surveys, 48(1), 1–31.
Mishra, A. K., Hellerstein, J. L., Cirne, W., & Das, C. R. (2010). Towards characterizing cloud backend workloads: Insights from Google compute clusters. ACM SIGMETRICS Performance Evaluation Review, 37(4), 34–41.
Ousterhout, K., Wendell, P., Zaharia, M., & Stoica, I. (2013). Sparrow: Distributed, low latency scheduling. In Proceedings of the twenty-fourth ACM symposium on operating systems principles (pp. 69–84). ACM.
Rasooli, A., & Down, D. G. (2014). COSHH: A classification and optimization based scheduler for heterogeneous Hadoop systems. Future Generation Computer Systems, 36, 1–15.
Reiss, C., Tumanov, A., Ganger, G. R., Katz, R. H., & Kozuch, M. A. (2012). Heterogeneity and dynamicity of clouds at scale: Google trace analysis. In Proceedings of the third ACM symposium on cloud computing (pp. 1–13). ACM.
Salehi, M. A., Krishna, P. R., Deepak, K. S., & Buyya, R. (2012). Preemption-aware energy management in virtualized data centers. In 2012 IEEE 5th international conference on cloud computing (CLOUD) (pp. 844–851). IEEE.
Tang, Q., Gupta, S. K. S., & Varsamopoulos, G. (2007). Thermal-aware task scheduling for data centers through minimizing heat recirculation. In IEEE international conference on cluster computing (pp. 129–138). IEEE.
Tarplee, K. M., Friese, R., Maciejewski, A. A., Siegel, H. J., & Chong, E. K. P. (2016). Energy and makespan tradeoffs in heterogeneous computing systems using efficient linear programming techniques. IEEE Transactions on Parallel and Distributed Systems, 27(6), 1633–1646.
Terekhov, D., Tran, T. T., Down, D. G., & Beck, J. C. (2014). Integrating queueing theory and scheduling for dynamic scheduling problems. Journal of Artificial Intelligence Research, 50, 535–572.
Wang, L., Von Laszewski, G., Dayal, J., He, X., Younge, A. J., & Furlani, T. R. (2009). Towards thermal aware workload scheduling in a data center. In 2009 10th international symposium on pervasive systems, algorithms, and networks (ISPAN) (pp. 116–122). IEEE.
Zaharia, M., Borthakur, D., Sen Sarma, J., Elmeleegy, K., Shenker, S., & Stoica, I. (2010). Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In Proceedings of the 5th European conference on computer systems (pp. 265–278). ACM.

Acknowledgements

This work was made possible in part by a Google Research Award and the Natural Sciences and Engineering Research Council of Canada (NSERC). We also wish to thank the referees for their insightful comments and for providing directions for additional work, which has resulted in this paper.

Author information

Authors and Affiliations

Department of Mechanical and Industrial Engineering, University of Toronto, Toronto, Canada
Tony T. Tran, Meghana Padmanabhan, Heyse Li & J. Christopher Beck
Engineering Systems Division, Massachusetts Institute of Technology, Cambridge, MA, USA
Peter Yun Zhang
Department of Computing and Software, McMaster University, Hamilton, Canada
Douglas G. Down

Corresponding author

Correspondence to Tony T. Tran.

About this article

Cite this article

Tran, T.T., Padmanabhan, M., Zhang, P.Y. et al. Multi-stage resource-aware scheduling for data centers with heterogeneous servers. J Sched 21, 251–267 (2018). https://doi.org/10.1007/s10951-017-0537-x

Issue Date: April 2018
