research-article

Flexible resource allocation for reliable virtual cluster computing systems

Authors: Thomas J. Hacker, Kanak Mahadik

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

Article No.: 48, Pages 1 - 12

https://doi.org/10.1145/2063384.2063448

Published: 12 November 2011

Abstract

Virtualization and cloud computing technologies now make it possible to create scalable and reliable virtual high performance computing clusters. Integrating these technologies, however, is complicated by fundamental differences in how these systems allocate resources to computational tasks. Cloud computing systems immediately allocate available resources or deny requests. In contrast, parallel computing systems route all requests through a queue for future resource allocation. This divergence in allocation policies hinders efforts to implement efficient, responsive, and reliable virtual clusters.

In this paper, we present a continuum of four scheduling policies along with an analytical resource prediction model for each policy to estimate the level of resources needed to operate an efficient, responsive, and reliable virtual cluster system. We show that it is possible to estimate the size of the virtual cluster system needed to provide a predictable grade of service for a realistic high performance computing workload and estimate the queue wait time for a partial or full resource allocation. Moreover, we show that it is possible to provide a reliable virtual cluster system using a limited pool of spare resources. The models and results we present are useful for cloud computing providers seeking to operate efficient and cost-effective virtual cluster systems.
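The abstract does not state the four models themselves. As a rough illustration of the contrast it draws between "allocate or deny" and "queue everything", the sketch below uses the classical Erlang-B (loss) and Erlang-C (delay) formulas from teletraffic theory. It assumes Poisson arrivals and that every request occupies exactly one server, which is a simplification and should not be read as the authors' model. The function names and the example load are hypothetical.

```python
def erlang_b(servers: int, offered_load: float) -> float:
    """Blocking probability when requests are denied if all servers are busy.

    offered_load is the arrival rate times the mean service time (in Erlangs).
    Uses the standard numerically stable recursion on the number of servers.
    """
    b = 1.0  # with zero servers every request is blocked
    for n in range(1, servers + 1):
        b = offered_load * b / (n + offered_load * b)
    return b


def erlang_c(servers: int, offered_load: float) -> float:
    """Probability that a request has to wait when blocked requests are queued."""
    if offered_load >= servers:
        raise ValueError("queue is unstable unless offered_load < servers")
    b = erlang_b(servers, offered_load)
    return servers * b / (servers - offered_load * (1.0 - b))


def min_servers_for_blocking(offered_load: float, target: float) -> int:
    """Smallest server count whose Erlang-B blocking probability is <= target."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    servers = 0
    while erlang_b(servers, offered_load) > target:
        servers += 1
    return servers


if __name__ == "__main__":
    load = 40.0  # hypothetical offered load in Erlangs
    size = min_servers_for_blocking(load, 0.01)
    print("servers needed for 1% blocking:", size)
    print("probability of waiting at that size:", erlang_c(size, load))
```

Sizing against a blocking target corresponds to the cloud-style policy, while the waiting probability corresponds to the queue-style policy; intermediate policies would need models that this sketch does not attempt.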



Published In

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

November 2011

866 pages

ISBN: 9781450307710

DOI: 10.1145/2063384

Conference Chair: Scott Lathrop (University of Chicago)

Program Chairs: Jim Costa (Sandia National Laboratories), William Kramer (National Center for Supercomputing Applications)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

SC '11: International Conference for High Performance Computing, Networking, Storage and Analysis

November 12 - 18, 2011

Seattle, Washington

Acceptance Rates

SC '11 Paper Acceptance Rate 74 of 352 submissions, 21%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
