DOI: 10.1145/2063384.2063448

Flexible resource allocation for reliable virtual cluster computing systems

Published: 12 November 2011

Abstract

Virtualization and cloud computing technologies now make it possible to create scalable and reliable virtual high performance computing clusters. Integrating these technologies, however, is complicated by fundamental differences in how these systems allocate resources to computational tasks. Cloud computing systems either allocate available resources immediately or deny the request. In contrast, parallel computing systems route all requests through a queue for future resource allocation. This divergence in allocation policies hinders efforts to implement efficient, responsive, and reliable virtual clusters.
In this paper, we present a continuum of four scheduling policies, along with an analytical resource prediction model for each policy, to estimate the level of resources needed to operate an efficient, responsive, and reliable virtual cluster system. We show that it is possible to estimate the size of the virtual cluster system needed to provide a predictable grade of service for a realistic high performance computing workload, and to estimate the queue wait time for a partial or full resource allocation. Moreover, we show that it is possible to provide a reliable virtual cluster system using a limited pool of spare resources. The models and results we present are useful for cloud computing providers seeking to operate efficient and cost-effective virtual cluster systems.
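
The contrast between the two allocation disciplines described above can be illustrated with the classical Erlang formulas from teletraffic theory. The sketch below is only an illustration under textbook assumptions (Poisson arrivals, identical single-node requests, a fixed offered load in Erlangs); it is not the paper's four-policy prediction model, and the function names are hypothetical. Erlang-B gives the probability that an allocate-or-deny system rejects a request, while Erlang-C gives the probability that a queueing system makes a request wait.

```python
def erlang_b(servers: int, offered_load: float) -> float:
    """Blocking probability of a loss system (allocate immediately or deny).

    Uses the standard recurrence B(0) = 1,
    B(k) = A * B(k-1) / (k + A * B(k-1)), where A is the offered load.
    """
    blocking = 1.0
    for k in range(1, servers + 1):
        blocking = offered_load * blocking / (k + offered_load * blocking)
    return blocking


def erlang_c(servers: int, offered_load: float) -> float:
    """Probability that a request must wait in a queueing system."""
    if offered_load >= servers:
        return 1.0  # the queue is unstable, so every request waits
    blocking = erlang_b(servers, offered_load)
    utilization = offered_load / servers
    return blocking / (1.0 - utilization * (1.0 - blocking))


def min_servers_for_blocking(offered_load: float, target: float) -> int:
    """Smallest pool size whose Erlang-B blocking does not exceed target."""
    servers = 1
    while erlang_b(servers, offered_load) > target:
        servers += 1
    return servers


if __name__ == "__main__":
    load = 1.0  # hypothetical offered load, in Erlangs
    print(erlang_b(2, load))                     # deny probability with 2 nodes
    print(erlang_c(2, load))                     # wait probability with 2 nodes
    print(min_servers_for_blocking(load, 0.25))  # nodes needed for <= 25% denial
```

Sizing a pool for a target grade of service then amounts to searching for the smallest pool whose blocking (or waiting) probability meets the target, which is what `min_servers_for_blocking` does for the loss-system case.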

Published In

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
November 2011
866 pages
ISBN:9781450307710
DOI:10.1145/2063384
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Conference

SC '11

Acceptance Rates

SC '11 Paper Acceptance Rate 74 of 352 submissions, 21%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Cited By

  • (2021) Accelerating Parallel Applications in Cloud Platforms via Adaptive Time-Slice Control. IEEE Transactions on Computers 70:7 (992-1005). DOI: 10.1109/TC.2020.2999619. Online publication date: 1-Jul-2021
  • (2019) Minimizing financial cost of scientific workflows under deadline constraints in multi-cloud environments. Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing (114-121). DOI: 10.1145/3297280.3297293. Online publication date: 8-Apr-2019
  • (2019) An energy-aware scheduling algorithm for big data applications in Spark. Cluster Computing. DOI: 10.1007/s10586-019-02947-9. Online publication date: 4-Jun-2019
  • (2018) Towards Efficient Resource Allocation for Heterogeneous Workloads in IaaS Clouds. IEEE Transactions on Cloud Computing 6:1 (264-275). DOI: 10.1109/TCC.2015.2481400. Online publication date: 1-Jan-2018
  • (2017) TPS: An Efficient VM Scheduling Algorithm for HPC Applications in Cloud. Green, Pervasive, and Cloud Computing (152-164). DOI: 10.1007/978-3-319-57186-7_13. Online publication date: 13-Apr-2017
  • (2016) Optimizing the Performance of Big Data Workflows in Multi-cloud Environments Under Budget Constraint. 2016 IEEE International Conference on Services Computing (SCC) (138-145). DOI: 10.1109/SCC.2016.25. Online publication date: Jun-2016
  • (2016) Dynamic Acceleration of Parallel Applications in Cloud Platforms by Adaptive Time-Slice Control. 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (343-352). DOI: 10.1109/IPDPS.2016.77. Online publication date: May-2016
  • (2016) Energy-Aware Dynamic Resource Allocation on Hadoop YARN Cluster. 2016 IEEE 18th International Conference on High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS) (364-371). DOI: 10.1109/HPCC-SmartCity-DSS.2016.0059. Online publication date: Dec-2016
  • (2016) Communication and cooling aware job allocation in data centers for communication-intensive workloads. Journal of Parallel and Distributed Computing 96:C (181-193). DOI: 10.1016/j.jpdc.2016.05.016. Online publication date: 1-Oct-2016
  • (2015) Energy-Aware Scheduling of MapReduce Jobs for Big Data Applications. IEEE Transactions on Parallel and Distributed Systems 26:10 (2720-2733). DOI: 10.1109/TPDS.2014.2358556. Online publication date: 1-Oct-2015