Using queue structures to improve job reliability

Authors:

Thomas J. Hacker,

Zdzislaw Meglicki

HPDC '07: Proceedings of the 16th international symposium on High performance distributed computing

Pages 43 - 54

https://doi.org/10.1145/1272366.1272373

Published: 25 June 2007

Abstract

Many high performance computing systems today exploit the availability and remarkable performance characteristics of stand-alone server systems and the impressive price/performance ratio of commodity components. Small-scale HPC systems, in the range of 16 to 64 processors, have enjoyed significant popularity and are an indispensable tool for the research community. Scaling up to hundreds and thousands of processors, however, has exposed operational issues, including system availability and reliability. In this paper, we explore the impact of individual component reliability rates on the overall reliability of an HPC system. We derive a mathematical model for determining the failure rate of the system and the probability of failure of a job running on a subset of the system, and we show how to design a reasonable queue structure that provides a reliable system over a broad job mix. We also explore the impact of reliability and queue structure on checkpoint intervals and recovery. Our results demonstrate that it is possible to design a high performance computing system with very good operational reliability characteristics from a collection of moderately reliable components.
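The abstract does not reproduce the paper's model, so the following is only a rough sketch of the kind of calculation it describes, not the authors' derivation. It assumes independent nodes with constant (exponential) failure rates, which is a simplifying assumption of this example, and it uses Young's well-known first-order approximation for the checkpoint interval. All function and parameter names are hypothetical.

```python
import math


def job_failure_probability(node_mtbf_hours, nodes, runtime_hours):
    """Probability that at least one of `nodes` fails during the run.

    Assumption (not from the paper): nodes fail independently with
    exponentially distributed times to failure, so a job on n nodes
    sees an aggregate failure rate of n / node_mtbf_hours.
    """
    job_rate = nodes / node_mtbf_hours
    return 1.0 - math.exp(-job_rate * runtime_hours)


def max_runtime_for_target(node_mtbf_hours, nodes, max_failure_prob):
    """Longest runtime that keeps the job failure probability at or
    below `max_failure_prob`, under the same exponential assumption.

    A queue limit of this form is one hypothetical way to tie the
    allowed wall-clock time to the job's node count.
    """
    return -math.log(1.0 - max_failure_prob) * node_mtbf_hours / nodes


def young_checkpoint_interval(node_mtbf_hours, nodes, checkpoint_cost_hours):
    """Young's first-order approximation: sqrt(2 * cost * job MTBF)."""
    job_mtbf = node_mtbf_hours / nodes
    return math.sqrt(2.0 * checkpoint_cost_hours * job_mtbf)
```

Under these assumptions, a 100-node job on nodes with a 10,000-hour MTBF has a job-level MTBF of 100 hours, and `young_checkpoint_interval(10000, 100, 0.5)` suggests checkpointing roughly every 10 hours. Larger jobs see proportionally higher failure rates, which is why a per-queue runtime limit that shrinks with node count can hold the failure probability roughly constant across a job mix.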

Index Terms

Using queue structures to improve job reliability
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
2. General and reference
  1. Cross-computing tools and techniques
    1. Reliability


Published In

HPDC '07: Proceedings of the 16th international symposium on High performance distributed computing

June 2007

256 pages

ISBN:9781595936738

DOI:10.1145/1272366

General Chair: Carl Kesselman (USC/ISI)

Program Chairs: Jack Dongarra (University of Tennessee), David Walker (University of Cardiff)

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

HPDC07

Sponsor:

HPDC07: International Symposium on High Performance Distributed Computing

June 25 - 29, 2007

Monterey, California, USA

Acceptance Rates

Overall Acceptance Rate 166 of 966 submissions, 17%

Cited By

Jayapandian N (2021). Cloud Dynamic Scheduling for Multimedia Data Encryption Using Tabu Search Algorithm. Wireless Personal Communications. Online publication date: 10-May-2021.
https://doi.org/10.1007/s11277-021-08562-5

Sadooghi I, Wang K, Patel D, Zhao D, Li T, Srivastava S, Raicu I (2015). FaBRiQ: Leveraging Distributed Hash Tables towards Distributed Publish-Subscribe Message Queues. 2015 IEEE/ACM 2nd International Symposium on Big Data Computing (BDC) (11-20). Online publication date: Dec-2015.
https://doi.org/10.1109/BDC.2015.42

Umamaheshwaran S, Hacker T (2014). Reliability Guided Resource Allocation for Large-Scale Systems. Proceedings of the 2014 IEEE 6th International Conference on Cloud Computing Technology and Science (334-341). Online publication date: 15-Dec-2014.
https://dl.acm.org/doi/10.1109/CloudCom.2014.63

Di Martino C, Cinque M, Cotroneo D (2012). Assessing time coalescence techniques for the analysis of supercomputer logs. IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012) (1-12). Online publication date: Jun-2012.
https://doi.org/10.1109/DSN.2012.6263946

Thanakornworakij T, Nassar R, Leangsuksun C, Păun M (2012). A reliability model for cloud computing for high performance computing applications. Proceedings of the 18th international conference on Parallel processing workshops (474-483). Online publication date: 27-Aug-2012.
https://dl.acm.org/doi/10.1007/978-3-642-36949-0_53

Hacker T, Mahadik K, Lathrop S, Costa J, Kramer W (2011). Flexible resource allocation for reliable virtual cluster computing systems. Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (1-12). Online publication date: 12-Nov-2011.
https://dl.acm.org/doi/10.1145/2063384.2063448

Romero F, Hacker T (2011). Live Migration of Parallel Applications with OpenVZ. Proceedings of the 2011 IEEE Workshops of International Conference on Advanced Information Networking and Applications (526-531). Online publication date: 22-Mar-2011.
https://dl.acm.org/doi/10.1109/WAINA.2011.156

Hacker T (2010). Toward a Reliable Cloud Computing Service. Cloud Computing and Software Services (139-152). Online publication date: 13-Jul-2010.
https://doi.org/10.1201/EBK1439803158-c6

Hacker T, Romero F, Carothers C (2009). An analysis of clustered failures on large supercomputing systems. Journal of Parallel and Distributed Computing 69:7 (652-665). Online publication date: 1-Jul-2009.
https://dl.acm.org/doi/10.5555/1550962.1551189
