DOI:10.1145/1272366.1272373
Article

Using queue structures to improve job reliability

Published: 25 June 2007

Abstract

Many high performance computing systems today exploit the availability and remarkable performance characteristics of stand-alone server systems and the impressive price/performance ratio of commodity components. Small-scale HPC systems, in the range of 16 to 64 processors, have enjoyed significant popularity and are an indispensable tool for the research community. Scaling up to hundreds and thousands of processors, however, has exposed operational issues, including system availability and reliability. In this paper, we explore the impact of individual component reliability rates on the overall reliability of an HPC system. We derive a mathematical model for determining the failure rate of the system and the probability of failure of a job running on a subset of the system, and we show how to design a reasonable queue structure to provide a reliable system over a broad job mix. We also explore the impact of reliability and queue structure on checkpoint intervals and recovery. Our results demonstrate that it is possible to design a reliable high performance computing system with very good operational reliability characteristics from a collection of moderately reliable components.
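The relationship between component reliability, job size, and queue limits can be illustrated with a minimal sketch. It is not the model derived in the paper: it assumes independent nodes with exponentially distributed lifetimes, and it uses Young's first-order approximation for the checkpoint interval. The function names, the node MTBF, the failure-probability target, and the checkpoint cost are all hypothetical values chosen for illustration.

```python
import math


def job_failure_probability(nodes: int, hours: float, node_mtbf_hours: float) -> float:
    """Probability that at least one of `nodes` fails within `hours`.

    Assumption: independent nodes with exponentially distributed lifetimes,
    so the job's aggregate failure rate is nodes / node_mtbf_hours.
    """
    rate = nodes / node_mtbf_hours
    return 1.0 - math.exp(-rate * hours)


def max_walltime_hours(nodes: int, node_mtbf_hours: float, max_failure_prob: float) -> float:
    """Longest run time that keeps the job failure probability at or below the target."""
    return -math.log(1.0 - max_failure_prob) * node_mtbf_hours / nodes


def young_checkpoint_interval(checkpoint_cost_hours: float, job_mtbf_hours: float) -> float:
    """Young's first-order approximation: sqrt(2 * checkpoint cost * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_hours * job_mtbf_hours)


if __name__ == "__main__":
    NODE_MTBF_HOURS = 50_000.0    # hypothetical per-node MTBF
    TARGET_FAILURE_PROB = 0.05    # hypothetical per-job failure budget
    CHECKPOINT_COST_HOURS = 0.1   # hypothetical time to write one checkpoint

    for nodes in (16, 64, 256, 1024):
        p24 = job_failure_probability(nodes, 24.0, NODE_MTBF_HOURS)
        limit = max_walltime_hours(nodes, NODE_MTBF_HOURS, TARGET_FAILURE_PROB)
        tau = young_checkpoint_interval(CHECKPOINT_COST_HOURS, NODE_MTBF_HOURS / nodes)
        print(f"{nodes:5d} nodes: P(fail in 24 h) = {p24:.3f}, "
              f"wall-clock limit = {limit:7.1f} h, checkpoint every {tau:5.2f} h")
```

Under these assumptions, a queue for larger jobs gets a shorter wall-clock limit and a shorter checkpoint interval, because the aggregate failure rate grows linearly with the number of nodes a job occupies.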


Published In

HPDC '07: Proceedings of the 16th international symposium on High performance distributed computing
June 2007
256 pages
ISBN:9781595936738
DOI:10.1145/1272366
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cluster design and architecture
  2. reliability

Qualifiers

  • Article

Conference

HPDC07

Acceptance Rates

Overall Acceptance Rate 166 of 966 submissions, 17%

Cited By

  • (2021) Cloud Dynamic Scheduling for Multimedia Data Encryption Using Tabu Search Algorithm. Wireless Personal Communications. DOI:10.1007/s11277-021-08562-5. Online publication date: 10-May-2021
  • (2015) FaBRiQ: Leveraging Distributed Hash Tables towards Distributed Publish-Subscribe Message Queues. 2015 IEEE/ACM 2nd International Symposium on Big Data Computing (BDC), pp. 11-20. DOI:10.1109/BDC.2015.42. Online publication date: Dec-2015
  • (2014) Reliability Guided Resource Allocation for Large-Scale Systems. Proceedings of the 2014 IEEE 6th International Conference on Cloud Computing Technology and Science, pp. 334-341. DOI:10.1109/CloudCom.2014.63. Online publication date: 15-Dec-2014
  • (2012) Assessing time coalescence techniques for the analysis of supercomputer logs. IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012), pp. 1-12. DOI:10.1109/DSN.2012.6263946. Online publication date: Jun-2012
  • (2012) A reliability model for cloud computing for high performance computing applications. Proceedings of the 18th international conference on Parallel processing workshops, pp. 474-483. DOI:10.1007/978-3-642-36949-0_53. Online publication date: 27-Aug-2012
  • (2011) Flexible resource allocation for reliable virtual cluster computing systems. Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-12. DOI:10.1145/2063384.2063448. Online publication date: 12-Nov-2011
  • (2011) Live Migration of Parallel Applications with OpenVZ. Proceedings of the 2011 IEEE Workshops of International Conference on Advanced Information Networking and Applications, pp. 526-531. DOI:10.1109/WAINA.2011.156. Online publication date: 22-Mar-2011
  • (2010) Toward a Reliable Cloud Computing Service. Cloud Computing and Software Services, pp. 139-152. DOI:10.1201/EBK1439803158-c6. Online publication date: 13-Jul-2010
  • (2009) An analysis of clustered failures on large supercomputing systems. Journal of Parallel and Distributed Computing 69:7, pp. 652-665. DOI:10.5555/1550962.1551189. Online publication date: 1-Jul-2009
