Proactive fault tolerance for HPC with Xen virtualization

Authors:

Arun Babu Nagarajan,

Christian Engelmann,

Stephen L. Scott

ICS '07: Proceedings of the 21st annual international conference on Supercomputing

Pages 23 - 32

https://doi.org/10.1145/1274971.1274978

Published: 17 June 2007

Abstract

Large-scale parallel computing increasingly relies on clusters with thousands of processors. At such large counts of compute nodes, faults are becoming commonplace. Current fault tolerance techniques focus on reactive schemes that recover from faults and generally rely on a checkpoint/restart mechanism. Yet, in today's systems, node failures can often be anticipated by detecting a deteriorating health status.

Instead of a reactive scheme for fault tolerance (FT), we promote a proactive one in which processes automatically migrate from "unhealthy" nodes to healthy ones. Our approach relies on operating system virtualization techniques exemplified by, but not limited to, Xen. This paper contributes an automatic and transparent mechanism for proactive FT for arbitrary MPI applications. It leverages virtualization techniques combined with health monitoring and load-based migration. We exploit Xen's live migration mechanism for a guest operating system (OS) to migrate an MPI task from a health-deteriorating node to a healthy one without stopping the MPI task during most of the migration. Our proactive FT daemon orchestrates the tasks of health monitoring, load determination, and initiation of guest OS migration. Experimental results demonstrate that live migration hides migration costs and limits the overhead to only a few seconds, making it an attractive approach to realizing FT in HPC systems. Overall, our enhancements make proactive FT a valuable asset for long-running MPI applications. It is complementary to reactive FT using full checkpoint/restart schemes, since checkpoint frequencies can be reduced as fewer unanticipated failures are encountered. In the context of OS virtualization, we believe that this is the first comprehensive study of proactive fault tolerance where live migration is actually triggered by health monitoring.
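The daemon's three duties named in the abstract (health monitoring, load determination, and initiation of guest OS migration) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The `NodeStatus` record, the temperature-based health test, the `TEMP_LIMIT_C` threshold, and the rule that a target must be a healthy node without a guest are all assumptions made here for illustration; the abstract does not specify any of them. The sketch only plans migrations and does not invoke Xen.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

# Hypothetical threshold; the abstract gives no concrete value.
TEMP_LIMIT_C = 70.0


@dataclass
class NodeStatus:
    name: str
    temperature_c: float   # assumed health indicator (sensor reading)
    load: float            # assumed load metric, lower means less busy
    guest: Optional[str]   # guest OS hosted on this node, if any


def is_healthy(node: NodeStatus) -> bool:
    """Health monitoring: a node is healthy while below the assumed limit."""
    return node.temperature_c < TEMP_LIMIT_C


def pick_target(nodes: List[NodeStatus], taken: Set[str]) -> Optional[NodeStatus]:
    """Load determination: least-loaded healthy spare node not yet reserved."""
    candidates = [
        n for n in nodes
        if is_healthy(n) and n.guest is None and n.name not in taken
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda n: n.load)


def plan_migrations(nodes: List[NodeStatus]) -> List[Tuple[str, str, str]]:
    """Return (guest, source, target) for each guest on an unhealthy node."""
    plan: List[Tuple[str, str, str]] = []
    taken: Set[str] = set()
    for node in nodes:
        if node.guest is None or is_healthy(node):
            continue
        target = pick_target(nodes, taken)
        if target is None:
            # No spare left; reactive checkpoint/restart would have to cover this.
            continue
        taken.add(target.name)
        plan.append((node.guest, node.name, target.name))
    return plan


if __name__ == "__main__":
    cluster = [
        NodeStatus("n1", 85.0, 1.0, "vm1"),   # deteriorating node
        NodeStatus("n2", 50.0, 0.9, "vm2"),
        NodeStatus("n3", 45.0, 0.4, None),    # spare
        NodeStatus("n4", 40.0, 0.1, None),    # spare, least loaded
    ]
    for guest, src, dst in plan_migrations(cluster):
        print(f"migrate {guest}: {src} -> {dst}")
```

In a real deployment each planned triple would be handed to the hypervisor's live migration facility, and a guest left without a target would fall back to the reactive checkpoint/restart path that the abstract describes as complementary.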

Published In

ICS '07: Proceedings of the 21st annual international conference on Supercomputing

June 2007

315 pages

ISBN: 9781595937681

DOI: 10.1145/1274971

General Chair:
Burton Smith
Microsoft

Copyright © 2007 ACM.

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ICS07

Sponsor:

SIGARCH

ICS07: International Conference on Supercomputing

June 17 - 21, 2007

Seattle, Washington

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%
