
Proactive fault tolerance for HPC with Xen virtualization

Published: 17 June 2007

Abstract

Large-scale parallel computing increasingly relies on clusters with thousands of processors. At such large counts of compute nodes, faults are becoming commonplace. Current techniques to tolerate faults focus on reactive schemes that recover from faults and generally rely on a checkpoint/restart mechanism. Yet, in today's systems, node failures can often be anticipated by detecting a deteriorating health status.
Instead of a reactive scheme for fault tolerance (FT), we are promoting a proactive one where processes automatically migrate from "unhealthy" nodes to healthy ones. Our approach relies on operating system virtualization techniques exemplified by, but not limited to, Xen. This paper contributes an automatic and transparent mechanism for proactive FT for arbitrary MPI applications. It leverages virtualization techniques combined with health monitoring and load-based migration. We exploit Xen's live migration mechanism for a guest operating system (OS) to migrate an MPI task from a health-deteriorating node to a healthy one without stopping the MPI task during most of the migration. Our proactive FT daemon orchestrates the tasks of health monitoring, load determination and initiation of guest OS migration. Experimental results demonstrate that live migration hides migration costs and limits the overhead to only a few seconds, making it an attractive approach to realize FT in HPC systems. Overall, our enhancements make proactive FT a valuable asset for long-running MPI applications that is complementary to reactive FT using full checkpoint/restart schemes, since checkpoint frequencies can be reduced as fewer unanticipated failures are encountered. In the context of OS virtualization, we believe that this is the first comprehensive study of proactive fault tolerance where live migration is actually triggered by health monitoring.
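The daemon's three duties named in the abstract (health monitoring, load determination, initiation of guest OS migration) can be illustrated with a small Python sketch. This is not the paper's implementation. The `NodeStatus` record, the temperature threshold, the use of temperature as the health signal, and the rule that only guest-free nodes are valid targets are all assumptions made for illustration. The sketch only builds the live-migration command (`xm migrate --live`, the Xen management tool of that era) and does not execute it.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed threshold; the abstract does not state which sensors or limits are used.
TEMP_THRESHOLD_C = 70.0


@dataclass
class NodeStatus:
    """Hypothetical snapshot of one node as seen by the FT daemon."""
    host: str
    temperature_c: float   # stand-in for a health sensor reading
    load: float            # stand-in for a load metric
    guest: Optional[str]   # name of the guest OS hosted here, if any


def is_unhealthy(node: NodeStatus) -> bool:
    # Health monitoring: flag a node whose reading crosses the assumed threshold.
    return node.temperature_c >= TEMP_THRESHOLD_C


def plan_migrations(nodes: List[NodeStatus]) -> List[Tuple[str, str, List[str]]]:
    """Return (source, target, command) for every guest on an unhealthy node."""
    # Load determination: healthy nodes without a guest are candidate targets.
    spares = [n for n in nodes if n.guest is None and not is_unhealthy(n)]
    plans = []
    for node in nodes:
        if node.guest is None or not is_unhealthy(node):
            continue
        if not spares:
            break  # no healthy target left; fall back to reactive FT
        target = min(spares, key=lambda n: n.load)  # least-loaded candidate
        spares.remove(target)
        # Migration initiation: build the command without running it.
        command = ["xm", "migrate", "--live", node.guest, target.host]
        plans.append((node.host, target.host, command))
    return plans


if __name__ == "__main__":
    cluster = [
        NodeStatus("n1", 85.0, 1.0, "vm1"),
        NodeStatus("n2", 40.0, 0.9, "vm2"),
        NodeStatus("s1", 45.0, 0.3, None),
        NodeStatus("s2", 42.0, 0.1, None),
    ]
    for source, target, command in plan_migrations(cluster):
        print(source, "->", target, ":", " ".join(command))
```

Keeping planning separate from execution lets the decision logic be checked without a Xen host. A real daemon would poll sensors periodically and would need to handle the case where no healthy target remains, which is where the complementary checkpoint/restart scheme applies.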



Published In

ICS '07: Proceedings of the 21st annual international conference on Supercomputing
June 2007
315 pages
ISBN:9781595937681
DOI:10.1145/1274971
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. high-performance computing
  2. proactive fault tolerance
  3. virtualization

Qualifiers

  • Article

Conference

ICS07
Sponsor:
ICS07: International Conference on Supercomputing
June 17 - 21, 2007
Seattle, Washington

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%
