DOI: 10.1145/2742854.2742903
poster

Programmer-directed partial redundancy for resilient HPC

Published: 06 May 2015

Abstract

In this work we propose partial task replication and checkpointing for task-parallel HPC applications to mitigate silent data corruption (SDC) errors. Because replicating every application task can be prohibitively expensive in resources, we introduce a programmer-directed selective replication mechanism that provides fault tolerance at lower cost. Results show that our scheme detects and corrects around 65% of SDC errors with only 4% overhead on average.
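The idea of replicating only programmer-selected tasks can be illustrated with a minimal Python sketch. It is not the authors' implementation, which targets a task-parallel HPC runtime. It assumes a pure task that returns a comparable result, and the names `replicated` and `make_flaky_sum` are hypothetical. The decorator checkpoints the task inputs, runs two replicas, and on a mismatch re-executes from the checkpoint and takes the majority result.

```python
import copy
from functools import wraps


def replicated(task):
    """Hypothetical marker: protect `task` with replication plus re-execution."""
    @wraps(task)
    def wrapper(*args, **kwargs):
        # Checkpoint: keep a pristine copy of the inputs so every
        # replica starts from the same state.
        checkpoint = copy.deepcopy((args, kwargs))

        def run_replica():
            a, kw = copy.deepcopy(checkpoint)
            return task(*a, **kw)

        first, second = run_replica(), run_replica()
        if first == second:
            return first  # replicas agree: no error detected
        # Mismatch detected: run a third replica from the checkpoint
        # and accept the result that two of the three runs agree on.
        third = run_replica()
        if third == first or third == second:
            return third
        raise RuntimeError("all replicas disagree; error cannot be corrected")
    return wrapper


def make_flaky_sum(faulty_call):
    """Hypothetical task factory: the sum is corrupted on one chosen call."""
    calls = {"n": 0}

    def flaky_sum(values):
        calls["n"] += 1
        total = sum(values)
        # Simulate a silent data corruption on the chosen call only.
        return total + 1 if calls["n"] == faulty_call else total
    return flaky_sum


# Only tasks the programmer marks are replicated; others run once.
protected_sum = replicated(make_flaky_sum(faulty_call=1))
unprotected_sum = make_flaky_sum(faulty_call=1)
```

In this sketch an unmarked task pays no replication cost but lets a corrupted result through, while a marked task runs at least twice, which is the cost and coverage trade-off the abstract describes.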

Published In

CF '15: Proceedings of the 12th ACM International Conference on Computing Frontiers
May 2015
413 pages
ISBN: 9781450333580
DOI: 10.1145/2742854
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Poster

Funding Sources

  • European Community's Seventh Framework Programme

Conference

CF'15
CF'15: Computing Frontiers Conference
May 18 - 21, 2015
Ischia, Italy

Acceptance Rates

CF '15 Paper Acceptance Rate 33 of 96 submissions, 34%;
Overall Acceptance Rate 273 of 785 submissions, 35%
