Skip to main content
Log in

Failure detection algorithm for Fail-Lagging model applied to HPC

  • Published:
The Journal of Supercomputing Aims and scope Submit manuscript

Abstract

It is essential to use fault tolerance techniques on exascale high-performance computing systems, but this faces many challenges such as higher probability of failure, more complex types of faults, and greater difficulty in failure detection. In this paper, we designed the Fail-Lagging model to describe HPC process-level failure. The failure model does not distinguish whether the failed process is crashed or slow, but is compatible with the possible behavior of the process due to various failures, such as crash, slow, recovery. The failure detection in Fail-Lagging model is implemented by local detection and global decision among processes, which depend on a robust and efficient communication topology. Robust means that failed processes do not easily corrupt the connectivity of the topology, and efficient means that the time complexity of the topology used for collective communication is as low as possible. For this purpose, we designed a torus-tree topology for failure detection, which is scalable even at the scale of an extremely large number of processes. The Fail-Lagging model supports common fault tolerance methods such as rollback, replication, redundancy, algorithm-based fault tolerance, etc. and is especially able to better enable the efficient forward recovery mode. We demonstrate with large-scale experiments that the torus-tree failure detection algorithm is robust and efficient, and we apply fault tolerance based on the Fail-Lagging model to iterative computation, enabling applications to react to faults in a timely manner.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9

Similar content being viewed by others

References

  1. Aguilera MK, Chen W, Toueg S (1998) Failure detection and consensus in the crash-recovery model. In: Kutten S (ed) Distributed Computing, 12th International Symposium, DISC ’98, Andros, Greece, September 24-26, Proceedings, Lecture Notes in Computer Science, vol 1499, pp 231–245. Springer. https://doi.org/10.1007/BFb0056486

  2. Albrecht JR, Tuttle C, Snoeren AC, Vahdat A (2006) Loose synchronization for large-scale networked systems. In: Adya A, Nahum EM (eds) Proceedings of the 2006 USENIX Annual Technical Conference, Boston, MA, USA, May 30–June 3, pp 301–314. USENIX. http://www.usenix.org/events/usenix06/tech/albrecht.html

  3. Angskun T, Bosilca G, Dongarra JJ (2007) Binomial graph: a scalable and fault-tolerant logical network topology. In: Stojmenovic I, Thulasiram RK, Yang LT, Jia W, Guo M, de Mello RF (eds) Parallel and Distributed Processing and Applications, 5th International Symposium, ISPA 2007, Niagara Falls, Canada, August 29–31, 2007, Proceedings, Lecture Notes in Computer Science, vol 4742, pp 471–482. Springer. https://doi.org/10.1007/978-3-540-74742-0_43

  4. Angskun T, Fagg GE, Bosilca G, Pjesivac-Grbovic J, Dongarra JJ (2006) Scalable fault tolerant protocol for parallel runtime environments. In: Mohr B, Träff JL, Worringen J, Dongarra JJ (eds) Recent Advances in Parallel Virtual Machine and Message Passing Interface, 13th European PVM/MPI User’s Group Meeting, Bonn, Germany, September 17–20, Proceedings, Lecture Notes in Computer Science, vol 4192, pp 141–149. Springer. https://doi.org/10.1007/11846802_25

  5. Arpaci-Dusseau RH, Arpaci-Dusseau AC (2001) Fail-stutter fault tolerance. In: Proceedings of HotOS-VIII: 8th Workshop on Hot Topics in Operating Systems, May 20–23, Elmau/Oberbayern, Germany, pp 33–38. IEEE Computer Society. https://doi.org/10.1109/HOTOS.2001.990058

  6. Bosilca G, Bouteiller A, Guermouche A, Hérault T, Robert Y, Sens P, Dongarra JJ (2016) Failure detection and propagation in HPC systems. In: West J, Pancake CM (eds) Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2016, Salt Lake City, UT, USA, November 13–18, pp 312–322. IEEE Computer Society. https://doi.org/10.1109/SC.2016.26

  7. Bosilca G, Bouteiller A, Guermouche A, Hérault T, Robert Y, Sens P, Dongarra JJ (2018) A failure detector for HPC platforms. Int J High Perform Comput Appl 32(1):139–158. https://doi.org/10.1177/1094342017711505

    Article  Google Scholar 

  8. Chandra TD, Toueg S (1996) Unreliable failure detectors for reliable distributed systems. J ACM 43(2):225–267. https://doi.org/10.1145/226643.226647

    Article  MathSciNet  MATH  Google Scholar 

  9. Chen Z, Dongarra JJ (2008) Algorithm-based fault tolerance for fail-stop failures. IEEE Trans Parallel Distrib Syst 19(12):1628–1641. https://doi.org/10.1109/TPDS.2008.58

    Article  Google Scholar 

  10. Dwork C, Lynch NA, Stockmeyer LJ (1984) Consensus in the presence of partial synchrony (preliminary version). In: Kameda T, Misra J, Peters JG, Santoro N (eds) Proceedings of the Third Annual ACM Symposium on Principles of Distributed Computing, Vancouver, B. C., Canada, August 27–29, pp 103–118. ACM. https://dl.acm.org/citation.cfm?id=1599406

  11. Egwutuoha IP, Levy D, Selic B, Chen S (2013) A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems. J Supercomput 65(3):1302–1326. https://doi.org/10.1007/s11227-013-0884-0

    Article  Google Scholar 

  12. Ferreira K, Stearley J, Laros JH, Oldfield R, Pedretti K, Brightwell R, Riesen R, Bridges PG, Arnold D (2011) Evaluating the viability of process replication reliability for exascale systems. In: SC ’11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, pp 1–12. https://doi.org/10.1145/2063384.2063443

  13. Graham N, Harary F, Livingston M, Stout QF (1993) Subcube fault-tolerance in hypercubes. Inf Comput 102(2):280–314. https://doi.org/10.1006/inco.1993.1010

    Article  MathSciNet  MATH  Google Scholar 

  14. Gunawi HS, Suminto RO, Sears R, Golliher C, Sundararaman S, Lin X, Emami T, Sheng W, Bidokhti N, McCaffrey C, Srinivasan D, Panda B, Baptist A, Grider G, Fields PM, Harms K, Ross RB, Jacobson A, Ricci R, Webb K, Alvaro P, Runesha HB, Hao M, Li H (2018) Fail-slow at scale: evidence of hardware performance faults in large production systems. ACM Trans Storage (TOS) 14(3):1–26. https://doi.org/10.1145/3242086

    Article  Google Scholar 

  15. Gupta S, Tiwari D, Jantzi C, Rogers JH, Maxwell D (2015) Understanding and exploiting spatial properties of system failures on extreme-scale HPC systems. In: 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2015, Rio de Janeiro, Brazil, June 22–25, pp 37–44. IEEE Computer Society. https://doi.org/10.1109/DSN.2015.52

  16. Hurfin M, Mostéfaoui A, Raynal M (1998) Consensus in asynchronous systems where processes can crash and recover. In: The Seventeenth Symposium on Reliable Distributed Systems, SRDS 1998, West Lafayette, Indiana, USA, October 20–22, Proceedings, pp 280–286. IEEE Computer Society. https://doi.org/10.1109/RELDIS.1998.740510

  17. Hursey J, Graham RL (2011) Building a fault tolerant mpi application: a ring communication example. In: 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum, pp 1549–1556. https://doi.org/10.1109/RELDIS.1998.740510

  18. Kharbas K, Kim D, Hoefler T, Mueller F (2012) Assessing HPC failure detectors for MPI jobs. In: Stotzka R, Schiffers M, Cotronis Y (eds) Proceedings of the 20th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, PDP 2012, Munich, Germany, February 15–17, pp 81–88. IEEE. https://doi.org/10.1109/PDP.2012.11

  19. Lamport L, Shostak RE, Pease MC (1982) The byzantine generals problem. ACM Trans Program Lang Syst 4(3):382–401. https://doi.org/10.1145/357172.357176

    Article  MATH  Google Scholar 

  20. Losada N, González P, Martín MJ, Bosilca G, Bouteiller A, Teranishi K (2020) Fault tolerance of MPI applications in exascale systems: the ULFM solution. Future Gener Comput Syst 106:467–481. https://doi.org/10.1016/j.future.2020.01.026

    Article  Google Scholar 

  21. Schlichting RD, Schneider FB (1983) Fail-stop processors: an approach to designing fault-tolerant computing systems. ACM Trans Comput Syst (TOCS) 1(3):222–238. https://doi.org/10.1145/357369.357371

    Article  Google Scholar 

  22. Sloan J, Kumar R, Bronevetsky G (2013) An algorithmic approach to error localization and partial recomputation for low-overhead fault tolerance. In: 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Budapest, Hungary, June 24–27, pp 1–12. IEEE Computer Society. https://doi.org/10.1109/DSN.2013.6575309

  23. Ye Y, Zhang Y, Ye W (2021) An application-level failure detection algorithm based on a robust and efficient torus-tree for HPC. In: 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), New York City, NY, USA, September 30–Oct. 3, pp 484–492. IEEE. https://doi.org/10.1109/ISPA-BDCloud-SocialCom-SustainCom52081.2021.00073

  24. Zhai J, Chen W (2018) A vision of post-exascale programming. Front Inf Technol Electron Eng 19(10):1261–1266. https://doi.org/10.1631/FITEE.1800442

    Article  Google Scholar 

  25. Zhong D, Bouteiller A, Luo X, Bosilca G (2019) Runtime level failure detection and propagation in HPC systems. In: Hoefler T, Träff JL (eds) Proceedings of the 26th European MPI Users’ Group Meeting, EuroMPI 2019, Zürich, Switzerland, September 11–13, pp 14:1–14:11. ACM. https://doi.org/10.1145/3343211.3343225

Download references

Acknowledgements

The authors would like to thank Jianfeng Zheng for interesting discussions related to this work. The work is partially supported by Key-Area Research and Development of Guangdong Province (No.2021B0101190003). The work is also partially supported by Guangdong Province Key Laboratory of Computational Science at the Sun Yat-sen University (2020B1212060032).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Yongdong Zhang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Ye, Y., Zhang, Y. & Ye, W. Failure detection algorithm for Fail-Lagging model applied to HPC. J Supercomput 78, 14009–14033 (2022). https://doi.org/10.1007/s11227-022-04347-0

Download citation

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11227-022-04347-0

Keywords

Navigation