A Scalable Runtime Fault Localization Framework for High-Performance Computing Systems

International Journal of Parallel Programming

Abstract

Fault localization has become an increasingly challenging issue in high-performance computing (HPC) systems. Various techniques have been applied to HPC systems; however, as these systems scale out, the effectiveness of the existing techniques deteriorates rapidly. In this context, we propose a message-passing based fault localization framework, namely MPFL, which provides a lightweight distributed service using tree-based fault detection (TFD) and fault analysis (TFA) algorithms. In essence, MPFL serves as a fault localization engine within message-passing libraries by enabling system middleware components, such as the job scheduler, to provide abnormal information. We present details of the MPFL framework, including the implementation of TFD and TFA. Further, we develop a prototype of the fault localization engine within MVAPICH2. The experimental evaluation is performed on a typical HPC cluster with 10 computing nodes; the results demonstrate the capability of MPFL and show that the MPFL service does not affect the performance of an application in practice.
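The abstract does not spell out how TFD works, so the following is only a rough illustration of the general idea behind tree-based fault detection, not the authors' algorithm. It assumes that ranks are arranged in a k-ary tree, that each parent probes its children, and that findings are merged on the way up to the root. The names `children` and `detect`, the `fanout` parameter, and the `alive` set standing in for real probe replies are all hypothetical.

```python
from typing import List, Set


def children(rank: int, size: int, fanout: int = 2) -> List[int]:
    """Children of `rank` in a k-ary tree laid out over ranks 0..size-1."""
    first = rank * fanout + 1
    return [c for c in range(first, first + fanout) if c < size]


def detect(rank: int, size: int, alive: Set[int], fanout: int = 2) -> List[int]:
    """Return the failed ranks found in the subtree below `rank`.

    `alive` simulates which ranks would answer a probe (an assumption made
    for this sketch; a real service would exchange messages instead).
    A failed child is reported, and its own children are still probed so
    that one failure does not hide failures deeper in the tree.
    """
    failed: List[int] = []
    for child in children(rank, size, fanout):
        if child not in alive:
            failed.append(child)
        # Merge whatever is found below this child into the upward report.
        failed.extend(detect(child, size, alive, fanout))
    return failed


if __name__ == "__main__":
    size = 10                          # a 10-node cluster, as in the evaluation
    alive = set(range(size)) - {4, 7}  # simulate two unresponsive nodes
    print(sorted(detect(0, size, alive)))
```

With a tree layout, each node probes at most `fanout` peers, and a report reaches the root in a number of hops that grows logarithmically with the number of nodes, which is the usual motivation for choosing a tree over all-to-one monitoring.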

Acknowledgements

The MPFL project is an ongoing collaborative project between multiple teams from Jiangnan Institute of Computing Technology. We would like to thank every member for their constructive suggestions and contributions during the design phase of MPFL. This research was financially supported by the National Key Research and Development Program of China (Grant No. 2017YFB0202004).

Author information

Corresponding author

Correspondence to Jian Gao.

About this article

Cite this article

Gao, J., Wei, H., Yu, K. et al. A Scalable Runtime Fault Localization Framework for High-Performance Computing Systems. Int J Parallel Prog 46, 749–761 (2018). https://doi.org/10.1007/s10766-017-0526-x

  • DOI: https://doi.org/10.1007/s10766-017-0526-x
