Abstract
Fault localization has become an increasingly challenging issue in high-performance computing (HPC) systems. Various techniques have been applied to HPC systems; however, as these systems scale out, the effectiveness of existing techniques deteriorates rapidly. In this context, we propose a message-passing based fault localization framework, named MPFL, which provides a lightweight distributed service using tree-based fault detection (TFD) and tree-based fault analysis (TFA) algorithms. In essence, MPFL serves as a fault localization engine within message-passing libraries, enabling system middleware such as the job scheduler to report abnormal information. We present the details of the MPFL framework, including the implementation of TFD and TFA. Further, we develop a prototype of the fault localization engine within MVAPICH2. The experimental evaluation, performed on a typical HPC cluster with 10 computing nodes, demonstrates the capability of MPFL and shows that the MPFL service does not affect application performance in practice.
Acknowledgements
The MPFL project is an ongoing collaborative project between multiple teams from the Jiangnan Institute of Computing Technology. We would like to thank every member for their constructive suggestions and contributions during the design phase of MPFL. This research was financially supported by the National Key Research and Development Program of China (Grant No. 2017YFB0202004).
About this article
Cite this article
Gao, J., Wei, H., Yu, K. et al. A Scalable Runtime Fault Localization Framework for High-Performance Computing Systems. Int J Parallel Prog 46, 749–761 (2018). https://doi.org/10.1007/s10766-017-0526-x
DOI: https://doi.org/10.1007/s10766-017-0526-x