A Scalable Runtime Fault Localization Framework for High-Performance Computing Systems

Gao, Jian; Wei, Hongmei; Yu, Kang; Qing, Peng

doi:10.1007/s10766-017-0526-x

Published: 30 September 2017

Volume 46, pages 749–761, (2018)

International Journal of Parallel Programming

Jian Gao ORCID: orcid.org/0000-0001-8453-3119¹,
Hongmei Wei¹,
Kang Yu¹ &
…
Peng Qing¹

Abstract

Fault localization has become an increasingly challenging issue in high-performance computing (HPC) systems. Various techniques have been applied to HPC systems; however, as these systems scale out, the effectiveness of the existing techniques deteriorates rapidly. In this context, we propose a message-passing-based fault localization framework, MPFL, which provides a lightweight distributed service using tree-based fault detection (TFD) and fault analysis (TFA) algorithms. In essence, MPFL serves as a fault localization engine within message-passing libraries, enabling system middleware such as the job scheduler to supply information about abnormal events. We present details of the MPFL framework, including the implementation of TFD and TFA. Further, we develop a prototype of the fault localization engine within MVAPICH2. The experimental evaluation, performed on a typical HPC cluster with 10 computing nodes, demonstrates the capability of MPFL and shows that the MPFL service does not affect application performance in practice.
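The abstract does not spell out how TFD works, so the following is only a generic illustration of the idea behind tree-based fault detection, not the authors' algorithm. It assumes that ranks are laid out in a complete k-ary tree, that each parent probes its children, and that a parent probes past an unresponsive child so one failure does not hide the rest of a subtree. The names `children`, `detect`, `alive`, and `fanout` are hypothetical, and liveness is simulated with a set rather than real messages.

```python
from typing import List, Set


def children(rank: int, size: int, fanout: int = 2) -> List[int]:
    """Ranks directly below `rank` in a complete k-ary tree over 0..size-1."""
    first = rank * fanout + 1
    return [c for c in range(first, first + fanout) if c < size]


def detect(rank: int, size: int, alive: Set[int], fanout: int = 2) -> List[int]:
    """Return the unresponsive ranks in the subtree below `rank`.

    Assumption: `alive` stands in for the result of a real probe
    (for example a heartbeat message). The root is assumed to be alive.
    """
    suspects: List[int] = []
    for child in children(rank, size, fanout):
        if child not in alive:
            # The child did not answer the probe: report it upward.
            suspects.append(child)
        # Keep walking the subtree even below a failed child, so that
        # a single failure does not mask its descendants.
        suspects.extend(detect(child, size, alive, fanout))
    return suspects


if __name__ == "__main__":
    # Ten ranks, echoing the 10-node cluster size; ranks 4 and 7 are down.
    size = 10
    alive = set(range(size)) - {4, 7}
    print(sorted(detect(0, size, alive)))
```

In a layout like this each rank probes at most `fanout` peers, which is the usual motivation for a tree over all-to-all probing; whether MPFL uses this exact layout is not stated in the abstract.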

Acknowledgements

The MPFL project is an ongoing collaborative project among multiple teams at the Jiangnan Institute of Computing Technology. We would like to thank every member for their constructive suggestions and contributions during the design phase of MPFL. This research was financially supported by the National Key Research and Development Program of China (Grant No. 2017YFB0202004).

Author information

Authors and Affiliations

Jiangnan Institute of Computing Technology, Wuxi, 214083, Jiangsu, China
Jian Gao, Hongmei Wei, Kang Yu & Peng Qing

Corresponding author

Correspondence to Jian Gao.

About this article

Cite this article

Gao, J., Wei, H., Yu, K. et al. A Scalable Runtime Fault Localization Framework for High-Performance Computing Systems. Int J Parallel Prog 46, 749–761 (2018). https://doi.org/10.1007/s10766-017-0526-x

Received: 26 August 2017
Accepted: 20 September 2017
Published: 30 September 2017
Issue Date: August 2018
DOI: https://doi.org/10.1007/s10766-017-0526-x
