Iaso: an autonomous fault-tolerant management system for supercomputers

Lu, Kai; Wang, Xiaoping; Li, Gen; Wang, Ruibo; Chi, Wanqing; Liu, Yongpeng; Tang, Hongwei; Feng, Hua; Gao, Yinghui

doi:10.1007/s11704-014-3503-1

Research Article
Published: 26 May 2014

Volume 8, pages 378–390, (2014)
Frontiers of Computer Science

Kai Lu¹,²,
Xiaoping Wang¹,²,
Gen Li²,
Ruibo Wang²,
Wanqing Chi²,
Yongpeng Liu²,
Hongwei Tang²,
Hua Feng² &
Yinghui Gao³

Abstract

As system scale increases, the inherent reliability of supercomputers steadily declines. The cost of fault handling and task recovery grows so rapidly that the reliability issue will soon harm the usability of supercomputers. This issue, referred to as the “reliability wall”, is regarded as a critical problem for current and future supercomputers. To address it, we propose an autonomous fault-tolerant system, named Iaso, in the MilkyWay-2 system. Iaso introduces the concept of autonomous management to supercomputers: the computer itself, rather than human operators, takes charge of fault management. Iaso automatically manages the whole lifecycle of faults, including fault detection, fault diagnosis, fault isolation, and task recovery. Iaso endows the MilkyWay-2 system with autonomous features such as self-awareness, self-diagnosis, self-healing, and self-protection. With the help of Iaso, the cost of fault handling in supercomputers is reduced from several hours to a few seconds. Iaso greatly improves the usability and reliability of the MilkyWay-2 system.

Author information

Authors and Affiliations

Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha, 410073, China
Kai Lu & Xiaoping Wang
College of Computer, National University of Defense Technology, Changsha, 410073, China
Kai Lu, Xiaoping Wang, Gen Li, Ruibo Wang, Wanqing Chi, Yongpeng Liu, Hongwei Tang & Hua Feng
ATR Laboratory, National University of Defense Technology, Changsha, 410073, China
Yinghui Gao

Corresponding author

Correspondence to Kai Lu.

Additional information

Kai Lu, PhD, professor, Deputy Dean of the College of Computer Science, National University of Defense Technology, China. His research interests include parallel and distributed system software, operating systems, parallel tool suites, and fault-tolerant computing technology.

Xiaoping Wang, PhD, assistant professor, National University of Defense Technology, China. His research interests include parallel computing, system software, and distributed computing.

Gen Li, PhD, assistant professor, National University of Defense Technology, China. His research interests include operating systems and system security.

Ruibo Wang, PhD, assistant professor, National University of Defense Technology, China. His research interests include operating systems and transactional memory.

Wanqing Chi, associate professor, National University of Defense Technology, China. His research interests include operating systems, system software, and high performance computing.

Yongpeng Liu, PhD, assistant professor, National University of Defense Technology, China. His research interests include power management, fault tolerance, and high performance computing.

Hongwei Tang, assistant professor, National University of Defense Technology, China. His research interests include operating systems, system firmware, high performance computing, distributed computing, and computer architecture.

Hua Feng, associate professor, National University of Defense Technology, China. His research interests include operating systems, system software, and computer architecture.

Yinghui Gao, PhD, associate professor, National University of Defense Technology, China. Her research interests include artificial intelligence, machine learning, and pattern recognition.

About this article

Cite this article

Lu, K., Wang, X., Li, G. et al. Iaso: an autonomous fault-tolerant management system for supercomputers. Front. Comput. Sci. 8, 378–390 (2014). https://doi.org/10.1007/s11704-014-3503-1

Received: 07 December 2013
Accepted: 06 March 2014
Published: 26 May 2014
Issue Date: June 2014
DOI: https://doi.org/10.1007/s11704-014-3503-1
