ABSTRACT
Managing and administering large-scale HPC centers is a complicated problem. Using a number of independent tools to resolve its seemingly independent subproblems can become a bottleneck as the scale of systems, the number of hardware and software components, the variety of user applications and license types, and the number of users and workgroups all grow rapidly. The developed toolkit is designed to help resolve routine problems in mastering and administering any supercomputer center, from a stand-alone system up to top-rank supercomputer centers that include a number of completely different HPC systems. The toolkit implements a flexibly configurable set of essential tools in a single interface. It also provides automation for typical multi-step administration and management procedures. Another important design and implementation feature is that the toolkit can be installed and used without any significant changes to existing administration tools and system software. The toolkit is not integrated with the system software of the target machines: it runs on a remote server and executes scripts on the HPC systems via SSH as a dedicated user with limited access permissions, allowed to perform only certain actions. This greatly reduces the possibility of security issues and addresses many fault tolerance issues, which are among the key challenges on the road to Exascale. At the same time, it lets an administrator perform any operation with the tools appropriate to the situation, whether ours or any other available tool. Trial use of the developed system proved its practicality in an HPC center with several Petaflop-level supercomputers and thousands of active researchers from a variety of institutions working within several hundred applied projects.
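The remote-execution model described in the abstract (a server that runs a restricted set of scripts on HPC systems via SSH as a dedicated, limited user) might be sketched roughly as follows. This is an illustration under stated assumptions, not the toolkit's actual code: the action names, script paths, the `hpcadmin-tool` account, and the function names are all hypothetical.

```python
import shlex
import subprocess

# Hypothetical allow-list: action name -> script path on the HPC system.
# The paper does not list its actions; these entries are placeholders.
ALLOWED_ACTIONS = {
    "queue_status": "/opt/hpc-toolkit/bin/queue_status.sh",
    "node_health": "/opt/hpc-toolkit/bin/node_health.sh",
}


def build_ssh_command(host, action, args=(), user="hpcadmin-tool"):
    """Build the ssh argv for one permitted action; reject anything else."""
    if action not in ALLOWED_ACTIONS:
        raise PermissionError(f"action not permitted: {action}")
    # Quote every part so arguments cannot inject extra shell commands
    # on the remote side.
    parts = (ALLOWED_ACTIONS[action], *args)
    remote = " ".join(shlex.quote(p) for p in parts)
    # BatchMode=yes makes ssh fail instead of prompting for a password,
    # which suits unattended use with key-based authentication.
    return ["ssh", "-o", "BatchMode=yes", f"{user}@{host}", remote]


def run_action(host, action, args=(), timeout=60):
    """Run a permitted action remotely and return (exit code, stdout)."""
    cmd = build_ssh_command(host, action, args)
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return result.returncode, result.stdout
```

Because the server only ever issues commands from the allow-list, and the remote account itself has limited permissions, a fault or compromise on the management server has a bounded effect on the HPC systems, and administrators remain free to use any other tool alongside it.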
Index Terms
- Resolving frontier problems of mastering large-scale supercomputer complexes