poster

Resolving frontier problems of mastering large-scale supercomputer complexes

Authors: Dmitry Nikitenko, Vladimir Voevodin, Sergey Zhumatiy

CF '16: Proceedings of the ACM International Conference on Computing Frontiers

Pages 349 - 352

https://doi.org/10.1145/2903150.2903481

Published: 16 May 2016

Abstract

Managing and administering large-scale HPC centers is a complicated problem. Using a number of independent tools to resolve its seemingly independent subproblems can become a bottleneck as the scale of systems, the number of hardware and software components, the variety of user applications and license types, the number of users and workgroups, and so on grow rapidly. The developed tool is designed to help resolve routine problems in mastering and administering any supercomputer center, from a stand-alone system up to top-rank supercomputer centers that include a number of totally different HPC systems. The toolkit implements a flexibly configurable variety of essential tools in a single interface. It also features useful means of automation for typical multi-step administration and management procedures. Another important design and implementation feature is that the toolkit can be installed and used without any significant changes to existing administration tools and system software. The developed tool is not integrated with the system software of the target machines: it runs on a remote server and executes scripts on the HPC systems via SSH as a dedicated user with limited access permissions to perform certain actions. This greatly reduces the possibility of security issues and takes care of many fault tolerance issues, which are among the key challenges on the road to Exascale. At the same time, it lets administrators perform any operation with the tools appropriate to the situation, whether ours or any other available tool. Trial operation of the developed system proved its practicality in an HPC center with several Petaflop-level supercomputers and thousands of active researchers from a diversity of institutions working within several hundred applied projects.
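The remote-execution model described in the abstract (a management server that runs scripts on HPC systems via SSH as a dedicated user with limited permissions) can be illustrated with a minimal sketch. This is not the authors' implementation: the account name `hpcadmin`, the script paths, the action names, and the host name are all hypothetical, and the whitelist is only one possible way to restrict what the dedicated user may trigger.

```python
import shlex
import subprocess

# Hypothetical whitelist: the only scripts the dedicated account may run.
ALLOWED_ACTIONS = {
    "add_user": "/opt/hpc-admin/bin/add_user.sh",
    "block_user": "/opt/hpc-admin/bin/block_user.sh",
}


def build_ssh_command(host, action, args, user="hpcadmin"):
    """Return the argv list that runs one whitelisted script over SSH."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"action not permitted: {action}")
    # Quote every argument so the remote shell sees it as a single word.
    remote = " ".join([ALLOWED_ACTIONS[action], *(shlex.quote(a) for a in args)])
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", f"{user}@{host}", remote]


def run_action(host, action, args, timeout=30):
    """Run the action on the target system; return (exit code, stdout, stderr)."""
    cmd = build_ssh_command(host, action, args)
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return result.returncode, result.stdout, result.stderr
```

A call such as `run_action("cluster1.example.org", "add_user", ["alice"])` would then execute only the whitelisted script on the target system. Because nothing is installed into the system software of the target machine, administrators remain free to use any other tool alongside it.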



Index Terms

Resolving frontier problems of mastering large-scale supercomputer complexes
1. Information systems
  1. Information systems applications
    1. Enterprise information systems
      1. Enterprise applications



Published In

CF '16: Proceedings of the ACM International Conference on Computing Frontiers

May 2016

487 pages

ISBN:9781450341288

DOI:10.1145/2903150

General Chairs: Gianluca Palermo (Politecnico di Milano, IT), John Feo (Pacific Northwest National Laboratory and Northwest Institute for Advanced Computing)

Program Chairs: Antonino Tumeo (Pacific Northwest National Laboratory, USA), Hubertus Franke (New York University and IBM Research, USA)

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Poster

Conference

CF'16

Sponsors: Micron Foundation, ACM, Politecnico di Milano, SIGMICRO, IBM

CF'16: Computing Frontiers Conference

May 16 - 19, 2016

Como, Italy

Acceptance Rates

CF '16 Paper Acceptance Rate 30 of 94 submissions, 32%;

Overall Acceptance Rate 273 of 785 submissions, 35%


