DOI: 10.1145/2903150.2903481
poster

Resolving frontier problems of mastering large-scale supercomputer complexes

Published: 16 May 2016

ABSTRACT

Managing and administering large-scale HPC centers is a complicated problem. Using a number of independent tools to resolve its seemingly independent subproblems can become a bottleneck as the scale of systems, the number of hardware and software components, the variety of user applications and license types, and the number of users and workgroups all grow rapidly. The developed toolkit is designed to help resolve routine problems in managing and administering any supercomputer center, from a stand-alone system up to top-rank supercomputer centers that include a number of totally different HPC systems. The toolkit implements a flexibly configurable variety of essential tools in a single interface. It also provides means of automating typical multi-step administration and management procedures. Another important design and implementation feature is that the toolkit can be installed and used without any significant changes to existing administration tools and system software. The toolkit is not integrated with the system software of the target machines: it runs on a remote server and executes scripts on the HPC systems via SSH as a dedicated user with limited access permissions to perform certain actions. This greatly reduces the possibility of security issues and addresses many fault tolerance issues, which are among the key challenges on the road to Exascale. At the same time, it lets the administrator perform any operation with whichever tool suits the situation, whether one of ours or any other available tool. Trial operation of the developed system proved its practicality in an HPC center with a number of Petaflop-level supercomputers and thousands of active researchers from a diversity of institutions working within several hundred applied projects.
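The remote-execution design described in the abstract (a management server that runs scripts on HPC systems over SSH as a dedicated, restricted user) can be illustrated with a minimal sketch. This is not the toolkit's actual code: the user name `hpcadmin`, the script paths, the action names, and the helper functions `build_ssh_command` and `run_action` are all hypothetical and chosen only for illustration. The sketch assumes an OpenSSH client on the management server and key-based authentication for the dedicated user.

```python
import shlex
import subprocess

# Hypothetical whitelist: action name -> script path on the HPC system.
# The abstract only says the dedicated user may "perform certain actions";
# these names and paths are assumptions made for this example.
ALLOWED_ACTIONS = {
    "queue_status": "/opt/hpc-admin/bin/queue_status.sh",
    "node_health": "/opt/hpc-admin/bin/node_health.sh",
}


def build_ssh_command(host, action, args=(), user="hpcadmin"):
    """Build the ssh argument list for one whitelisted action."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"action not permitted: {action}")
    # Quote every part so arguments cannot inject extra shell commands
    # on the remote side.
    remote = " ".join(shlex.quote(p) for p in (ALLOWED_ACTIONS[action], *args))
    return [
        "ssh",
        "-o", "BatchMode=yes",       # never prompt for a password
        "-o", "ConnectTimeout=10",   # fail fast if the system is unreachable
        f"{user}@{host}",
        remote,
    ]


def run_action(host, action, args=(), user="hpcadmin", timeout=60):
    """Run a whitelisted script remotely; return (exit code, stdout, stderr)."""
    cmd = build_ssh_command(host, action, args, user)
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return result.returncode, result.stdout, result.stderr
```

Because the management server only issues SSH commands, nothing has to be installed into the system software of the target machine, and a failure of the management server leaves the HPC system itself untouched. In a real deployment the restriction would also be enforced on the target side (for example through the dedicated user's file permissions), since a client-side whitelist alone is not a security boundary.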

    • Published in

      CF '16: Proceedings of the ACM International Conference on Computing Frontiers
      May 2016
      487 pages
      ISBN: 9781450341288
      DOI: 10.1145/2903150
      • General Chairs: Gianluca Palermo, John Feo
      • Program Chairs: Antonino Tumeo, Hubertus Franke

      Copyright © 2016 Owner/Author

      Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      CF '16 Paper Acceptance Rate: 30 of 94 submissions, 32%. Overall Acceptance Rate: 240 of 680 submissions, 35%.