Elsevier

Journal of Computational Science

Volume 10, September 2015, Pages 327-337

Deployment and testing of the sustained petascale Blue Waters system

https://doi.org/10.1016/j.jocs.2015.03.007

Highlights

  • The article presents experiences from the deployment of the largest Cray system.

  • The article presents the coordination and control processes to manage the deployment.

  • The article presents details and results from many acceptance tests.

  • Sustained petascale performance was measured for a broad mix of applications.

  • Blue Waters is one of the most powerful systems currently available for open science.

Abstract

Deployment of a large parallel system typically involves several steps of preparation, delivery, installation, testing and acceptance, making such deployments a very complex process. Although various petascale systems are currently in operation, the steps taken and lessons learned during their deployment are rarely described in the literature. This article documents our experiences from the deployment of the sustained petascale Blue Waters system at NCSA. Our presentation focuses on the final deployment steps, in which the system was intensively tested and accepted by NCSA. These experiences and lessons should be useful in guiding similarly complex deployments of large systems in the future.

Introduction

Blue Waters is one of the most powerful supercomputers currently available to the open-science community. Sponsored by the US National Science Foundation (NSF) and installed at the National Center for Supercomputing Applications (NCSA) of the University of Illinois at Urbana-Champaign, Blue Waters is also the largest machine ever built by Cray. In addition, it has tremendous amounts of memory and persistent storage. Various application groups are already achieving the sustained petascale capability of the system, and there is great potential for scientific discoveries in the coming years.

This article contains two contributions that are rare in the literature. First, it reveals several details of the machine deployment process, including the methods and procedures followed for system assessment and acceptance. Second, it provides first-hand lessons learned from that deployment, based on the obstacles we faced and the solutions we adopted. These experiences should prove useful in guiding similarly complex deployments in the future. In addition, the article serves as an early evaluation of Blue Waters: the presented performance results can be used as a reference by the application groups as they continue to tune their codes to the Blue Waters architecture.

The remainder of this article is organized as follows. Section 2 briefly describes the Blue Waters architecture, and Section 3 presents the timeline of its deployment. Section 4 shows the infrastructure created by NCSA to support the deployment. The acceptance tests and many of their results are presented in Section 5. Section 6 contains details of the post-acceptance upgrade of the system's acceleration capability, Section 7 lists reliability figures observed during initial operations, and Section 8 briefly outlines the Web-based environment created for user support. Major lessons from the deployment are highlighted in Section 9. Finally, Section 10 concludes our presentation.

Section snippets

Blue Waters architecture

The Blue Waters architecture is depicted in Fig. 1. Its computational component is heterogeneous and contains both XE and XK compute nodes; an XE node contains two 2.3 GHz AMD Interlagos x86 processors with 16 integer cores per processor, whereas an XK node has one AMD Interlagos processor and one NVIDIA Kepler K20X GPU. The XE nodes have 64 GB of memory, while the XK nodes have 32 GB for the CPU and 6 GB for the GPU. Besides the XE and XK compute nodes, the system also has 784 service nodes,
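As a concrete reading of the node parameters quoted above, the following minimal Python sketch (ours, not part of the article) records the XE and XK configurations and prints per-node totals; the class and variable names are purely illustrative.

```python
# Illustrative sketch: the per-node XE/XK figures quoted in this section,
# captured as simple records. Names and structure are our own, not the
# article's; only the numbers come from the text.
from dataclasses import dataclass

@dataclass
class NodeType:
    name: str
    cpus: int            # AMD Interlagos processors per node
    cores_per_cpu: int   # integer cores per processor
    cpu_memory_gb: int
    gpus: int = 0
    gpu_memory_gb: int = 0

xe = NodeType("XE", cpus=2, cores_per_cpu=16, cpu_memory_gb=64)
xk = NodeType("XK", cpus=1, cores_per_cpu=16, cpu_memory_gb=32,
              gpus=1, gpu_memory_gb=6)

for node in (xe, xk):
    print(f"{node.name}: {node.cpus * node.cores_per_cpu} integer cores, "
          f"{node.cpu_memory_gb} GB CPU memory, {node.gpu_memory_gb} GB GPU memory")
```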

Deployment timeline

The Blue Waters project started in 2006, in response to the NSF solicitation [1] calling for a sustained petascale system. NCSA was declared the winner of that competition in August 2007. In the following years, NCSA worked with vendors and with various application groups, selected by NSF, to prepare their applications to run at the petascale level. As part of the contract signed with Cray, a Statement-Of-Work (SOW) was crafted, containing a sequence of steps for the delivery by Cray of the

Infrastructure created for acceptance

In preparation for the acceptance of Blue Waters, NCSA staff designed hundreds of tests covering both functionality and performance for all system areas. Those tests encompassed all SOW items as well as additional features that NCSA considered important for productive system operation. This section describes the main characteristics of this test design phase and of the infrastructure created to manage test execution towards system acceptance.

Acceptance tests conducted

Beyond the tests with the Cray system, our acceptance tests covered the near-line storage (tape libraries), the external network connectivity, and a variety of procedures for user support. To better handle the various tests, we created an attribute called the test color. This attribute served as a coarse classification of the type of test, according to Table 1.
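As a rough illustration of how such a coarse attribute can be used, the sketch below tags each acceptance test with a color and filters tests by that classification. The color values shown are placeholders; the real categories are defined in Table 1 of the article, and all names here are hypothetical.

```python
# Hypothetical sketch of the "test color" attribute described above.
# The actual color categories appear in Table 1 of the article; the values
# below are placeholders, not the real classification.
from dataclasses import dataclass
from enum import Enum

class TestColor(Enum):
    COLOR_A = "placeholder category A"
    COLOR_B = "placeholder category B"

@dataclass
class AcceptanceTest:
    name: str
    area: str          # e.g. near-line storage, external network, user support
    color: TestColor   # coarse classification of the test
    passed: bool = False

def tests_by_color(tests, color):
    """Return the subset of tests with a given coarse classification."""
    return [t for t in tests if t.color is color]
```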

Specifically for acceptance of the components delivered by Cray, we designed the tests listed in Table 2. As execution of the tests

System upgrade post-acceptance

During the first months of post-acceptance operation, Blue Waters showed consistently high utilization of its GPU-equipped XK nodes. To illustrate this fact, Fig. 6 shows the observed system utilization for the month of June 2013, when Blue Waters was still operating under its original configuration. The first plot, in Fig. 6(a), shows the overall utilization (i.e. XE and XK nodes combined), whereas the remaining plots show the utilization specific to the XE and XK regions, highlighted in blue and
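To make the notion of per-region utilization concrete, the sketch below computes the fraction of available node-hours consumed by jobs in the XE and XK regions separately, in the spirit of Fig. 6. The accounting record format and field names are assumptions of ours, not the article's tooling.

```python
# Illustrative only: per-region utilization from hypothetical job accounting
# records; the schema ('region', 'nodes', 'hours') is an assumption.
from collections import defaultdict

def utilization_by_region(jobs, capacity_node_hours):
    """Fraction of available node-hours consumed per region ('XE', 'XK')."""
    used = defaultdict(float)
    for job in jobs:
        used[job["region"]] += job["nodes"] * job["hours"]
    return {region: used[region] / capacity_node_hours[region]
            for region in capacity_node_hours}

# Example call for a 30-day month (720 hours), with node counts left symbolic:
# utilization_by_region(jobs, {"XE": num_xe_nodes * 720, "XK": num_xk_nodes * 720})
```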

System reliability

We now provide a preliminary analysis of reliability factors related to jobs run on Blue Waters. We start the section with an overview of the most critical factor that can lead to job failure, namely node failures. Because checkpoint/restart is still the dominant technique adopted by users to tolerate system faults, we also provide some guidance on the best parameters to use with checkpointing schemes.
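The reference list includes Daly's higher-order estimate of the optimum checkpoint interval; as background for the guidance mentioned above, the sketch below implements only the classic first-order approximation τ_opt ≈ sqrt(2δM), where δ is the time to write a checkpoint and M is the mean time between failures seen by the job. This is our illustration of the general relation, not the article's own recommendation, and the numeric inputs are invented.

```python
# First-order (Young-style) estimate of the optimum compute time between
# checkpoints: tau_opt ≈ sqrt(2 * delta * M). Daly's cited work refines this
# with higher-order terms. All numbers below are hypothetical, not Blue
# Waters measurements.
import math

def optimal_checkpoint_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Return the estimated optimum interval (seconds) between checkpoints."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

if __name__ == "__main__":
    delta = 10 * 60       # hypothetical: 10 minutes to write a checkpoint
    mtbf = 24 * 3600      # hypothetical: one failure per day affects the job
    tau = optimal_checkpoint_interval(delta, mtbf)
    print(f"checkpoint roughly every {tau / 3600:.1f} hours of computation")
```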

User support environment

In parallel to installation of the computational system, NCSA also developed a broad infrastructure to support effective usage of the machine. The focal point of that infrastructure is the Blue Waters portal,4 a website containing information and resources that a user might need for his/her interaction with the system. That information includes documentation such as user guides, tutorials, materials from educational events, links to points of contact, and

Major lessons learned

In this section, we discuss some of the problems that we faced during the deployment process and how those problems were handled, and we highlight the major lessons that we learned from that process. We start with a chronological perspective, beginning when installation started.

In early 2012, we identified three major risks for the Blue Waters project: (a) system scale, as Blue Waters had the greatest number of cabinets and the largest network ever built by Cray; (b) new disk controllers, which were

Conclusion

Many application groups are advancing their research via the enormous computational capabilities of Blue Waters. As an example, recent NAMD simulations enabled determination of the precise chemical structure of the HIV capsid [20], a protein shell that protects the virus’ genetic material and has become an attractive target for the development of new antiretroviral drugs.

In this article, we described the Blue Waters deployment process, presented results from its acceptance tests, and

Acknowledgments

We thank the entire Blue Waters team for participation in the deployment process and for valuable help in the numerous activities required by that deployment. We also thank many Cray employees who worked intensively on tasks related to this deployment, in particular Kevin Stelljes, Joe Glenski, Pat Duffy, Luiz DeRose and Peg Williams. Finally, we thank the anonymous reviewers for their comments and suggestions, which made the text more consistent and clear.

This work is part of the Blue Waters

References (20)

  • C.L. Mendes et al., Deploying a large petascale system: the Blue Waters experience
  • J.T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Fut. Generat. Comput. Syst. (2006)
  • NSF, Leadership-Class System Acquisition – Creating a Petascale Computing Environment for Science and Engineering, Solicitation NSF 06-573 (2006)
  • J. Towns, Evolving from TeraGrid to XSEDE
  • Atlassian Pty. Ltd., JIRA Documentation. URL: ...
  • T. Hoefler et al., A network performance measurement framework
  • J. Enos et al., Topology-aware job scheduling strategies
  • D.A. Donzis et al., Turbulence simulations on O(10⁴) processors
  • MILC Collaboration, MIMD Lattice Computation (MILC) Collaboration. Home-page information available online at ...
  • J.C. Phillips et al., Scalable molecular dynamics with NAMD, J. Comput. Chem. (2005)
