DOI: 10.1145/2063348.2063352

Sustained systems performance monitoring at the U.S. Department of Defense High Performance Computing Modernization Program

Published: 12 November 2011

ABSTRACT

The U.S. Department of Defense High Performance Computing Modernization Program (HPCMP) has implemented sustained systems performance (SSP) testing on the high performance computing systems in use at DoD Supercomputing Resource Centers. The intent is to monitor performance improvements delivered by updates to the operating system, compiler suites, and numerical and communications libraries, and to monitor performance penalties arising from security patches. In practice, each system's workload is simulated by an appropriate choice of user application codes representative of the HPCMP computational technology areas. Past successes include surfacing an imminent failure of an OST (a Lustre object storage target) in a Cray XT3, an incompletely configured scheduler update on an SGI Altix 4700, performance issues associated with a communications library update for a Linux Networx Advanced Technology Cluster, and intermittent resetting of Intel Nehalem cores from turbo mode to standard mode. This history demonstrates that SSP testing is critical to delivering the highest quality of service to HPCMP users.
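The methodology the abstract describes amounts to rerunning representative application codes after each system software change and comparing their timings against recorded baselines, so that regressions stand out. As a rough illustration only (this is not the HPCMP harness), the following Python sketch shows the basic shape of such a check; the benchmark names, baseline figures, the run_benchmark wrapper, the ./run_<name>.sh scripts, and the 5% tolerance are all hypothetical.

#!/usr/bin/env python3
"""Minimal sketch of an SSP-style regression check (hypothetical;
not the HPCMP harness). Compares fresh benchmark timings against
recorded baselines and flags runs outside an allowed tolerance."""

import statistics
import subprocess
import time

# Hypothetical baseline wall-clock times (seconds), recorded when the
# system was accepted, keyed by application benchmark name.
BASELINES = {
    "hycom_std": 1842.0,   # ocean-modeling workload
    "cth_shock": 2210.0,   # shock-physics workload
    "gamess_dft": 1530.0,  # quantum-chemistry workload
}

TOLERANCE = 0.05  # flag anything more than 5% slower than baseline


def run_benchmark(name, trials=3):
    """Run the named benchmark script several times and return the
    median wall-clock time; assumes ./run_<name>.sh exists."""
    times = []
    for _ in range(trials):
        start = time.perf_counter()
        subprocess.run(["./run_{}.sh".format(name)], check=True)
        times.append(time.perf_counter() - start)
    return statistics.median(times)


def check_system():
    """Return True only if every benchmark is within tolerance."""
    healthy = True
    for name, baseline in BASELINES.items():
        observed = run_benchmark(name)
        slowdown = (observed - baseline) / baseline
        ok = slowdown <= TOLERANCE
        print("{}: {:.1f}s vs {:.1f}s baseline ({:+.1%}) {}".format(
            name, observed, baseline, slowdown,
            "OK" if ok else "REGRESSION"))
        healthy = healthy and ok
    return healthy


if __name__ == "__main__":
    raise SystemExit(0 if check_system() else 1)

Run after each operating system, compiler, or library update, a nonzero exit status from a script of this shape would prompt exactly the kind of investigation the abstract recounts, such as the failing OST or the intermittent Nehalem turbo-mode resets.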


Published in

SC '11: State of the Practice Reports
November 2011, 242 pages
ISBN: 9781450311397
DOI: 10.1145/2063348
Copyright © 2011 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate: 1,516 of 6,373 submissions, 24%
