Title: Using Monitoring Data to Improve HPC Performance via Network-Data-Driven Allocation.

Conference

Abstract not provided.

Research Organization:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Sandia National Lab. (SNL-CA), Livermore, CA (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA), Office of Defense Nuclear Security
DOE Contract Number:
NA0003525
OSTI ID:
1891963
Report Number(s):
SAND2021-10802C; 700889
Resource Relation:
Conference: Proposed for presentation at the IEEE HPEC held September 20-24, 2021
Country of Publication:
United States
Language:
English

References (25)

GPCNeT: designing a benchmark suite for inducing and measuring contention in HPC networks
  • Chunduri, Sudheer; Groves, Taylor; Mendygral, Peter
  • SC '19: The International Conference for High Performance Computing, Networking, Storage, and Analysis, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis https://doi.org/10.1145/3295500.3356215
conference November 2019
Predicting application performance using supervised learning on communication features
  • Jain, Nikhil; Bhatele, Abhinav; Robson, Michael P.
  • Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '13) https://doi.org/10.1145/2503210.2503263
conference January 2013
Lpms conference August 2019
Evaluation of an Interference-free Node Allocation Policy on Fat-tree Clusters
  • Pollard, Samuel D.; Jain, Nikhil; Herbein, Stephen
  • SC18: International Conference for High Performance Computing, Networking, Storage and Analysis https://doi.org/10.1109/SC.2018.00029
conference November 2018
Level-Spread: A New Job Allocation Policy for Dragonfly Networks conference May 2018
Choreo conference October 2013
Network-Aware Scheduling for Data-Parallel Jobs journal August 2015
The Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications
  • Agelastos, Anthony; Allan, Benjamin; Brandt, Jim
  • SC14: International Conference for High Performance Computing, Networking, Storage and Analysis https://doi.org/10.1109/SC.2014.18
conference November 2014
Holistic Measurement-Driven System Assessment conference September 2017
Integrating Low-latency Analysis into HPC System Monitoring
  • Izadpanah, Ramin; Naksinehaboon, Nichamon; Brandt, Jim
  • ICPP 2018: 47th International Conference on Parallel Processing, Proceedings of the 47th International Conference on Parallel Processing https://doi.org/10.1145/3225058.3225086
conference August 2018
Quantifying the impact of network congestion on application performance and network metrics conference September 2020
Fast Parallel Algorithms for Short-Range Molecular Dynamics journal March 1995
Run-to-run variability on Xeon Phi based Cray XC systems
  • Chunduri, Sudheer; Harms, Kevin; Parker, Scott
  • SC '17: The International Conference for High Performance Computing, Networking, Storage and Analysis, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis https://doi.org/10.1145/3126908.3126926
conference November 2017
Maximizing Throughput on a Dragonfly Network
  • Jain, Nikhil; Bhatele, Abhinav; Ni, Xiang
  • SC14: International Conference for High Performance Computing, Networking, Storage and Analysis https://doi.org/10.1109/SC.2014.33
conference November 2014
Improving inter-node communications in multi-core clusters using a contention-free process mapping algorithm journal April 2013
Quiet Neighborhoods: Key to Protect Job Performance Predictability conference May 2015
The Case of Performance Variability on Dragonfly-based Systems conference May 2020
APHiD: Hierarchical Task Placement to Enable a Tapered Fat Tree Topology for Lower Power and Cost in HPC Networks conference May 2017
There goes the neighborhood: performance degradation due to nearby jobs
  • Bhatele, Abhinav; Mohror, Kathryn; Langer, Steven H.
  • Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '13) https://doi.org/10.1145/2503210.2503247
conference January 2013
Heterogeneity-Aware Workload Placement and Migration in Distributed Sustainable Datacenters conference May 2014
Load Balancing in a Cluster Computer
  • Werstein, Paul; Situ, Hailing; Huang, Zhiyi
  • 2006 Seventh International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT'06) https://doi.org/10.1109/PDCAT.2006.77
conference January 2006
Cooling-Aware Job Scheduling and Node Allocation for Overprovisioned HPC Systems conference May 2017
Technology-Driven, Highly-Scalable Dragonfly Topology
  • Kim, John; Dally, William J.; Scott, Steve
  • 2008 35th International Symposium on Computer Architecture (ISCA), 2008 International Symposium on Computer Architecture https://doi.org/10.1109/ISCA.2008.19
conference June 2008
A new metric for ranking high-performance computing systems journal January 2016
The Outer Rim Simulation: A Path to Many-core Supercomputers journal November 2019