
GRAP: Group-level Resource Allocation Policy for Reconfigurable Dragonfly Network in HPC

Published: 21 June 2023

Abstract

Dragonfly is a highly scalable, low-diameter, and cost-efficient network topology that has been adopted in new exascale High Performance Computing (HPC) systems. However, the Dragonfly topology suffers from a limited number of direct links between groups. A reconfigurable network can address this problem by reconfiguring the topology to adjust the number of direct links between groups. While previous work has evaluated the performance improvement of a single job on a reconfigurable HPC network, the performance of HPC workloads has not been studied because an appropriate resource allocation policy has been lacking.
In this work, we propose the Group-level Resource Allocation Policy (GRAP) to allocate both compute nodes and Reconfigurable Links to jobs in a Reconfigurable Dragonfly Network (RDN). We start by formulating three design principles: a reconfigurable network should be free of reconfiguration interference, guarantee connectivity and performance for each job, and satisfy varied resource requests. Following these principles, GRAP uses different strategies for small and large jobs and provides three allocation modes for large jobs: Balance Mode, Custom Mode, and Adaptive Mode. Finally, we evaluate GRAP with the CODES network simulation framework and the Slurm Simulator using real workload traces. The results demonstrate that RDN coupled with GRAP achieves lower latency, higher bandwidth, and lower job wait time.
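
As a rough illustration of why direct links between groups are scarce in a Dragonfly, the following Python sketch counts the global links available between a pair of groups. It is not GRAP and is not taken from the paper. The parameter names (a routers per group, h global links per router, g groups) follow common Dragonfly notation, the function names are hypothetical, and the idea that every global link of a group can be re-pointed at the other groups of the same job is a simplifying assumption made only for this example.

```python
def links_per_group_pair(a: int, h: int, g: int) -> float:
    """Average number of direct links between two groups in a static Dragonfly.

    Each group owns a * h global links, spread evenly over the other g - 1 groups.
    """
    if a < 1 or h < 1 or g < 2:
        raise ValueError("need a >= 1, h >= 1, g >= 2")
    return (a * h) / (g - 1)


def links_per_pair_within_job(a: int, h: int, job_groups: int) -> float:
    """Links per group pair if a group's global links served only one job.

    Assumption (illustrative only): all a * h global links of a group are
    reconfigurable and are pointed solely at the other groups of that job.
    """
    if a < 1 or h < 1 or job_groups < 2:
        raise ValueError("need a >= 1, h >= 1, job_groups >= 2")
    return (a * h) / (job_groups - 1)


if __name__ == "__main__":
    # Hypothetical system: 8 routers per group, 4 global links each, 33 groups.
    print(links_per_group_pair(8, 4, 33))       # static topology
    print(links_per_pair_within_job(8, 4, 5))   # job confined to 5 groups
```

With these hypothetical numbers, the static topology gives each pair of groups a single direct link, while a job confined to 5 groups could, under the stated assumption, have 8 links between each pair of its groups.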

Cited By

  • (2024) Orchid: enhancing HPC interconnection networks through infrequent topology reconfiguration. Journal of Optical Communications and Networking 16:6 (644). DOI: 10.1364/JOCN.516031. Online publication date: 21-May-2024
  • (2024) UNR: Unified Notifiable RMA Library for HPC. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (1-15). DOI: 10.1109/SC41406.2024.00111. Online publication date: 17-Nov-2024
  • (2024) MUSE: A Runtime Incrementally Reconfigurable Network Adapting to HPC Real-Time Traffic. 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (765-779). DOI: 10.1109/IPDPS57955.2024.00073. Online publication date: 27-May-2024
  • (2024) ACO-Optimized DRL Model for Energy-Efficient Resource Allocation in High-Performance Computing. Multi-Strategy Learning Environment (143-154). DOI: 10.1007/978-981-97-1488-9_11. Online publication date: 29-May-2024

    Published In

    ICS '23: Proceedings of the 37th ACM International Conference on Supercomputing
    June 2023
    505 pages
    ISBN:9798400700569
    DOI:10.1145/3577193
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. reconfigurable network
    2. dragonfly
    3. HPC
    4. resource allocation

    Qualifiers

    • Research-article

    Conference

    ICS '23

    Acceptance Rates

    Overall Acceptance Rate 629 of 2,180 submissions, 29%
