GRAP: Group-level Resource Allocation Policy for Reconfigurable Dragonfly Network in HPC

Authors:

Yutong Lu

ICS '23: Proceedings of the 37th International Conference on Supercomputing

Pages 437 - 449

https://doi.org/10.1145/3577193.3593732

Published: 21 June 2023

Abstract

Dragonfly is a highly scalable, low-diameter, and cost-efficient network topology that has been adopted in new exascale High Performance Computing (HPC) systems. However, the Dragonfly topology suffers from a limited number of direct links between groups. A reconfigurable network can address this problem by reconfiguring the topology to adjust the number of direct links between groups. While previous work has evaluated the performance improvement of a single job on a reconfigurable HPC network, the performance of HPC workloads has not been studied because an appropriate resource allocation policy has been lacking.
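To make the link limitation concrete, the following minimal sketch computes the average number of direct global links between a pair of groups. It assumes a canonical Dragonfly in which every router contributes the same number of global links and those links are spread uniformly over the other groups. The function and parameter names are illustrative and do not come from the paper.

```python
def direct_links_per_group_pair(routers_per_group: int,
                                global_links_per_router: int,
                                num_groups: int) -> float:
    """Average direct global links between any two groups.

    Assumption (not from the paper): the global links of a group are
    distributed uniformly over the other (num_groups - 1) groups.
    """
    if num_groups < 2:
        raise ValueError("need at least two groups")
    # Total global links leaving one group.
    global_links_per_group = routers_per_group * global_links_per_router
    return global_links_per_group / (num_groups - 1)


# A maximum-size system: 8 routers per group, 4 global links each,
# 8 * 4 + 1 = 33 groups, so each pair of groups shares one direct link.
full_system = direct_links_per_group_pair(8, 4, 33)

# Idealized upper bound if the same links could be steered to only
# 9 groups (for example, the groups occupied by one job).
steered = direct_links_per_group_pair(8, 4, 9)
```

Under these assumptions the full 33-group system has a single direct link per group pair, while steering the same links toward 9 groups would give four. This is only an upper bound for intuition; it does not model the paper's reconfigurable links or its allocation policy.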

In this work, we propose the Group-level Resource Allocation Policy (GRAP), which allocates both compute nodes and Reconfigurable Links to jobs in a Reconfigurable Dragonfly Network (RDN). We start by formulating three design principles: a reconfigurable network should be free of reconfiguration interference, guarantee connectivity and performance for each job, and satisfy varied resource requests. Following these principles, GRAP uses different strategies for small and large jobs and provides three allocation modes for large jobs: Balance Mode, Custom Mode, and Adaptive Mode. Finally, we evaluate GRAP with the CODES network simulation framework and the Slurm Simulator using real workload traces. The results demonstrate that RDN coupled with GRAP achieves lower latency, higher bandwidth, and lower job wait time.

Index Terms

GRAP: Group-level Resource Allocation Policy for Reconfigurable Dragonfly Network in HPC
1. Networks
  1. Network algorithms
    1. Control path algorithms
      1. Network resources allocation

Published In

ICS '23: Proceedings of the 37th ACM International Conference on Supercomputing

June 2023

505 pages

ISBN:9798400700569

DOI:10.1145/3577193

Chair:
Kyle Gallivan,
Co-chair:
Efstratios Gallopoulos,
Program Co-chairs:
Dimitrios S. Nikolopoulos,
Ramon Beivide

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ICS '23

Sponsor:

SIGARCH

ICS '23: 37th International Conference on Supercomputing

June 21 - 23, 2023

Orlando, FL, USA

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Bibliometrics

Total Citations: 4

Total Downloads: 342

Downloads (Last 12 months): 156

Downloads (Last 6 weeks): 12

Reflects downloads up to 01 Mar 2025

Cited By

Qin L, Gu H, Yu X, Cai Z, Liu J. (2024). Orchid: enhancing HPC interconnection networks through infrequent topology reconfiguration. Journal of Optical Communications and Networking 16:6 (644). Online publication date: 21-May-2024. https://doi.org/10.1364/JOCN.516031

Feng G, Xie J, Dong D, Lu Y. (2024). UNR: Unified Notifiable RMA Library for HPC. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (1-15). Online publication date: 17-Nov-2024. https://dl.acm.org/doi/10.1109/SC41406.2024.00111

Li Z, Chen Z, Tang Y, Ai X, Zhu Y, Zhao Z, Shao J, Liu G, Liu S, Liu B, Xu Y. (2024). MUSE: A Runtime Incrementally Reconfigurable Network Adapting to HPC Real-Time Traffic. 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (765-779). Online publication date: 27-May-2024. https://doi.org/10.1109/IPDPS57955.2024.00073

Jermanshiyamala A, Kumar N, Belhe S, Sreekanth K, Ray S, Sengan S. (2024). ACO-Optimized DRL Model for Energy-Efficient Resource Allocation in High-Performance Computing. Multi-Strategy Learning Environment (143-154). Online publication date: 29-May-2024. https://doi.org/10.1007/978-981-97-1488-9_11
