ABSTRACT
Recent In-Network Aggregation (INA) solutions offload the all-reduce operation onto network switches to accelerate and scale distributed training (DT). On end hosts, these solutions replace the transport layer with custom network stacks. Such INA-oriented stacks cannot leverage state-of-the-art, high-performance transport implementations, and they complicate system development and operation.
We design NetReduce, a transport-transparent INA primitive for modern multi-rack data centers. NetReduce runs beneath the transport layer: the switch performs aggregation but preserves the end-to-end transmission connections, while the host uses RoCE as its transport layer to deliver gradient messages and receive aggregation results. NetReduce thus gains from both INA and RoCE: linear scalability, traffic reduction, and freed-up bandwidth from INA; high throughput, low latency, and low CPU overhead from RoCE. For jobs spanning several multi-GPU machines, we also devise a parallel all-reduce on top of NetReduce that uses intra-machine and inter-machine bandwidth efficiently. We prototype NetReduce on an FPGA board attached to an Ethernet switch, compare it with existing programmable-switch-based solutions, and justify the FPGA-based design choice. We evaluate NetReduce's performance by training typical Deep Neural Network models on single-GPU and multi-GPU testbeds. NetReduce inter-operates with the existing Ethernet transport layer, is training-framework friendly, accelerates network-intensive DT jobs effectively (e.g., by 70% for AlexNet), reduces CPU overhead (e.g., only one core for transmission), and is cost-effective (e.g., only 2.40% more capital expense and 0.68% more power consumption yield 12.3-57.9% more acceleration).
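To make the offloaded operation concrete, the following is a minimal sketch of the all-reduce semantics that INA moves onto the switch: every worker contributes a gradient vector, and every worker receives the element-wise sum. The function name and data layout here are illustrative, not NetReduce's actual API or packet format.

```python
def all_reduce(worker_gradients):
    """Return the aggregated gradient that every worker receives.

    worker_gradients: list of equal-length gradient vectors, one per worker.
    """
    length = len(worker_gradients[0])
    aggregated = [0.0] * length
    for grad in worker_gradients:
        for i, value in enumerate(grad):
            aggregated[i] += value
    # With in-network aggregation, this summation happens on the switch as
    # gradient packets traverse it, so hosts never exchange full gradient
    # vectors pairwise; each host sends once and receives the sum.
    return aggregated

grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # three workers
print(all_reduce(grads))  # each worker gets [9.0, 12.0]
```

The end-to-end behavior is unchanged from a host-side all-reduce; what INA changes is where the summation runs and how much traffic crosses the network.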
In-Network Aggregation with Transport Transparency for Distributed Training