ABSTRACT
As most popular deep neural network (DNN) frameworks rely on closed-source CUDA implementations targeting one or more NVIDIA GPUs, they cannot efficiently leverage devices other than NVIDIA GPUs in cluster mode to accelerate DNN training and inference. To accelerate DNNs on heterogeneous multi-/many-core clusters, we propose an OpenCL-based DNN framework called UHCL-Darknet. First, we design a unified OpenCL platform model for the heterogeneous cluster, called UHCL, together with an adaptive runtime system featuring an affinity-based dynamic scheduler, enabling transparent utilization of a wide variety of vendor-specific OpenCL devices in the cluster. Then, we extend Darknet to UHCL by introducing parallel optimizations for DNNs, such as parallelizing Winograd-based convolutions and auto-tuning parameterized OpenCL kernels. The training and inference of state-of-the-art DNN models (e.g., YOLOv2, ResNet-50, and DenseNet-201) are executed on an experimental heterogeneous cluster. The results show that UHCL-Darknet is a scalable and portable DNN framework for heterogeneous clusters, achieving average speedups of 1.9x and 2.2x in image throughput for data-parallel training and inference, respectively.
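The paper itself provides no code here; as an illustration of the Winograd-based convolution named in the abstract, the following is a minimal Python sketch (not taken from UHCL-Darknet) of the 1-D case F(2,3), which produces two outputs of a 3-tap filter with 4 multiplications instead of the 6 required by direct convolution. The function names are hypothetical; the real framework applies the 2-D analogue F(2x2,3x3) inside OpenCL kernels.

```python
# Winograd minimal filtering, F(2,3): two outputs of a 3-tap filter
# over a 4-element input tile using only 4 multiplications.
# (Illustrative sketch; function names are not from the paper.)

def winograd_f23(d, g):
    """Two outputs of filter g (len 3) over input tile d (len 4), 4 mults."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct_f23(d, g):
    """Reference: the same two outputs by direct convolution (6 mults)."""
    return [sum(d[i + j] * g[j] for j in range(3)) for i in range(2)]
```

The filter-dependent factors (g[0]+g[1]+g[2])/2 and (g[0]-g[1]+g[2])/2 are constant per filter, so in a convolution layer they are precomputed once, leaving 4 multiplications per output pair at inference time.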
Index Terms
- UHCL-Darknet: An OpenCL-based Deep Neural Network Framework for Heterogeneous Multi-/Many-core Clusters