
UHCL-Darknet: An OpenCL-based Deep Neural Network Framework for Heterogeneous Multi-/Many-core Clusters

Published: 13 August 2018

ABSTRACT

Because the majority of popular deep neural network (DNN) frameworks focus on closed-format CUDA implementations targeting one or more NVIDIA GPUs, they cannot efficiently leverage devices other than NVIDIA GPUs in a cluster to accelerate DNN training and inference. To accelerate DNNs on heterogeneous multi-/many-core clusters, we propose an OpenCL-based DNN framework called UHCL-Darknet. First, we design a unified OpenCL platform model for the heterogeneous cluster, called UHCL, together with an adaptive runtime system featuring an affinity-based dynamic scheduler for UHCL, enabling transparent use of a wide variety of vendor-specific OpenCL devices in the heterogeneous cluster. Then, we extend Darknet to UHCL by introducing parallel optimizations of DNNs, such as parallelized Winograd-based convolutions and auto-tuned parameterized OpenCL kernels. The training and inference of state-of-the-art DNN models (e.g., YOLOv2, ResNet-50, and DenseNet-201) are executed on an experimental heterogeneous cluster. Results show that UHCL-Darknet is a scalable and portable DNN framework for heterogeneous clusters, achieving average speedups of 1.9x and 2.2x in image throughput for data-parallel training and inference, respectively, on the experimental heterogeneous cluster.
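To make the affinity-based dynamic scheduling idea concrete, the following minimal C sketch shows how a runtime might pick an OpenCL device for a layer task by combining queued work with a data-affinity penalty. The device fields, the fixed transfer cost, and the scoring heuristic are illustrative assumptions, not UHCL-Darknet's actual scheduler.

/* Minimal sketch of affinity-based dynamic scheduling (illustrative only;
 * the names and the cost heuristic are assumptions, not the paper's code). */
#include <stdio.h>

#define NUM_DEVICES 3

typedef struct {
    const char *name;      /* e.g., a CPU, an NVIDIA GPU, an AMD GPU        */
    double      queued_ms; /* estimated work already queued on the device   */
    double      affinity;  /* 1.0 if the task's input buffers are resident  */
} device_t;

/* Pick the device that minimizes estimated completion time, weighted by
 * data affinity so a task prefers a device that already holds its inputs. */
static int pick_device(const device_t *dev, int n, double task_ms)
{
    int best = 0;
    double best_cost = 0.0;
    for (int i = 0; i < n; ++i) {
        /* Penalize devices without the data by an assumed transfer cost. */
        double transfer_ms = (dev[i].affinity > 0.5) ? 0.0 : 5.0;
        double cost = dev[i].queued_ms + task_ms + transfer_ms;
        if (i == 0 || cost < best_cost) { best = i; best_cost = cost; }
    }
    return best;
}

int main(void)
{
    device_t devices[NUM_DEVICES] = {
        { "cpu0", 2.0, 0.0 },
        { "gpu0", 1.0, 1.0 },  /* already holds the layer's weights */
        { "gpu1", 2.0, 0.0 },
    };
    int chosen = pick_device(devices, NUM_DEVICES, 3.0);
    printf("dispatch layer task to %s\n", devices[chosen].name);
    return 0;
}

In this toy run the scheduler dispatches to gpu0: even with similar queue lengths, the resident data removes the transfer penalty, which is the intuition behind affinity-based dispatch.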
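Similarly, Winograd-based convolution reduces the number of multiplications per output. The C sketch below shows the 1-D minimal-filtering case F(2,3), which produces two outputs of a 3-tap convolution with four multiplications instead of six; the 2-D tiled form used inside convolution kernels follows the same pattern. This is an illustrative example under that assumption, not code from the paper.

/* Minimal sketch of 1-D Winograd minimal filtering F(2,3): two convolution
 * outputs from a 3-tap filter using 4 multiplications instead of 6. */
#include <stdio.h>

static void winograd_f23(const float d[4], const float g[3], float y[2])
{
    /* Filter transform (can be precomputed once per filter). */
    float G0 = g[0];
    float G1 = 0.5f * (g[0] + g[1] + g[2]);
    float G2 = 0.5f * (g[0] - g[1] + g[2]);
    float G3 = g[2];

    /* Element-wise products on the transformed input. */
    float m0 = (d[0] - d[2]) * G0;
    float m1 = (d[1] + d[2]) * G1;
    float m2 = (d[2] - d[1]) * G2;
    float m3 = (d[1] - d[3]) * G3;

    /* Output transform. */
    y[0] = m0 + m1 + m2;
    y[1] = m1 - m2 - m3;
}

int main(void)
{
    float d[4] = {1, 2, 3, 4}, g[3] = {0.5f, 1.0f, 0.25f}, y[2];
    winograd_f23(d, g, y);
    /* Direct convolution for comparison:
       y0 = d0*g0 + d1*g1 + d2*g2, y1 = d1*g0 + d2*g1 + d3*g2 */
    printf("winograd: %.3f %.3f  direct: %.3f %.3f\n",
           y[0], y[1],
           d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
           d[1]*g[0] + d[2]*g[1] + d[3]*g[2]);
    return 0;
}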

References

  1. Martín Abadi, Paul Barham, Jianmin Chen, et al. 2016. TensorFlow: A System for Large-scale Machine Learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI '16). USENIX Association, Berkeley, CA, USA, 265--283. http://dl.acm.org/citation.cfm?id=3026877.3026899
  2. Ashwin M. Aji, Antonio J. Peña, Pavan Balaji, and Wu-chun Feng. 2016. MultiCL: Enabling automatic scheduling for task-parallel workloads in OpenCL. Parallel Comput. 58 (2016), 37--55.
  3. Albano Alves, José Rufino, António Pina, and Luís Paulo Santos. 2013. clOpenCL - Supporting Distributed Heterogeneous Computing in HPC Clusters. In Euro-Par 2012: Parallel Processing Workshops. Springer Berlin Heidelberg, Berlin, Heidelberg, 112--122.
  4. Ryo Aoki, Shuichi Oikawa, Takashi Nakamura, and Satoshi Miki. 2011. Hybrid OpenCL: Enhancing OpenCL for Distributed Processing. In IEEE Ninth International Symposium on Parallel and Distributed Processing with Applications. 149--154.
  5. Ammar Ahmad Awan, Khaled Hamidouche, Jahanzeb Maqbool Hashmi, and Dhabaleswar K. Panda. 2017. S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17). ACM, New York, NY, USA, 193--205.
  6. Olivier Beaumont, Lionel Eyraud-Dubois, and Suraj Kumar. 2017. Approximation Proofs of a Fast and Efficient List Scheduling Algorithm for Task-Based Runtime Systems on Multicores and GPUs. In 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 768--777.
  7. Cheng Chen, Jianbin Fang, Tao Tang, and Canqun Yang. 2017. LU factorization on heterogeneous systems: an energy-efficient approach towards high performance. Computing 99, 8 (Aug 2017), 791--811.
  8. Cen Chen, Kenli Li, Aijia Ouyang, Zhuo Tang, and Keqin Li. 2016. GFlink: An In-Memory Computing Architecture on Heterogeneous CPU-GPU Clusters for Big Data. IEEE Transactions on Parallel and Distributed Systems (2016), 542--551.
  9. Thanh Tuan Dao and Jaejin Lee. 2018. An Auto-Tuner for OpenCL Work-Group Size on GPUs. IEEE Transactions on Parallel and Distributed Systems 29, 2 (Feb 2018), 283--296.
  10. Jianbin Fang, Ana Lucia Varbanescu, Xiangke Liao, and Henk Sips. 2015. Evaluating vector data type usage in OpenCL kernels. Concurrency and Computation: Practice and Experience 27, 17 (2015), 4586--4602.
  11. Edgar Gabriel, Graham E. Fagg, George Bosilca, and Thara Angskun. 2017. Open MPI. https://www.open-mpi.org. (May 2017).
  12. Mehdi Goli, Luke Iwanski, and Andrew Richards. 2017. Accelerated Machine Learning Using TensorFlow and SYCL on OpenCL Devices. In Proceedings of the 5th International Workshop on OpenCL (IWOCL 2017). ACM, New York, NY, USA, Article 8, 4 pages.
  13. Junli Gu, Yibing Liu, Yuan Gao, and Maohua Zhu. 2016. OpenCL Caffe: Accelerating and Enabling a Cross Platform Machine Learning Framework. In Proceedings of the 4th International Workshop on OpenCL (IWOCL '16). ACM, New York, NY, USA, Article 8, 5 pages.
  14. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770--778.
  15. Sylvain Henry, Alexandre Denis, Denis Barthou, Marie-Christine Counilh, and Raymond Namyst. 2014. Toward OpenCL Automatic Multi-Device Support. In Euro-Par 2014: Parallel Processing. Springer International Publishing, Porto, Portugal, 776--787. https://hal.inria.fr/hal-01005765
  16. Gao Huang, Zhuang Liu, and Laurens van der Maaten. 2017. Densely Connected Convolutional Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2261--2269.
  17. Forrest N. Iandola, Khalid Ashraf, Matthew W. Moskewicz, and Kurt Keutzer. 2016. FireCaffe: Near-Linear Acceleration of Deep Neural Network Training on Compute Clusters. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2592--2600.
  18. Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. In Proceedings of the 22nd ACM International Conference on Multimedia (MM '14). ACM, New York, NY, USA, 675--678.
  19. Junghyun Kim, Gangwon Jo, Jaehoon Jung, Jungwon Kim, and Jaejin Lee. 2016. A Distributed OpenCL Framework Using Redundant Computation and Data Replication. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '16). ACM, New York, NY, USA, 553--569.
  20. Jungwon Kim, Sangmin Seo, Jun Lee, Jeongho Nah, Gangwon Jo, and Jaejin Lee. 2012. SnuCL: An OpenCL Framework for Heterogeneous CPU/GPU Clusters. In Proceedings of the 26th ACM International Conference on Supercomputing (ICS '12). ACM, New York, NY, USA, 341--352.
  21. Andrew Lavin and Scott Gray. 2016. Fast Algorithms for Convolutional Neural Networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 4013--4021.
  22. Shijie Li, Yong Dou, Xin Niu, Qi Lv, and Qiang Wang. 2017. A fast and memory saved GPU acceleration algorithm of convolutional neural networks for target detection. Neurocomputing 230 (2017), 48--59.
  23. Fengshun Lu, Junqiang Song, Fukang Yin, and Xiaoqian Zhu. 2012. Performance evaluation of hybrid programming patterns for large CPU/GPU heterogeneous clusters. Computer Physics Communications 183, 6 (2012), 1172--1181.
  24. Microsoft. 2018. Microsoft Cognitive Toolkit (CNTK). https://github.com/Microsoft/CNTK. (March 2018).
  25. Cedric Nugteren and Valeriu Codreanu. 2015. CLTune: A Generic Auto-Tuner for OpenCL Kernels. In 2015 IEEE 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip. 195--202.
  26. Adam Paszke and Sam Gross. 2018. PyTorch. http://pytorch.org/. (March 2018).
  27. Hugh Perkins. 2016. cltorch: A Hardware-Agnostic Backend for the Torch Deep Neural Network Library, Based on OpenCL. CoRR abs/1606.04884 (2016). arXiv:1606.04884
  28. Joseph Redmon. 2013--2017. Darknet: Open Source Neural Networks in C. http://pjreddie.com/darknet/. (2013--2017).
  29. Joseph Redmon and Ali Farhadi. 2017. YOLO9000: Better, faster, stronger. In Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, Honolulu, HI, USA, 7263--7271.
  30. Ben Taylor, Vicent Sanz Marco, and Zheng Wang. 2017. Adaptive Optimization for OpenCL Programs on Embedded Heterogeneous Systems. In Proceedings of the 18th ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES 2017). ACM, New York, NY, USA, 11--20.
  31. Yi Yang, Min Feng, and Srimat T. Chakradhar. 2016. HppCnn: A High-Performance, Portable Deep-Learning Library for GPGPUs. In 2016 45th International Conference on Parallel Processing (ICPP). 582--587.
  32. Peng Zhang, Jianbin Fang, Tao Tang, Canqun Yang, and Zheng Wang. 2018. Tuning Streamed Applications on Intel Xeon Phi: A Machine Learning Based Approach. In the 32nd IEEE International Parallel and Distributed Processing Symposium (IPDPS '18). Vancouver, British Columbia, Canada. arXiv:1802.02760 http://arxiv.org/abs/1802.02760

Published in

ICPP '18: Proceedings of the 47th International Conference on Parallel Processing
August 2018, 945 pages
ISBN: 978-1-4503-6510-9
DOI: 10.1145/3225058
Copyright © 2018 ACM
Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

ICPP '18 paper acceptance rate: 91 of 313 submissions, 29%. Overall acceptance rate: 91 of 313 submissions, 29%.
