ABSTRACT
As most popular deep neural network (DNN) frameworks rely on closed-source CUDA implementations targeting one or more NVIDIA GPUs, they cannot efficiently leverage devices other than NVIDIA GPUs in cluster mode to accelerate DNN training and inference. To accelerate DNNs on heterogeneous multi-/many-core clusters, we propose an OpenCL-based DNN framework called UHCL-Darknet. First, we design a unified OpenCL platform model for the heterogeneous cluster, called UHCL, together with an adaptive runtime system featuring an affinity-based dynamic scheduler, enabling transparent utilization of a wide variety of vendor-specific OpenCL devices in the cluster. Then, we extend Darknet to UHCL by introducing parallel optimizations for DNNs, such as parallelizing Winograd-based convolutions and auto-tuning parameterized OpenCL kernels. The training and inference of state-of-the-art DNN models (e.g., YOLOv2, ResNet-50, and DenseNet-201) are executed on an experimental heterogeneous cluster. The results show that UHCL-Darknet is a scalable and portable DNN framework for heterogeneous clusters, achieving average speedups of 1.9x and 2.2x in image throughput for data-parallel training and inference, respectively.
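The paper itself provides no code here; as an illustration of the Winograd-based convolution named in the abstract, the following is a minimal Python sketch (not taken from UHCL-Darknet) of the 1-D case F(2,3), which produces two outputs of a 3-tap filter with 4 multiplications instead of the 6 required by direct convolution. The function names are hypothetical; the real framework applies the 2-D analogue F(2x2,3x3) inside OpenCL kernels.

```python
# Winograd minimal filtering, F(2,3): two outputs of a 3-tap filter
# over a 4-element input tile using only 4 multiplications.
# (Illustrative sketch; function names are not from the paper.)

def winograd_f23(d, g):
    """Two outputs of filter g (len 3) over input tile d (len 4), 4 mults."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct_f23(d, g):
    """Reference: the same two outputs by direct convolution (6 mults)."""
    return [sum(d[i + j] * g[j] for j in range(3)) for i in range(2)]
```

The filter-dependent factors (g[0]+g[1]+g[2])/2 and (g[0]-g[1]+g[2])/2 are constant per filter, so in a convolution layer they are precomputed once, leaving 4 multiplications per output pair at inference time.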
Index Terms
- UHCL-Darknet: An OpenCL-based Deep Neural Network Framework for Heterogeneous Multi-/Many-core Clusters