
DLPlib: A Library for Deep Learning Processor

Regular Paper · Journal of Computer Science and Technology

Abstract

Recently, deep learning processors have become one of the most promising solutions for accelerating deep learning algorithms. Currently, the only way to program a deep learning processor is to write assembly instructions by hand, which requires substantial programming effort and yields low productivity. One solution is to integrate the deep learning processor as a new back-end into a prevalent high-level deep learning framework (e.g., the TPU (tensor processing unit) is integrated directly into Tensorflow). However, this prevents other frameworks from benefiting from the programming interface. The alternative approach is to design a framework-independent low-level library for deep learning processors (analogous to cuDNN, the deep learning library for GPUs). In this fashion, the library can be conveniently invoked from high-level programming frameworks and offers greater generality. To allow more deep learning frameworks to gain these benefits, we envision it as a low-level library that can be easily embedded into current high-level frameworks while providing high performance. Three major issues in designing such a library are discussed. The first is the design of the data structures: there should be as few data structures as possible while still supporting all required operations, so that they can be optimized without compromising generality. The second is the selection of operations, which should cover a wide range of operations in order to support various types of networks with high efficiency. The third is the design of the API, which should offer a flexible and user-friendly programming model and be easy to embed into existing deep learning frameworks. Considering all the above issues, we propose DLPlib, a tensor-filter-based library designed specifically for deep learning processors. It contains two major data structures, tensor and filter, and a set of operators including basic neural network primitives and matrix/vector operations. It exposes a descriptor-based API as a C++ interface. The library achieves a speedup of 0.79x compared with hand-written assembly instructions.
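
The abstract names the interface's ingredients (tensor and filter data structures, neural network primitives, a descriptor-based C++ API) but not the actual identifiers, so the following is only a minimal sketch of what such a descriptor-based call pattern could look like. Every name in it (the dlp namespace, TensorDesc, FilterDesc, convForward) is hypothetical, and the convolution body is a placeholder rather than an offload to the deep learning processor.

#include <cstddef>
#include <vector>

namespace dlp {  // hypothetical namespace; DLPlib's real identifiers are not given here

// Descriptor for an n-dimensional data tensor (e.g., NCHW activations).
struct TensorDesc {
    std::vector<std::size_t> dims;  // e.g., {batch, channels, height, width}
};

// Descriptor for a convolution filter (weight tensor).
struct FilterDesc {
    std::vector<std::size_t> dims;  // e.g., {out_channels, in_channels, kernel_h, kernel_w}
};

// One neural network primitive: forward convolution.
// A real library call would be lowered to deep learning processor instructions;
// this stub only zero-fills the output so the sketch compiles and runs.
inline void convForward(const TensorDesc& in, const float* inData,
                        const FilterDesc& filt, const float* filtData,
                        const TensorDesc& out, float* outData) {
    std::size_t n = 1;
    for (std::size_t d : out.dims) n *= d;
    for (std::size_t i = 0; i < n; ++i) outData[i] = 0.0f;
    (void)in; (void)inData; (void)filt; (void)filtData;
}

}  // namespace dlp

int main() {
    // Descriptor-first style: declare the input, filter, and output shapes once,
    // then hand the operator raw data pointers.
    dlp::TensorDesc in{{1, 3, 224, 224}};
    dlp::FilterDesc filt{{64, 3, 7, 7}};
    dlp::TensorDesc out{{1, 64, 218, 218}};  // valid convolution, stride 1

    std::vector<float> x(1 * 3 * 224 * 224), w(64 * 3 * 7 * 7), y(1 * 64 * 218 * 218);
    dlp::convForward(in, x.data(), filt, w.data(), out, y.data());
    return 0;
}

Separating shape descriptors from operator calls mirrors the cuDNN style the abstract points to: descriptors are declared once and reused across invocations, which keeps the set of data structures small while the operator set can still cover different layer configurations.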



Author information

Corresponding author

Correspondence to Hui-Ying Lan.

Electronic supplementary material

ESM 1 (PDF 152 kb)


About this article

Cite this article

Lan, HY., Wu, LY., Zhang, X. et al. DLPlib: A Library for Deep Learning Processor. J. Comput. Sci. Technol. 32, 286–296 (2017). https://doi.org/10.1007/s11390-017-1722-2
