UniCNN: A Pipelined Accelerator Towards Uniformed Computing for CNNs

Published in: International Journal of Parallel Programming

Abstract

Convolutional neural networks (CNNs) have been widely applied to image recognition, face detection, and video analysis because they achieve accuracy close to, or even better than, human-level perception. However, the differing characteristics of convolution layers and fully connected layers pose challenges for implementing CNNs on FPGA platforms, because separate accelerator units must typically be designed to process the whole network. To overcome this problem, this work proposes a pipelined accelerator that provides uniformed computing for convolutional neural networks. For the convolution layer, the accelerator repositions the input features into a matrix on-the-fly as they are stored into the FPGA on-chip buffers, so that the convolution layer can be computed as a matrix multiplication. For the fully connected layer, a batch-based method is used to reduce the required memory bandwidth, and this layer can likewise be computed as a matrix multiplication. A pipelined computation method for matrix multiplication is then proposed to increase throughput and reduce buffer requirements. Experimental results show that the proposed accelerator surpasses CPU and GPU platforms in terms of energy efficiency. The proposed accelerator achieves a throughput of 49.31 GFLOPS using only 198 DSP modules. Compared to state-of-the-art implementations, our accelerator has better hardware utilization efficiency.
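The key idea of repositioning input features into a matrix so convolution becomes a matrix multiplication is commonly known as im2col. A minimal software sketch of this general technique (not the authors' hardware pipeline; function names and shapes are illustrative assumptions) looks like:

```python
import numpy as np

def im2col(x, kh, kw):
    """Reposition input feature patches into matrix columns (im2col).
    x: (C, H, W) input feature map; kh, kw: kernel height/width.
    Returns a matrix of shape (C*kh*kw, out_h*out_w)."""
    C, H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1  # stride 1, no padding for simplicity
    cols = np.empty((C * kh * kw, out_h * out_w))
    idx = 0
    for oh in range(out_h):
        for ow in range(out_w):
            # each receptive-field patch becomes one column
            cols[:, idx] = x[:, oh:oh + kh, ow:ow + kw].ravel()
            idx += 1
    return cols

def conv_as_matmul(x, weights):
    """Convolution layer computed as a single matrix multiplication.
    weights: (M, C, kh, kw) for M output channels."""
    M, C, kh, kw = weights.shape
    _, H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    wmat = weights.reshape(M, C * kh * kw)   # each filter flattened into a row
    cols = im2col(x, kh, kw)                 # each patch flattened into a column
    return (wmat @ cols).reshape(M, out_h, out_w)
```

A fully connected layer is already a matrix-vector product; batching B inputs into a (features, B) matrix turns it into a matrix-matrix product, so the same multiplier datapath serves both layer types while the weights are reused across the batch, which is the bandwidth reduction the abstract refers to.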



Acknowledgements

This work was supported by the NSFC (No. 61379040), Anhui Provincial NSF (No. 1608085QF12), Suzhou Research Foundation (No. SYG201625), CCF-Venustech Hongyan Research Initiative (No. CCF-VenustechRP1026002), Youth Innovation Promotion Association CAS (No. 2017497), and Fundamental Research Funds for the Central Universities (WK2150110003).

Author information


Corresponding author

Correspondence to Chao Wang.


About this article


Cite this article

Sun, F., Wang, C., Gong, L. et al. UniCNN: A Pipelined Accelerator Towards Uniformed Computing for CNNs. Int J Parallel Prog 46, 776–787 (2018). https://doi.org/10.1007/s10766-017-0522-1
