
Optimizing FFT-Based Convolution on ARMv8 Multi-core CPUs

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12247)

Abstract

Convolutional Neural Networks (CNNs) are widely used in machine learning applications but are very time-consuming, and most of their execution time is spent in convolutional layers. A common approach to implementing convolutions is the FFT-based one, which reduces the arithmetic complexity of convolutions without losing much precision. As the performance of ARMv8 multi-core CPUs improves, they, like Intel x86 CPUs, can also be used to run CNNs. In this paper, we present a new parallel FFT-based convolution implementation for ARMv8 multi-core CPUs. The implementation makes efficient use of these CPUs through a series of computation and memory optimizations. Experimental results on two ARMv8 multi-core CPUs demonstrate that our new implementation delivers much better performance than two existing approaches in most cases.
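The FFT-based approach mentioned in the abstract rests on the convolution theorem: convolution in the spatial domain is pointwise multiplication in the frequency domain, which lowers arithmetic complexity for sufficiently large kernels. The following is a minimal NumPy sketch of that idea for illustration only; it is not the paper's optimized ARMv8 implementation, and the function name `fft_conv2d` is our own.

```python
import numpy as np

def fft_conv2d(image, kernel):
    """Full linear 2-D convolution via the convolution theorem:
    zero-pad, transform, multiply pointwise, transform back."""
    out_h = image.shape[0] + kernel.shape[0] - 1
    out_w = image.shape[1] + kernel.shape[1] - 1
    # Padding both operands to the full output size makes the FFT's
    # circular convolution coincide with linear convolution.
    f_img = np.fft.rfft2(image, s=(out_h, out_w))
    f_ker = np.fft.rfft2(kernel, s=(out_h, out_w))
    return np.fft.irfft2(f_img * f_ker, s=(out_h, out_w))

# Cross-check against a direct (naive) convolution on a small input.
rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))
ker = rng.standard_normal((3, 3))

direct = np.zeros((10, 10))
for i in range(8):
    for j in range(8):
        # Scatter each input sample scaled by the kernel.
        direct[i:i + 3, j:j + 3] += img[i, j] * ker

print(np.allclose(fft_conv2d(img, ker), direct))  # True
```

The direct loop costs O(HWK²) multiplications per channel, whereas the FFT route costs O(HW log HW); the paper's contribution lies in making the transforms and the pointwise stage efficient on ARMv8 cores, which this sketch does not attempt.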

Supported by the National Key Research and Development Program of China (No. 2018YFB0204301) and the National Natural Science Foundation of China under grant nos. 61602500, 91530324, and 91430218.



Author information


Corresponding author

Correspondence to Qinglin Wang.



Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Wang, Q., Li, D., Huang, X., Shen, S., Mei, S., Liu, J. (2020). Optimizing FFT-Based Convolution on ARMv8 Multi-core CPUs. In: Malawski, M., Rzadca, K. (eds.) Euro-Par 2020: Parallel Processing. Lecture Notes in Computer Science, vol. 12247. Springer, Cham. https://doi.org/10.1007/978-3-030-57675-2_16


  • DOI: https://doi.org/10.1007/978-3-030-57675-2_16


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-57674-5

  • Online ISBN: 978-3-030-57675-2

  • eBook Packages: Computer Science (R0)
