Abstract
Convolutional neural networks (CNNs) are widely used in machine learning applications but are very time-consuming, and most of their execution time is spent in convolutional layers. A common way to implement convolutions is the FFT-based approach, which reduces the arithmetic complexity of convolutions without losing much precision. As the performance of ARMv8 multi-core CPUs improves, they, like Intel x86 CPUs, can be used to run CNNs. In this paper, we present a new parallel FFT-based convolution implementation for ARMv8 multi-core CPUs. The implementation makes efficient use of ARMv8 multi-core CPUs through a series of computation and memory optimizations. Experimental results on two ARMv8 multi-core CPUs show that our implementation delivers much better performance than two existing approaches in most cases.
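As a minimal illustration of the idea behind FFT-based convolution (not the paper's optimized ARMv8 implementation), the convolution theorem lets a 2D convolution be computed as an elementwise product in the frequency domain. The sketch below, using NumPy, zero-pads both operands to the full output size so that the circular convolution computed by the FFT equals the linear one; the direct triple-loop version is included only as a correctness reference:

```python
import numpy as np

def fft_conv2d(x, w):
    """Full 2D linear convolution via the convolution theorem:
    conv(x, w) = IFFT(FFT(x) * FFT(w)), with both inputs
    zero-padded to the full output size."""
    out_h = x.shape[0] + w.shape[0] - 1
    out_w = x.shape[1] + w.shape[1] - 1
    X = np.fft.rfft2(x, s=(out_h, out_w))
    W = np.fft.rfft2(w, s=(out_h, out_w))
    return np.fft.irfft2(X * W, s=(out_h, out_w))

def direct_conv2d(x, w):
    """Reference direct (spatial-domain) full convolution."""
    out = np.zeros((x.shape[0] + w.shape[0] - 1,
                    x.shape[1] + w.shape[1] - 1))
    for i in range(w.shape[0]):
        for j in range(w.shape[1]):
            # Each filter tap shifts and scales the whole input.
            out[i:i + x.shape[0], j:j + x.shape[1]] += w[i, j] * x
    return out
```

For an H×H input and an R×R filter, the direct method costs O(H²R²) multiply-adds per channel pair, while the FFT route costs O(H² log H) plus an elementwise product, which is where the arithmetic-complexity reduction mentioned in the abstract comes from.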
This work was supported by the National Key Research and Development Program of China (No. 2018YFB0204301) and the National Natural Science Foundation of China under Grant Nos. 61602500, 91530324, and 91430218.
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Wang, Q., Li, D., Huang, X., Shen, S., Mei, S., Liu, J. (2020). Optimizing FFT-Based Convolution on ARMv8 Multi-core CPUs. In: Malawski, M., Rzadca, K. (eds) Euro-Par 2020: Parallel Processing. Euro-Par 2020. Lecture Notes in Computer Science(), vol 12247. Springer, Cham. https://doi.org/10.1007/978-3-030-57675-2_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-57674-5
Online ISBN: 978-3-030-57675-2