Efficient Large Integer Multiplication with Arm SVE Instructions

Authors:

Takuya Edamatsu,

Daisuke Takahashi

HPCAsia '23: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region

Pages 9 - 17

https://doi.org/10.1145/3578178.3578193

Published: 27 February 2023

Abstract

In this study, we implement large integer multiplication with the Arm Scalable Vector Extension (SVE) instructions. SVE is a single instruction, multiple data (SIMD) instruction set for the Arm AArch64 architecture. Because SIMD instructions do not retain the carry generated when partial products are added during large integer multiplication, we use a reduced-radix representation. Furthermore, we develop and implement a multiplication algorithm based on the Basecase method that allows ordinary multiplication instructions to be applied to integers in reduced-radix representation. To evaluate performance, we compare our multiplication implementation on an A64FX processor with the GNU Multiple Precision Arithmetic Library (GMP). The SVE implementation was faster than GMP for multiplication with operands larger than 2,048 bits, with a performance gain of up to 36%. These results suggest that SVE instructions have the potential to be faster than scalar instructions for large integer multiplication, especially for large operands.
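
The reduced-radix idea mentioned in the abstract can be illustrated with a small scalar model. The sketch below is not the authors' SVE implementation. It is a plain Python model of a basecase (schoolbook) multiplication in which each limb holds fewer bits than the lane that stores it, so column sums can accumulate without a carry flag and carries are resolved in one pass at the end. The 28-bit limb width, the 64-bit lane width, and all function names are assumptions chosen for illustration only.

```python
# Illustrative scalar model of reduced-radix basecase multiplication.
# RADIX_BITS, LANE_BITS and every function name here are assumptions made
# for this sketch; they are not taken from the paper.

RADIX_BITS = 28                    # assumed limb width, smaller than the lane
LANE_BITS = 64                     # assumed width of one SIMD lane
MASK = (1 << RADIX_BITS) - 1


def to_limbs(n, count):
    """Split a non-negative integer into `count` limbs, least significant first."""
    limbs = []
    for _ in range(count):
        limbs.append(n & MASK)
        n >>= RADIX_BITS
    assert n == 0, "value does not fit in the requested number of limbs"
    return limbs


def from_limbs(limbs):
    """Recombine limbs (least significant first) into a Python integer."""
    return sum(limb << (RADIX_BITS * i) for i, limb in enumerate(limbs))


def basecase_mul(a, b):
    """Schoolbook product of two limb arrays with deferred carry propagation."""
    acc = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            # Each product is below 2**(2 * RADIX_BITS); the unused high bits
            # of the lane absorb the growth of the column sum, so no carry
            # flag is needed while accumulating.
            acc[i + j] += ai * bj
    # Model check: every column sum would still fit in one lane.
    assert all(c < (1 << LANE_BITS) for c in acc)
    # Single carry-propagation pass back to canonical RADIX_BITS-bit limbs.
    out = []
    carry = 0
    for c in acc:
        c += carry
        out.append(c & MASK)
        carry = c >> RADIX_BITS
    assert carry == 0
    return out


if __name__ == "__main__":
    x = (1 << 2048) - 12345
    y = (1 << 2047) + 987654321
    n_limbs = -(-2049 // RADIX_BITS)   # ceiling division: limbs for 2,049 bits
    product = from_limbs(basecase_mul(to_limbs(x, n_limbs), to_limbs(y, n_limbs)))
    print(product == x * y)
```

With these assumed widths, each column can absorb up to 2^(64 - 56) = 256 partial products before a lane would overflow, which is why the model asserts on the column sums rather than propagating a carry after every addition. A vectorized version would process several columns per instruction; that part is outside the scope of this sketch.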

Index Terms

Efficient Large Integer Multiplication with Arm SVE Instructions
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel algorithms
      1. Vector / streaming algorithms

Published In

HPCAsia '23: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region

February 2023

161 pages

ISBN:9781450398053

DOI:10.1145/3578178

Copyright © 2023 ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

JSPS KAKENHI

Conference

HPC ASIA 2023

HPC ASIA 2023: International Conference on High Performance Computing in Asia-Pacific Region

February 27 - March 2, 2023

Singapore, Singapore

Acceptance Rates

HPCAsia '23 Paper Acceptance Rate 15 of 34 submissions, 44%;

Overall Acceptance Rate 69 of 143 submissions, 48%
