Abstract
Many modern mobile processors support new SIMD extensions (e.g. NEON engine) and previous applications (e.g. image processing, cryptography) written in SISD are accelerated by re-writing the previous implementations in SIMD instruction sets. Particularly, integer multiplication and squaring operations are the most expensive in Public Key Cryptography (PKC). Many works have been conducted to reduce the execution timing in NEON instruction set. However, ARM–NEON processor also supports powerful ARM instruction set as well. By exploiting the ARM instruction together with NEON engine, we can achieve further improved performance. After this observation, we introduce new parallel approach for integer multiplication and squaring operations on ARM–NEON processors. Unlike previous implementations, we mix-use both ARM and NEON instructions to hide computation latency for ARM into NEON. Since ARM and NEON modules are separated units, the assignments are successfully issued independently. The integer multiplication and squaring are finely divided into several sub-tasks and the sub-tasks are properly assigned to ARM and NEON in order to balance the workloads. Finally, the proposed implementations outperform the best-known results on the identical ARM–NEON processors by 22.4% and 18.3% for 2048-bit integer multiplication and squaring, respectively.
This work was supported by the Energy Efficiency & Resources Core Technology Program of the Korea Institute of Energy Technology Evaluation and Planning (KETEP), granted financial resource from the Ministry of Trade, Industry & Energy, Republic of Korea. (No. 20152000000170). Hwajeong Seo was supported by the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2017-2014-0-00743) supervised by the IITP (Institute for Information & communications Technology Promotion). Zhi Hu was partially supported by the Natural Science Foundation of China (Grant No. 61602526).
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
When m-bit multiplication is required, one m-bit multiplication is evenly divided into 4 m/2-bit multiplication operations. Among them 3 m/2-bit multiplication is performed in COS method on NEON engine. On ARM processor, m/2-bit multiplication is performed in hybrid-scanning method (width: 64-bit).
- 2.
\(\mathtt{umlal\; a,b,c,d:}\; \{\mathtt{b,a}\} \leftarrow \{\mathtt{b,a}\}+ \mathtt{c} \times \mathtt{d}\).
- 3.
If we define multi-core processing through OpenMP library and execute multiple threads, the performance is enhanced by the number of threads.
References
Bernstein, D.J., Schwabe, P.: NEON crypto. In: Prouff, E., Schaumont, P. (eds.) CHES 2012. LNCS, vol. 7428, pp. 320–339. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33027-8_19
Bos, J.W., Montgomery, P.L., Shumow, D., Zaverucha, G.M.: Montgomery multiplication using vector instructions. In: Lange, T., Lauter, K., Lisoněk, P. (eds.) SAC 2013. LNCS, vol. 8282, pp. 471–489. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-662-43414-7_24
Koziel, B., Jalali, A., Azarderakhsh, R., Jao, D., Mozaffari-Kermani, M.: NEON-SIDH: efficient implementation of supersingular isogeny Diffie-Hellman Key exchange protocol on ARM. In: Foresti, S., Persiano, G. (eds.) CANS 2016. LNCS, vol. 10052, pp. 88–103. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-48965-0_6
Martins, P., Sousa, L.: On the evaluation of multi-core systems with SIMD engines for public-key cryptography. In: 2014 International Symposium on Computer Architecture and High Performance Computing Workshop (SBAC-PADW), pp. 48–53. IEEE (2014)
Martins, P., Sousa, L.: Stretching the limits of programmable embedded devices for public-key cryptography. In: Proceedings of the Second Workshop on Cryptography and Security in Computing Systems, pp. 19–24. ACM (2015)
Pabbuleti, K.C., Mane, D.H., Desai, A., Albert, C., Schaumont, P.: SIMD acceleration of modular arithmetic on contemporary embedded platforms. In: 2013 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–6. IEEE (2013)
Seo, H., Liu, Z., Choi, J., Kim, H.: Multi-precision squaring for public-key cryptography on embedded microprocessors. In: Paul, G., Vaudenay, S. (eds.) INDOCRYPT 2013. LNCS, vol. 8250, pp. 227–243. Springer, Cham (2013). https://doi.org/10.1007/978-3-319-03515-4_15
Seo, H., Liu, Z., Großschädl, J., Choi, J., Kim, H.: Montgomery modular multiplication on ARM-NEON revisited. In: Lee, J., Kim, J. (eds.) ICISC 2014. LNCS, vol. 8949, pp. 328–342. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-15943-0_20
Seo, H., Liu, Z., Großschädl, J., Kim, H.: Efficient arithmetic on ARM-NEON and its application for high-speed RSA implementation. IACR Cryptology ePrint Archive 2015:465 (2015)
Seo, H., Liu, Z., Nogami, Y., Park, T., Choi, J., Zhou, L., Kim, H.: Faster ECC over \(\mathbb{F}_{2^{521}-1}\) (feat. NEON). In: Kwon, S., Yun, A. (eds.) ICISC 2015. LNCS, vol. 9558, pp. 169–181. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-30840-1_11
Seo, H., et al.: Parallel implementations of LEA, revisited. In: Choi, D., Guilley, S. (eds.) WISA 2016. LNCS, vol. 10144, pp. 318–330. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-56549-1_27
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Seo, H., Park, T., Ji, J., Hu, Z., Kim, H. (2018). ARM/NEON Co-design of Multiplication/Squaring. In: Kang, B., Kim, T. (eds) Information Security Applications. WISA 2017. Lecture Notes in Computer Science(), vol 10763. Springer, Cham. https://doi.org/10.1007/978-3-319-93563-8_7
Download citation
DOI: https://doi.org/10.1007/978-3-319-93563-8_7
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-93562-1
Online ISBN: 978-3-319-93563-8
eBook Packages: Computer ScienceComputer Science (R0)