
Fast Computation of the Exact Number of Magic Series with an Improved Montgomery Multiplication Algorithm

  • Conference paper
Algorithms and Architectures for Parallel Processing (ICA3PP 2020)

Abstract

The numbers of magic series of large orders are computed on Intel Xeon Phi processors with an improved and optimized Montgomery multiplication algorithm. The number of magic series can be efficiently computed by Kinnaes’ formula, of which the most time-consuming element is modular multiplication. We use Montgomery multiplication for faster modular multiplication, and the number of operations is reduced through procedural simplifications. Modular addition, subtraction, and multiplication operations are vectorized by using the following instructions: Intel Advanced Vector Extensions (AVX), Intel Advanced Vector Extensions 2 (AVX2), and Intel Advanced Vector Extensions 512 (AVX-512). The number of magic series of order 8000 is computed on multiple nodes of an Intel Xeon Phi processor with a total execution time of 1806 days. Results are compared with salient studies in the literature to confirm the efficacy of the approach.


References

  1. Biggs, N.L.: The roots of combinatorics. Historia Math. 6(2), 109–136 (1979). https://doi.org/10.1016/0315-0860(79)90074-0


  2. Nordgren, R.P.: On properties of special magic square matrices. Linear Algebra Appl. 437(8), 2009–2025 (2012). https://doi.org/10.1016/j.laa.2012.05.031


  3. Cammann, S.: The evolution of magic squares in China. J. Am. Orient. Soc. 80(2), 116–124 (1960). https://doi.org/10.2307/595587


  4. Xin, G.: Constructing all magic squares of order three. Discrete Math. 308(15), 3393–3398 (2008). https://doi.org/10.1016/j.disc.2007.06.022


  5. Beeler, M.: Appendix 5: The Order 5 Magic Squares (1973). (Privately Published)


  6. Pinn, K., Wieczerkowski, C.: Number of magic squares from parallel tempering Monte Carlo. Int. J. Mod. Phys. C 9(4), 541–546 (1998). https://doi.org/10.1142/S0129183198000443


  7. Trump, W.: Magic Series. http://www.trump.de/magic-squares/magic-series

  8. Beck, M., van Herick, A.: Enumeration of \(4 \times 4\) magic squares. Math. Comput. 80, 617–621 (2011). https://doi.org/10.1090/S0025-5718-10-02347-1


  9. Ripatti, A.: On the number of semi-magic squares of order 6 (2018). arXiv: 1807.02983

  10. Kato, G., Minato, S.: Enumeration of associative magic squares of order 7 (2019). arXiv: 1906.07461

  11. Libis, C., Phillips, J.D., Spall, M.: How many magic squares are there? Math. Mag. 73(1), 57–58 (2000). https://doi.org/10.1080/0025570X.2000.11996804


  12. Kraitchik, M.: Mathematical Recreations, 2nd revised edn. Dover Publications (2006)


  13. Bottomley, H.: Partition and composition calculator. http://www.se16.info/js/partitions.htm

  14. Gerbicz, R.: Robert Gerbicz’s Home Page. https://sites.google.com/site/robertgerbicz

  15. Kinnaes, D.: Calculating exact values of \(N(x, m)\) without using recurrence relations (2013). http://www.trump.de/magic-squares/magic-series/kinnaes-algorithm.pdf

  16. Endo, K.: Private Communication (2019)


  17. Quist, M.: Asymptotic enumeration of magic series (2013). arXiv: 1306.0616

  18. Kinnaes, D.: Private Communication (2019)


  19. Montgomery, P.L.: Modular multiplication without trial division. Math. Comput. 44, 519–521 (1985). https://doi.org/10.1090/S0025-5718-1985-0777282-X


  20. Intel Corporation: Intel 64 and IA-32 Architectures Software Developer’s Manual. https://software.intel.com/en-us/articles/intel-sdm

  21. Koç, Ç.K., Acar, T., Kaliski Jr., B.S.: Analyzing and comparing Montgomery multiplication algorithms. IEEE Micro 16(3), 26–33 (1996). https://doi.org/10.1109/40.502403


  22. Bos, J.W., Montgomery, P.L., Shumow, D., Zaverucha, G.M.: Montgomery multiplication using vector instructions. In: Lange, T., Lauter, K., Lisoněk, P. (eds.) SAC 2013. LNCS, vol. 8282, pp. 471–489. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-662-43414-7_24


  23. Takahashi, D.: Computation of the \(100\) quadrillionth hexadecimal digit of \(\pi \) on a cluster of Intel Xeon Phi processors. Parallel Comput. 75, 1–10 (2018). https://doi.org/10.1016/j.parco.2018.02.002


  24. Dussé, S.R., Kaliski Jr., B.S.: A cryptographic library for the Motorola DSP56000. In: Damgård, I.B. (ed.) EUROCRYPT 1990. LNCS, vol. 473, pp. 230–244. Springer, Heidelberg (1991). https://doi.org/10.1007/3-540-46877-3_21

  25. Walter, C.D.: Montgomery’s multiplication technique: how to make it smaller and faster. In: Koç, Ç.K., Paar, C. (eds.) CHES 1999. LNCS, vol. 1717, pp. 80–93. Springer, Heidelberg (1999). https://doi.org/10.1007/3-540-48059-5_9


  26. OpenMP Architecture Review Boards: OpenMP. https://www.openmp.org

  27. Selberg, A.: An elementary proof of Dirichlet’s theorem about primes in an arithmetic progression. Ann. Math. 50(2), 297–304 (1949). https://doi.org/10.2307/1969454


  28. Trump, W.: Private Communication (2019)


  29. Revol, N., Rouillier, F.: Motivations for an arbitrary precision interval arithmetic and the MPFI library. Reliable Comput. 11(4), 275–290 (2005). https://doi.org/10.1007/s11155-005-6891-y


  30. Adams, W.W., Goldstein, L.J.: Introduction to Number Theory. Prentice-Hall (1976)


  31. Childs, L.N.: A Concrete Introduction to Higher Algebra, 3rd edn. Springer, New York (2009). https://doi.org/10.1007/978-0-387-74725-5


Author information


Corresponding author

Correspondence to Yukimasa Sugizaki.


Appendix Proofs of Algorithms


Theorem 1

Let N be a prime number, r be a prime factor of \(N - 1\), and z be a positive integer such that \(0< z < N\). Then, \(\omega = z^{(N - 1)/r} \bmod N\) is a primitive r-th root of unity in \(\mathbb {Z}/{N}\mathbb {Z}\) if \(\omega \not \equiv 1 \pmod N\).

Proof

Let \(\mathop {\mathrm {ord}}(z)\) be the order of an integer z in \(\mathbb {Z}/{N}\mathbb {Z}\) for prime N; that is, \(\mathop {\mathrm {ord}}(z)\) is the smallest positive integer such that \(z^{\mathop {\mathrm {ord}}(z)}\) is congruent to 1 modulo N [30, 31]. We show that the order of \(\omega \) is r if \(\omega \not \equiv 1 \pmod N\).

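It holds that \(\mathop {\mathrm {ord}}(z^k) = \mathop {\mathrm {ord}}(z)/\gcd (k, \mathop {\mathrm {ord}}(z))\) for any positive integer k [30]. Since Lagrange’s theorem states that \(\mathop {\mathrm {ord}}(z)\) divides \(N - 1\), there exists an integer s which divides \(N - 1\) and satisfies \(\mathop {\mathrm {ord}}(z) = \frac{N - 1}{s}\). Then, it holds that

$$\begin{aligned} \mathop {\mathrm {ord}}(\omega )&= \frac{\mathop {\mathrm {ord}}(z)}{\gcd ((N - 1)/r, \mathop {\mathrm {ord}}(z))} \\&= \frac{(N - 1)/s}{\gcd ((N - 1)/r, (N - 1)/s)} \\&= \frac{r}{\gcd (r, s)}. \end{aligned}$$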

Because \(\gcd (r, s)\) is equal to 1 or r since r is prime, the order of \(\omega \) is 1 or r.

Now, if the order of \(\omega \) is 1, then \(\omega = \omega ^{\mathop {\mathrm {ord}}(\omega )} \equiv 1 \pmod N\). By the contrapositive, if \(\omega \not \equiv 1 \pmod N\), then the order of \(\omega \) is not 1, and hence it is r.
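Theorem 1 suggests a simple search for a primitive r-th root of unity: try candidate values of z until \(z^{(N - 1)/r} \bmod N\) differs from 1. The following is a minimal Python sketch of that idea, not the implementation used in the paper. The function names `multiplicative_order` and `primitive_root_of_unity` are hypothetical, and the code assumes, without checking, that N is prime and r is a prime factor of \(N - 1\).

```python
def multiplicative_order(x, n):
    """Smallest k >= 1 with x**k congruent to 1 mod n; assumes gcd(x, n) == 1."""
    k, y = 1, x % n
    while y != 1:
        y = (y * x) % n
        k += 1
    return k


def primitive_root_of_unity(n, r):
    """Return a primitive r-th root of unity modulo n.

    Assumption (not checked): n is prime and r is a prime factor of n - 1.
    """
    if (n - 1) % r != 0:
        raise ValueError("r must divide n - 1")
    for z in range(2, n):
        omega = pow(z, (n - 1) // r, n)
        if omega != 1:  # by Theorem 1, the order of omega is then exactly r
            return omega
    raise ValueError("no suitable z found")


# Small check: N = 13, r = 3, so the exponent is (13 - 1) / 3 = 4 and 2**4 mod 13 = 3.
omega = primitive_root_of_unity(13, 3)
assert omega == 3
assert multiplicative_order(omega, 13) == 3
```

The brute-force `multiplicative_order` is only there to confirm the order on small inputs; for large N it would be far too slow, which is exactly why the criterion \(\omega \not \equiv 1 \pmod N\) is useful.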

Theorem 2

Replacing the first \(e - 1\) moduli \(\beta \) with \(2 \beta \) does not change the result of Algorithm 4.

Proof

Denote variables after substitution with superscript \(\prime \). We show that \(C_e^\prime = C_e\).

If \(i = 0\), then \(t_0^\prime = a_0 B = t_0\) and hence \(q_0^\prime = \mu t_0^\prime \bmod 2\beta = \mu t_0 \bmod 2\beta \). There are two cases: \(q_0^\prime < \beta \) and \(q_0^\prime \ge \beta \). If \(q_0^\prime < \beta \), then \(q_0^\prime = q_0\), which is the same as in Algorithm 4, so \(C_1^\prime = C_1\). If \(q_0^\prime \ge \beta \), then \(q_0^\prime = q_0 + \beta \) and

$$\begin{aligned} C_1^\prime&= (t_0^\prime + q_0^\prime N)/\beta \\&= (t_0 + q_0 N + \beta N)/\beta \\&= C_1 + N. \end{aligned}$$

Assume that \(C_i^\prime \) is equal to \(C_i\) or \(C_i + N\) where \(1 \le i \le e - 2\).

If \(C_i^\prime = C_i\), then

$$\begin{aligned} t_i^\prime&= C_i^\prime + a_i B \\&= C_i + a_i B \\&= t_i \end{aligned}$$

and hence \(q_i^\prime = \mu t_i^\prime \bmod 2\beta = \mu t_i \bmod 2\beta \). Therefore,

$$\begin{aligned} C_{i + 1}^\prime&= (t_i^\prime + q_i^\prime N)/\beta \\&= {\left\{ \begin{array}{ll} (t_i + q_i N)/\beta = C_{i + 1} &{} \text {if }q_i^\prime < \beta \text { and thus }q_i^\prime = q_i \\ (t_i + q_i N + \beta N)/\beta = C_{i + 1} + N &{} \text {if }q_i^\prime \ge \beta \text { and thus }q_i^\prime = q_i + \beta \end{array}\right. } \end{aligned}$$

On the other hand, if \(C_i^\prime = C_i + N\), then

$$\begin{aligned} t_i^\prime&= C_i^\prime + a_i B \\&= C_i + N + a_i B \\&= t_i + N \end{aligned}$$

and hence

$$\begin{aligned} q_i^\prime&= \mu t_i^\prime \bmod 2\beta \\&= \mu (t_i + N) \bmod 2\beta \\&= \{ (\mu t_i \bmod 2\beta ) + (\mu N \bmod 2\beta ) \} \bmod 2\beta . \end{aligned}$$

Here, \(\mu t_i\) is equal to \(q_i\) or \(q_i + \beta \) modulo \(2\beta \), and \(\mu N\) is equal to \(-1\) or \((-1 + \beta )\) modulo \(2\beta \) since \(\mu = -N^{-1} \bmod \beta \). Therefore, \(q_i^\prime \) becomes

$$\begin{aligned} q_i^\prime = {\left\{ \begin{array}{ll} q_i - 1 &{} \text {if }\mu t_i = q_i \bmod 2 \beta , \; \mu N = -1 \bmod 2 \beta \\ q_i - 1 + \beta &{} \text {if }\mu t_i = q_i \bmod 2 \beta , \; \mu N = (-1 + \beta ) \bmod 2 \beta \\ q_i + \beta - 1 = q_i - 1 + \beta &{} \text {if }\mu t_i = (q_i + \beta ) \bmod 2 \beta , \; \mu N = -1 \bmod 2 \beta \\ q_i + \beta - 1 + \beta = q_i - 1 &{} \text {if }\mu t_i = (q_i + \beta ) \bmod 2\beta , \; \mu N = (-1 + \beta ) \bmod 2\beta \\ \end{array}\right. } \end{aligned}$$

modulo \(2 \beta \). Therefore,

$$\begin{aligned} C_{i + 1}^\prime&= (t_i^\prime + q_i^\prime N)/\beta \\&= {\left\{ \begin{array}{ll} (t_i + N + q_i N - N)/\beta = C_{i + 1} &{} \text {if }q_i^\prime = q_i - 1 \\ (t_i + N + q_i N - N + \beta N)/\beta = C_{i + 1} + N &{} \text {if }q_i^\prime = q_i - 1 + \beta \\ \end{array}\right. } \end{aligned}$$

By mathematical induction, it holds that \(C_{i + 1}^\prime \) is equal to \(C_{i + 1}\) or \(C_{i + 1} + N\) for \(i = 1, 2, \dots , e - 2\).

In the last iteration, \(i = e - 1\), the modulus \(\beta \) is left unchanged, and \(C_{e - 1}^\prime \) is equal to \(C_{e - 1}\) or \(C_{e - 1} + N\). If \(C_{e - 1}^\prime = C_{e - 1}\), the computation is identical to that of Algorithm 4, so \(C_e^\prime = C_e\). If \(C_{e - 1}^\prime = C_{e - 1} + N\), then

$$\begin{aligned} t_{e - 1}^\prime&= C_{e - 1}^\prime + a_{e - 1} B \\&= C_{e - 1} + N + a_{e - 1} B \\&= t_{e - 1} + N \end{aligned}$$

and hence

$$\begin{aligned} q_{e - 1}^\prime&= \mu t_{e - 1}^\prime \bmod \beta \\&= (\mu t_{e - 1} + \mu N) \bmod \beta \\&= (q_{e - 1} - 1) \bmod \beta . \end{aligned}$$

Therefore,

$$\begin{aligned} C_e^\prime&= (t_{e - 1}^\prime + q_{e - 1}^\prime N)/\beta \\&= (t_{e - 1} + N + q_{e - 1} N - N)/\beta \\&= (t_{e - 1} + q_{e - 1} N)/\beta \\&= C_e, \end{aligned}$$

assuming that \(q_{e - 1}^\prime N = q_{e - 1} N - N\) does not overflow.

From \(0 \le C_i < 2 N\), it follows that \(0 \le C_i^\prime < 3 N\). Therefore, to avoid overflow in 64-bit registers of processors, it is required that \(3 N < 2^{64}\). From the prerequisite of Algorithm 4, \(N < \beta ^e\) and hence \(N< 2^{62} = 2^{64}/4 < 2^{64}/3\) when \(e = 2\) and \(\beta = 2^{31}\). Thus, the condition is inherently satisfied.
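The induction in the proof of Theorem 2 can be exercised numerically. Algorithm 4 itself is not reproduced in this appendix, so the loop below is inferred from the recurrences used in the proof (\(t_i = C_i + a_i B\), \(q_i = \mu t_i \bmod \beta \), \(C_{i + 1} = (t_i + q_i N)/\beta \), with \(C_0 = 0\)) and should be read as a sketch under that assumption. The function name `montgomery_product` and the `relaxed` flag are hypothetical. Python integers are unbounded, so the sketch does not model register overflow; it checks only the invariant that the relaxed result equals the original result or exceeds it by N, plus the Montgomery congruence.

```python
import random


def montgomery_product(a, b, n, beta, e, relaxed=False):
    """Digit-serial Montgomery product inferred from the proof of Theorem 2.

    Returns C_e with C_e * beta**e congruent to a * b modulo n (n odd, beta a power of 2).
    With relaxed=True, the first e - 1 reductions use modulus 2*beta instead of beta.
    """
    mu = (-pow(n, -1, beta)) % beta  # mu = -N^{-1} mod beta (Python 3.8+)
    c = 0
    for i in range(e):
        a_i = (a // beta**i) % beta  # i-th base-beta digit of a
        t = c + a_i * b
        modulus = 2 * beta if (relaxed and i < e - 1) else beta
        q = (mu * t) % modulus
        assert (t + q * n) % beta == 0  # the division below is exact
        c = (t + q * n) // beta
    return c


random.seed(0)
beta, e = 1 << 31, 2
for _ in range(1000):
    n = random.randrange(3, 1 << 62, 2)  # odd modulus below beta**e
    a, b = random.randrange(n), random.randrange(n)
    c = montgomery_product(a, b, n, beta, e)
    c_relaxed = montgomery_product(a, b, n, beta, e, relaxed=True)
    assert (c * beta**e - a * b) % n == 0  # Montgomery congruence
    assert c_relaxed - c in (0, n)         # same value, or off by exactly N
```

Here the two results are deliberately allowed to differ by N rather than required to be equal, because exact equality in the last step relies on the non-overflow assumption stated in the proof.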

Theorem 3

Replacing \(s_2 \leftarrow a_1 b_0 + t \bmod \beta \) by \(s_2 \leftarrow a_1 b_0 + t\), and \(s_3 \leftarrow a_1 b_1 + {\lfloor {}{t/\beta }\rfloor {}}\) by \(s_3 \leftarrow a_1 b_1\) does not change the result of Algorithm 5 when \(N < 2^{61}\).

Proof

Denote variables after substitution with superscript \(\prime \). We show that \(u^\prime = u\).

It holds that

$$\begin{aligned} s_2^\prime&= a_1 b_0 + t \\&= (a_1 b_0 + t \bmod \beta ) - t \bmod \beta + t \\&= s_2 + t - t \bmod \beta . \end{aligned}$$

Therefore,

$$\begin{aligned} r^\prime&= s_2^\prime \mu \bmod \beta \\&= (s_2 + t - t \bmod \beta ) \mu \bmod \beta \\&= \{ s_2 \mu \bmod \beta + (t - t \bmod \beta ) \mu \bmod \beta \} \bmod \beta . \end{aligned}$$

Since \(t - t \bmod \beta \) is a multiple of \(\beta \) (subtracting \(t \bmod \beta \) removes the remainder of t divided by \(\beta \)), it holds that \((t - t \bmod \beta ) \mu \bmod \beta = 0\), and hence

$$\begin{aligned} r^\prime&= (s_2 \mu \bmod \beta + 0) \bmod \beta \\&= s_2 \mu \bmod \beta \\&= r. \end{aligned}$$

Furthermore, since

$$\begin{aligned} s_3^\prime&= a_1 b_1 \\&= (a_1 b_1 + {\lfloor {}{t/\beta }\rfloor {}}) - {\lfloor {}{t/\beta }\rfloor {}} \\&= s_3 - {\lfloor {}{t/\beta }\rfloor {}}, \end{aligned}$$

it holds that

$$\begin{aligned} u^\prime&= (r^\prime N_0 + s_2^\prime )/\beta + r^\prime N_1 + s_3^\prime \\&= (r N_0 + s_2 + t - t \bmod \beta )/\beta + r N_1 + s_3 - {\lfloor {}{t/\beta }\rfloor {}} \\&= (r N_0 + s_2 + t - t \bmod \beta )/\beta + r N_1 + s_3 - (t - t \bmod \beta )/\beta \\&= (r N_0 + s_2)/\beta + r N_1 + s_3 \\&= u. \end{aligned}$$
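The identity \(u^\prime = u\) can also be checked numerically. Algorithm 5 is not reproduced in this appendix, so the sketch below models only the quantities that appear in the proof (\(s_2\), \(s_3\), r, and u) and treats t as an arbitrary non-negative integer; that treatment, the function name `final_step`, and the `simplified` flag are assumptions made for illustration. Because Python integers are unbounded, the condition \(N < 2^{61}\), which presumably guards against overflow in machine words, is not exercised here.

```python
import random


def final_step(a1, b0, b1, t, n, beta, simplified):
    """Compute u from the proof of Theorem 3, with or without the simplification.

    N_0 and N_1 are the low and high base-beta digits of n; mu = -N^{-1} mod beta.
    """
    mu = (-pow(n, -1, beta)) % beta  # requires Python 3.8+ and odd n
    n0, n1 = n % beta, n // beta
    if simplified:
        s2 = a1 * b0 + t          # full t instead of t mod beta
        s3 = a1 * b1              # carry floor(t / beta) dropped
    else:
        s2 = a1 * b0 + t % beta
        s3 = a1 * b1 + t // beta
    r = (s2 * mu) % beta
    assert (r * n0 + s2) % beta == 0  # the division below is exact
    return (r * n0 + s2) // beta + r * n1 + s3


random.seed(0)
beta = 1 << 31
for _ in range(1000):
    n = random.randrange(3, 1 << 61, 2)  # odd modulus
    a1, b0, b1 = (random.randrange(beta) for _ in range(3))
    t = random.randrange(1 << 63)        # hypothetical range for t
    u = final_step(a1, b0, b1, t, n, beta, simplified=False)
    u_prime = final_step(a1, b0, b1, t, n, beta, simplified=True)
    assert u_prime == u
```

The simplification removes one masking operation and one shift-and-add from the inner computation, which matches the abstract's remark that the number of operations is reduced through procedural simplifications.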


Copyright information

© 2020 Springer Nature Switzerland AG


Cite this paper

Sugizaki, Y., Takahashi, D. (2020). Fast Computation of the Exact Number of Magic Series with an Improved Montgomery Multiplication Algorithm. In: Qiu, M. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2020. Lecture Notes in Computer Science, vol 12453. Springer, Cham. https://doi.org/10.1007/978-3-030-60239-0_25
