From: Peter-Lawrence.Montgomery Date: Mon, 8 Jul 1996 12:06:05 +0200 Subject: Mersenne: Using DFT's modulo a prime (long) Silverman wrote: > Don't use the weighted Mersenne convolution. Use ordinary > convolutions mod p_i. i.e. to compute x^2 mod 2^p-1 > reduce x mod p_i for i=1,2,3.... and p_i appropriately > chosen so that product(p_i) > 2^p-1, and for hich arithmetic > mod p_i is easy. ^^^^ which > > Then simply do an ordinary FFT in single precision integer arithmetic > and paste the results back together with the CRT, We don't need product(p_i) > 2^p - 1. We need only a few primes. And we can use weighted convolutions. I glanced at the Crandall-Fagin article on the weighted DFT (discrete Fourier transform). Suppose we want x^2 where we are given x with 0 <= x < 2^p - 1, using a cyclic convolution (i.e., DFT) of length 2^ell. For purpose of illustration, suppose p = 197 and ell = 3. Write x = x0 + x1*2^25 + x2*2^50 + x3*2^74 + x4*2^99 + x5*2^124 + x6*2^148 + x7*2^173 where 0 <= x0, x1, x3, x4, x6 < 2^25 0 <= x2, x5, x7 < 2^24. The exponents 0, 25, 50, 74, 99, 124, 148, 173 are ceiling(197*j/8) for j = 0, 1, 2, 3, 4, 5, 6, 7. [Note. We can -- and probably should -- use signed digits for x0, ..., x7. For example, the range of x0 can be |x0| < = 2^24 rather than 0 <= x0 < 2^25.] Introduce R = 2^(197/8) = 2^24.625. Then x = x0 + x1*2^(3/8)*R + x2*2^(3/4)*R^2 + x3*2^(1/8)*R^3 + x4*2^(1/2)*R^4 + x5*2^(7/8)*R^5 + x6*2^(1/4)*R^6 + x7*2^(5/8)*R^7 Square this polynomial and reduce modulo R^8 - 1 = 2^p - 1: x*x == y0 + y1*2^(3/8)*R + y2*2^(3/4)*R^2 + y3*2^(1/8)*R^3 + y4*2^(1/2)*R^4 + y5*2^(7/8)*R^5 + y6*2^(1/4)*R^6 + y7*2^(5/8)*R^7 (mod R^8 - 1) where y0 = x0*x0 + 4*x1*x7 + 4*x2*x6 + 4*x3*x5 + 2*x4*x4 y1 = 2*x0*x1 + 4*x2*x7 + 2*x3*x6 + 4*x4*x5 y2 = x1*x1 + 2*x0*x2 + 2*x3*x7 + 2*x4*x6 + 2*x5*x5 y3 = 2*x0*x3 + 4*x1*x2 + 4*x4*x7 + 4*x5*x6 y4 = 2*x2*x2 + 2*x1*x3 + 2*x0*x4 + 4*x5*x7 + y6*y6 y5 = 2*x0*x5 + 2*x1*x4 + 2*x2*x3 + 2*x6*x7 y6 = x3*x3 + 4*x2*x4 + 4*x1*x5 + 2*x0*x6 + 2*x7*x7 y7 = 2*x0*x7 + 2*x1*x6 + 4*x2*x5 + 2*x3*x4 Using the bounds 0 <= xi < 2^25, we see that 0 <= yi < 16*2^50 for all i (a bound which can probably be improved). More generally, the yi are integers in the interval [0, 2 * 2^ell * 4^ceiling(p/2^ell)]. [If we use signed digits, then |yi| <= 16 * 2^48.] The {yi*(powers of 2^(1/8))} are the Discrete Fourier Transform (DFT) of {xi*(powers of 2^(1/8))} with itself. One can compute this DFT using an FFT over the complex numbers, a pointwise squaring, and an inverse FFT. Another DFT algorithm uses arithmetic modulo distinct primes q1, q2, ..., qk. We require A) q1 * q2 * ... * qk > (bound on yi) (2*(bound on yi) if using signed digits) B) 2^(1/2^ell) must exist modulo each qi C) If we use the standard FFT algorithm (with modular arithmetic in place of complex arithmetic), then we need a (2^ell)-th primitive root of unity. This means qi == 1 (mod 2^ell). For ell = 3 (and 2^ell = 8) we might use q1 = 73 (or 89, 233, 257, ...). We can check that 5^8 == 2 (mod 73) and 10^4 == -1 (mod 73), so B) and C) hold. To compute all yi mod 73 (given the xi), the steps are: 1) Reduce all xi mod 73. Multiply each remainder by the proper power of 2^(1/8) == 5 (mod 73). [These powers can be computed once and stored in a table.] 2) Take the DFT of the sequence in 1) with itself. 3) Multiply the outputs in 2) by powers of 2^(-1/8) mod 73, to get {yi mod 73}. - -------------- On a machine with 64-bit integers, such as the DEC Alpha, we might select one q near 2^63. Using ell = 16 (i.e., DFT of order 2^16) we can let the digits xi of input number x be as large as 2^23. The bounds on the output digits yi are 2 * 2^16 * (2^23 - 1)^2 = 2^63 - 2^39 + 2^16 < q With 2^16 digits in radix 2^23, one can represent numbers as large as 2^(23*65536) = 2^1507328. If we use two primes q1 and q2, each around 2^63, then we can use a convolution length 2^15 with radix 2^54. The output yi are bounded by 2 * 2^15 * (2^54)^2 = 2^124 < q1 * q2. This will work for p < 54 * 32768 = 1769472. The convolutions modulo q1 and modulo q2 can be computed in parallel if the machine has multiple processors. Otherwise computations can be interleaved to utilize the machine's functional units. The Chinese Remainder Theorem (CRT) enables us to recover outputs modulo q1*q2 given the remainders modulo q1 and modulo q2. Multiplications per q Construct xi 2^ell modular multiplications by powers of 2^(2^(-ell)) Forward and inverse convolutions 2^(ell-1)*(ell - 2) + 1 modular multiplications by roots of unity Pointwise product 2^ell modular squarings Divide by power of 2, multiply by constant for CRT 2^ell modular multiplications CRT 2^ell double-length products ------------------------------ 2^ell * ell + 2 modular multiplications by elements taken from tables. 2^ell modular squarings 2^ell double-length products The Alpha lacks an integer divide instruction, but one can compute a modular product with a few simple operations. It has MULQ (multiply low) and UMULH (unsigned multiply high) instructions If 0 <= a, b < q, and q * qinv == 1 (mod 2^64), then a * b = UMULH(a, b) * 2^64 + MULQ(a, b) Let quot = MULQ(qinv, MULQ(a, b)) Then q*quot == q * (qinv * (a * b)) == a * b (mod 2^64) So q * quot = UMULH(q, quot) * 2^64 + MULQ(q, quot) = UMULH(q, quot) * 2^64 + MULQ(a, b) a * b - q * quot = 2^64 * (UMULH(a, b) - UMULH(q, quot)) a * b/2^64 == UMULH(a, b) - UMULH(q, a*b*qinv mod 2^64) When b comes from a table, we can also look up b*qinv (mod 2^64). Then one MULQ, two UMULH, and a few simple instructions (such as checking whether which UMULH is larger) will give us a * b / 2^64 mod q, which is as useful as a * b mod q. The modular multiplication takes three half-multiplies when one operand is in a table, and four half-multiplies in the general case. With ell = 15 and and 2 q's, we estimate 2 * (3 * 15 + 4 + 2) * 2^15 = 102 * 32768 ~= 3.34 million MULQ or UMULH instructions. With complex arithmetic, we might try order 2^17 with radix to achieve p ~= 1.7 million. The output coefficients are bounded by 2 * 2^17 * (2^13)^2 = 2^44, which is rather close to the 53 bits of precision available on many machines. The costs are: Multiplications Compute xi, by multiplying by powers of 2^(2^(-ell)) 2^ell real Forward transform 2^(ell-1)*(ell-3) complex Pointwise squaring 2^ell complex Inverse transform 2^(ell-1)*(ell-3) complex Divide by power of 2 2^ell real ---------------- 2^ell*(ell - 2) complex 2 * 2^ell real Counting a complex multiplication as four real multiplications, this is 2^ell * (4*ell - 8) real multiplications, which evaluates to 7.8 million real multiplications when ell = 17. [This estimate is unduly pessimistic -- it can surely be improved by knowing that the final output is all-real.] We can mix the FFT over the complex numbers and the FFT modulo a prime. For example, if we estimate that a complex FFT will provide 40 bits of accuracy, and a convolution modulo q ~= 2^63 will give 63 bits of accuracy, then we can allow each yi to have 40 + 63 = 103 bits. We might use two FFTs (one complex, one mod q) each of order 2^15, restricting xi to 2^42, which allows p <= 42 * 32768 ~= 1.37 million. Using a shared loop structure for the integer and complex FFTs will let one intermix integer and floating point computations for good machine performance. - ------------ On a 32-bit machine, only a few q are available if ell is large. We can use q = 2043904001 if ell <= 15, and q = 3539599361 if ell <= 16. Another possibility is to choose q == 7 (mod 8); then 2 is a (2^ell)-th power mod q for every ell >= 1. The DFT modulo q can be done by Nussbaumer's algorithm (see discussion in early May archives). Using four primes near 2^31 and DFTs of order 2^15, we can use radix 2^54, and p <= 54 * 32768 ~= 1.7 million.