From: Peter-Lawrence.Montgomery
Date: Mon, 8 Jul 1996 12:06:05 +0200
Subject: Mersenne: Using DFT's modulo a prime (long)

Silverman wrote:

> Don't use the weighted Mersenne convolution. Use ordinary
> convolutions mod p_i.   i.e. to compute x^2 mod 2^p-1
> reduce x mod p_i  for i=1,2,3....   and p_i appropriately
> chosen so that product(p_i) > 2^p-1, and for hich arithmetic
> mod p_i is easy.                             ^^^^  which
>
> Then simply do an ordinary FFT in single precision integer arithmetic
> and paste the results back together with the CRT,

        We don't need product(p_i) > 2^p - 1.
We need only a few primes.  And we can use weighted convolutions.

        I glanced at the Crandall-Fagin article on the weighted DFT
(discrete Fourier transform).  Suppose we want x^2 where we are given x
with 0 <= x < 2^p - 1, using a cyclic convolution (i.e., DFT) of length 2^ell.

        For purpose of illustration, suppose p = 197 and ell = 3.  Write

                x =   x0 + x1*2^25 + x2*2^50 + x3*2^74
                    + x4*2^99 + x5*2^124 + x6*2^148 + x7*2^173
where
                0 <= x0, x1, x3, x4, x6 < 2^25
                0 <= x2, x5, x7 < 2^24.

The exponents 0, 25, 50, 74, 99, 124, 148, 173
are ceiling(197*j/8) for j = 0, 1, 2, 3, 4, 5, 6, 7.
[Note.  We can -- and probably should -- use signed digits
for x0, ..., x7.  For example, the range of x0
can be |x0| < = 2^24 rather than 0 <= x0 < 2^25.]

        Introduce R = 2^(197/8) = 2^24.625.  Then

                x =   x0             + x1*2^(3/8)*R
                    + x2*2^(3/4)*R^2 + x3*2^(1/8)*R^3
                    + x4*2^(1/2)*R^4 + x5*2^(7/8)*R^5
                    + x6*2^(1/4)*R^6 + x7*2^(5/8)*R^7

Square this polynomial and reduce modulo R^8 - 1 = 2^p - 1:

                x*x  ==   y0             + y1*2^(3/8)*R
                        + y2*2^(3/4)*R^2 + y3*2^(1/8)*R^3
                        + y4*2^(1/2)*R^4 + y5*2^(7/8)*R^5
                        + y6*2^(1/4)*R^6 + y7*2^(5/8)*R^7   (mod R^8 - 1)
where
                y0 = x0*x0 + 4*x1*x7 + 4*x2*x6 + 4*x3*x5 + 2*x4*x4
                y1 = 2*x0*x1 + 4*x2*x7 + 2*x3*x6 + 4*x4*x5
                y2 = x1*x1 + 2*x0*x2 + 2*x3*x7 + 2*x4*x6 + 2*x5*x5
                y3 = 2*x0*x3 + 4*x1*x2 + 4*x4*x7 + 4*x5*x6
                y4 = 2*x2*x2 + 2*x1*x3 + 2*x0*x4 + 4*x5*x7 + y6*y6
                y5 = 2*x0*x5 + 2*x1*x4 + 2*x2*x3 + 2*x6*x7
                y6 = x3*x3 + 4*x2*x4 + 4*x1*x5 + 2*x0*x6 + 2*x7*x7
                y7 = 2*x0*x7 + 2*x1*x6 + 4*x2*x5 + 2*x3*x4

Using the bounds 0 <= xi < 2^25, we see that 0 <= yi < 16*2^50
for all i (a bound which can probably be improved).
More generally, the yi are integers in the interval
[0, 2 * 2^ell * 4^ceiling(p/2^ell)].
[If we use signed digits, then |yi| <= 16 * 2^48.]

        The {yi*(powers of 2^(1/8))} are the Discrete Fourier Transform
(DFT) of {xi*(powers of 2^(1/8))} with itself.  One can compute this DFT
using an FFT over the complex numbers, a pointwise squaring,
and an inverse FFT.  Another DFT algorithm uses
arithmetic modulo distinct primes q1, q2, ..., qk.  We require

        A)  q1 * q2 * ... * qk > (bound on yi)
                                 (2*(bound on yi) if using signed digits)

        B)  2^(1/2^ell) must exist modulo each qi

        C)  If we use the standard FFT algorithm
            (with modular arithmetic in place of complex arithmetic),
            then we need a (2^ell)-th primitive root of unity.
            This means qi == 1 (mod 2^ell).

For ell = 3 (and 2^ell = 8) we might use q1 = 73 (or 89, 233, 257, ...).
We can check that 5^8 == 2 (mod 73) and 10^4 == -1 (mod 73),
so B) and C) hold.  To compute all yi mod 73 (given the xi), the steps are:

        1)  Reduce all xi mod 73.  Multiply each remainder by the proper
            power of 2^(1/8) == 5 (mod 73).   [These powers can
            be computed once and stored in a table.]

        2)  Take the DFT of the sequence in 1) with itself.

        3)  Multiply the outputs in 2) by powers of 2^(-1/8) mod 73,
            to get {yi mod 73}.

- --------------

        On a machine with 64-bit integers, such as the DEC Alpha,
we might select one q near 2^63.  Using ell = 16
(i.e., DFT of order 2^16) we can let the digits xi of input number x
be as large as 2^23.  The bounds on the output digits yi are

        2 * 2^16 * (2^23 - 1)^2 = 2^63 - 2^39 + 2^16 < q

With 2^16 digits in radix 2^23, one can represent numbers
as large as 2^(23*65536) = 2^1507328.

        If we use two primes q1 and q2, each around 2^63,
then we can use a convolution length 2^15 with radix 2^54.
The output yi are bounded by

        2 * 2^15 * (2^54)^2 = 2^124 < q1 * q2.

This will work for p < 54 * 32768 = 1769472.
The convolutions modulo q1 and modulo q2 can be computed
in parallel if the machine has multiple processors.  Otherwise 
computations can be interleaved to utilize the machine's functional units.
The Chinese Remainder Theorem (CRT) enables us to recover outputs
modulo q1*q2 given the remainders modulo q1 and modulo q2.

                                          Multiplications per q

        Construct xi                      2^ell modular multiplications
                                              by powers of 2^(2^(-ell))

        Forward and inverse convolutions  2^(ell-1)*(ell - 2) + 1 modular
                                          multiplications by roots of unity

        Pointwise product                 2^ell modular squarings

        Divide by power of 2,
            multiply by constant for CRT  2^ell modular multiplications

        CRT                               2^ell double-length products
                                         ------------------------------
                                          2^ell * ell + 2
                                                modular multiplications by
                                                elements taken from tables.
                                          2^ell modular squarings
                                          2^ell double-length products

        The Alpha lacks an integer divide instruction, but one can
compute a modular product with a few simple operations.
It has MULQ (multiply low) and UMULH (unsigned multiply high) instructions
If 0 <= a, b < q, and q * qinv == 1 (mod 2^64), then

          a * b = UMULH(a, b) * 2^64 + MULQ(a, b)
   Let     quot = MULQ(qinv, MULQ(a, b))
   Then           q*quot == q * (qinv * (a * b)) == a * b (mod 2^64)
   So  q * quot = UMULH(q, quot) * 2^64 + MULQ(q, quot)
                = UMULH(q, quot) * 2^64 + MULQ(a, b)
       a * b - q * quot = 2^64 * (UMULH(a, b) - UMULH(q, quot))
       a * b/2^64 == UMULH(a, b) - UMULH(q, a*b*qinv mod 2^64)

When b comes from a table, we can also look up b*qinv (mod 2^64).
Then one MULQ, two UMULH, and a few simple instructions
(such as checking whether which UMULH is larger) will
give us a * b / 2^64 mod q, which is as useful as a * b mod q.
The modular multiplication takes three half-multiplies when one operand
is in a table, and four half-multiplies in the general case.

        With ell = 15 and and 2 q's, we estimate

        2 * (3 * 15 + 4 + 2) * 2^15  = 102 *  32768 ~= 3.34 million

MULQ or UMULH instructions.

        With complex arithmetic, we might try order 2^17 with radix
to achieve p ~= 1.7 million. The output coefficients are bounded
by 2 * 2^17 * (2^13)^2 = 2^44, which is rather close to the 53 bits
of precision available on many machines.  The costs are:

                                                Multiplications

        Compute xi, by multiplying by powers
            of 2^(2^(-ell))                     2^ell real
        Forward transform                       2^(ell-1)*(ell-3) complex
        Pointwise squaring                      2^ell complex
        Inverse transform                       2^(ell-1)*(ell-3) complex
        Divide by power of 2                    2^ell real
                                               ----------------
                                                2^ell*(ell - 2) complex
                                                2 * 2^ell real

Counting a complex multiplication as four real multiplications,
this is 2^ell * (4*ell - 8) real multiplications, which
evaluates to 7.8 million real multiplications when ell = 17.
[This estimate is unduly pessimistic -- it can surely be
improved by knowing that the final output is all-real.]

        We can mix the FFT over the complex numbers and the
FFT modulo a prime.  For example, if we estimate that a
complex FFT will provide 40 bits of accuracy, and a
convolution modulo q ~= 2^63 will give 63 bits of accuracy,
then we can allow each yi to have 40 + 63 = 103 bits.
We might use two FFTs (one complex, one mod q)
each of order 2^15, restricting xi to 2^42,
which allows p <= 42 * 32768 ~= 1.37 million.
Using a shared loop structure for the integer and complex FFTs
will let one intermix integer and floating point computations
for good machine performance.

- ------------

        On a 32-bit machine, only a few q are available if ell is large.
We can use q = 2043904001 if ell <= 15, and q = 3539599361 if ell <= 16.
Another possibility is to choose q == 7 (mod 8);
then 2 is a (2^ell)-th power mod q for every ell >= 1.
The DFT modulo q can be done by Nussbaumer's algorithm
(see discussion in early May archives).
Using four primes near 2^31 and DFTs of order 2^15,
we can use radix 2^54, and p <= 54 * 32768 ~= 1.7 million.