                    Elliptic Curve Arithmetic on Reconfigurable Hardware
                                   Siddaveerasharan Devarkal
                                         Duncan A. Buell
                         Department of Computer Science and Engineering
                             University of South Carolina, Columbia

         We present an implementation of point addition on elliptic curves for an Elliptic Curve
         Cryptosystem on a reconfigurable hardware platform. We have implemented the design for 14, 17
         and 30 bit wide point co-ordinates, all of which fit in a single Virtex-II XC2V6000 FPGA chip.
         We also discuss larger bit-width implementations for which we have to partition the design across
         multiple FPGAs and choreograph the data movement between them. The kernel operation is
         multi-precise multiplication, for which we use Karatsuba's divide-and-conquer algorithm; for
         modular reduction of the double-length product we use the technique proposed by Peter L.
         Montgomery [6]. The reconfigurable hardware is the Starbridge Systems HC 36m Hypercomputer [5].

1. Introduction
In recent years the need for secure communication over computer networks has grown significantly,
especially with the widespread use of an open medium like the Internet for online banking and other
forms of e-commerce. Such applications use public-key cryptosystems like RSA and the Elliptic Curve
Cryptosystem (ECC).

Elliptic Curve Cryptosystems are emerging as a new generation of cryptosystems based on public-key
cryptography. They offer the smallest key size and the highest strength per bit of any public-key
cryptosystem, since there is currently no known sub-exponential-time algorithm for solving the elliptic
curve discrete logarithm problem. Smaller key sizes make them highly suitable for hardware
implementation on FPGAs.

The kernel operation in elliptic curve arithmetic is multi-precise multiplication of wide operands.
Much research has gone into finding an efficient architecture for such wide multiplications [1-4].
Karatsuba's divide-and-conquer multiplier was found to be feasible for up to 256-bit operands on a
single Virtex-II XC2V6000 chip; for anything longer, hybrid multipliers are suggested [1].

The remainder of the paper is organized as follows. Section 2 deals with the NIST-recommended elliptic
curves. In Section 3 we discuss the arithmetic over such curves. In Section 4 we describe the
reconfigurable hardware used, and the implementation is described in Section 5. Summary and future
work come in Section 6.

2. NIST recommended elliptic curves for cryptography
In February 2000 NIST published recommendations for the choice of underlying finite fields and types of
curves for cryptographic systems based on elliptic curves [7]. There are two kinds of underlying finite
field: prime fields GF(p) and binary polynomial fields GF(2^m). Elliptic curves over prime fields are
defined for 192, 224, 256, 384 and 521 bits, and those over binary fields for 163, 233, 283, 409 and
571 bits. NIST suggests randomly generated curves for both prime fields and binary fields. For binary
fields NIST also suggests a special curve, a Koblitz curve, in addition to a randomly generated curve.
In this paper we deal only with prime fields. The curve equation for large prime fields can be written
in short Weierstrass form as Y^2 = X^3 + AX + B, with the constant A = -3 because this choice leads to
an efficient implementation of point doubling [8].
3. Elliptic Curve Arithmetic
For pedagogical purposes we will use the canonical representation of an elliptic curve proposed by
NIST, and we will deal only with prime fields defined modulo a large prime p. Such a curve can be
written in homogeneous form as
                                Y^2 Z = X^3 + AXZ^2 + BZ^3            --------- (1)
for constants A and B. The co-ordinates X, Y, Z and the prime p are each of the bit-widths mentioned
above.

Figure 1 gives the hierarchy of arithmetic operations involved in elliptic curve arithmetic. At the
heart of the computations of the NIST standards are the large bit-width multi-precise arithmetic
operations. The fundamental operation to be explored is multiplication of a point P = (x, y, z) in the
group of the curve by a scalar M. This is done using the additive analog of the standard recursive
doubling method for exponentiation. To compute M . P, we write M in binary and process the bits of M
sequentially. If we process the bits from right to left, then, as we move left, we double P
successively to get 2P, 4P, 8P, and so forth. If the k-th bit of M is set, we add the multiple 2^k . P
to a running sum initialized to the zero point of the curve. At the end of this process for a 192-bit
multiplier, we will have performed 191 doublings and on average 96 additions (analogous to the
squarings and multiplications needed for exponentiation) on the curve.
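The right-to-left double-and-add procedure just described can be sketched as follows. This is an
illustrative sketch only: ordinary integers stand in for curve points, with integer addition and
doubling playing the roles of point addition and point doubling, so that the result is simply M * P;
the function names are ours.

```python
def scalar_mult(M, P, add, double, zero):
    """Right-to-left double-and-add: scan the bits of M from least to most
    significant, doubling P at each step and adding the current multiple
    2^k * P into a running sum whenever bit k of M is set."""
    acc = zero              # running sum, initialized to the zero point
    while M > 0:
        if M & 1:           # bit k of M is set
            acc = add(acc, P)
        P = double(P)       # P becomes 2P, 4P, 8P, ... as we move left
        M >>= 1
    return acc

# With integers standing in for points, M . P is just the product M * P:
print(scalar_mult(13, 5, lambda a, b: a + b, lambda p: 2 * p, 0))  # 65
```

For a 192-bit scalar the loop body runs 192 times, giving the 191 doublings and roughly 96 additions
mentioned above.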

          Multiplication of a point P = (x,y,z) on the elliptic curve by a scalar M

                         Elliptic Curve Point Additions and Doublings

 Addition, Subtraction, Multiplication and Left-Shift (all modulo a large integer prime)

                               Figure 1. Elliptic Curve Arithmetic Hierarchy

Drilling down to the next layer of the computation, each elliptic curve addition and doubling requires a
fixed number of modular multiplications, additions, shifts, and similar basic arithmetic operations. The
actual number depends on the way in which the curve is represented; our version uses 8 multiplications and
5 squarings for doubling a point and 12 multiplications and 2 squarings for a general point addition.
Usually it is the modular multiplications that will dominate the running time, and running time will scale
exactly with the number of arithmetic operations needed. In software, there is an advantage to treating
squaring differently from multiplication, in that pairs of cross product terms are identical and need only be
calculated once, but in hardware that advantage often disappears, especially if one has to factor in the
possible cost of implementing hardware both for a multiplication unit and a squaring unit.

With our curve as written in (1), we can add points P1 = (x1, y1, z1) and P2 = (x2, y2, z2) to get
P3 = (x3, y3, z3) by computing
                 m  = 3x^2 + Az^2
                 x3 = 2yz(m^2 - 8xy^2z)
                 y3 = -m^3 + 12mxy^2z - 8y^4z^2
                 z3 = 8y^3z^3
if P1 = P2 (writing P1 = (x, y, z)), and
                 u  = y1z2 - y2z1
                 v  = x1z2 - x2z1
                 x3 = 2v(z1z2u^2 - (x1z2 + x2z1)v^2)
                 y3 = u(3(x1z2 + x2z1)v^2 - 2z1z2u^2) - (y1z2 + y2z1)v^3
                 z3 = 2z1z2v^3
if P1 ≠ P2.
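These formulas can be transcribed directly into code and sanity-checked on a toy curve: if the inputs
satisfy equation (1), the outputs must as well. The sketch below uses a small illustrative prime
p = 23 and curve parameters of our own choosing, not NIST values; the function names are ours.

```python
def on_curve(P, A, B, p):
    """Check Y^2 Z = X^3 + A X Z^2 + B Z^3 (mod p), equation (1)."""
    x, y, z = P
    return (y * y * z - (x ** 3 + A * x * z * z + B * z ** 3)) % p == 0

def ec_double(P1, A, p):
    """Doubling formulas for P1 = P2, with P1 = (x, y, z)."""
    x, y, z = P1
    m  = 3 * x * x + A * z * z
    x3 = 2 * y * z * (m * m - 8 * x * y * y * z)
    y3 = -m ** 3 + 12 * m * x * y * y * z - 8 * y ** 4 * z * z
    z3 = 8 * y ** 3 * z ** 3
    return (x3 % p, y3 % p, z3 % p)

def ec_add(P1, P2, p):
    """Addition formulas for distinct points P1 != P2."""
    x1, y1, z1 = P1
    x2, y2, z2 = P2
    u, v, w = y1 * z2 - y2 * z1, x1 * z2 - x2 * z1, x1 * z2 + x2 * z1
    x3 = 2 * v * (z1 * z2 * u * u - w * v * v)
    y3 = u * (3 * w * v * v - 2 * z1 * z2 * u * u) - (y1 * z2 + y2 * z1) * v ** 3
    z3 = 2 * z1 * z2 * v ** 3
    return (x3 % p, y3 % p, z3 % p)

# Toy curve Y^2 Z = X^3 + X Z^2 + Z^3 over GF(23); both points lie on it,
# and so do the computed double and sum.
p, A, B = 23, 1, 1
P, Q = (3, 10, 1), (0, 1, 1)
print(on_curve(ec_double(P, A, p), A, B, p))  # True
print(on_curve(ec_add(P, Q, p), A, B, p))     # True
```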

In the former case of doubling a point P1, we can compute the triple (x3, y3, z3) with five squarings,
eight multiplications, five additions, and five shifts, as in Figure 2. In the latter case of distinct
points P1 and P2, we can compute the triple (x3, y3, z3) with two squarings, twelve multiplications,
seven additions, and two shifts, as in Figure 3. (In the schedules, the subscript on a parenthesized
operand gives the step in which that operand was computed.)

Addition of two operands results in a sum that is at most one bit longer than the operands, and this
can be reduced with a single subtraction; the same holds for subtraction and for the shift operation.
Multiplication, however, results in a double-length product, and reducing that requires a multi-precise
division. Division in hardware is expensive and has to be avoided. So we use Montgomery multiplication
instead of ordinary multiplication followed by reduction; with this method the division is eliminated
at the cost of two additional ordinary multiplications and two low-cost shift operations. We assume
that squaring and multiplication are of equal cost when implemented in hardware, since if the silicon
itself is a major resource constraint, then implementing separate units for squaring and for
multiplication is not likely to be beneficial.
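The one-subtraction reduction mentioned above for addition amounts to a single conditional subtraction
of the modulus; a minimal sketch (the helper name is ours):

```python
def mod_add(a, b, p):
    """Modular addition for a, b < p: the sum is at most one bit longer
    than the operands, so one conditional subtraction reduces it."""
    s = a + b
    return s - p if s >= p else s

print(mod_add(20, 15, 23))  # 12
```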

Montgomery Multiplication:
    For modulus N, compute R R' - N N' = 1 for R = 2^k > N.
    Given a double-length product T, compute
         m = (T (mod R)) N' (mod R)
         t = (T + m N) / R
The mod R operation selects the rightmost k bits, and division by R is a right shift by k bits.
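The reduction step can be sketched in software as follows. This is an illustrative sketch with a
function name of our own; Python's three-argument pow computes the modular inverse from which N' is
derived, and a toy modulus stands in for the multi-hundred-bit primes of the NIST curves.

```python
def montgomery_reduce(T, N, k):
    """Montgomery reduction: for R = 2^k > N (N odd) and T < N*R, return
    t = (T + mN)/R, which satisfies t ≡ T * R^(-1) (mod N), using only
    masks and shifts in place of a multi-precise division."""
    R = 1 << k
    N_prime = (-pow(N, -1, R)) % R             # N' from R R' - N N' = 1
    m = ((T & (R - 1)) * N_prime) & (R - 1)    # (T mod R) N' mod R: masking only
    t = (T + m * N) >> k                       # division by R is a right shift
    return t - N if t >= N else t              # at most one final subtraction

# Toy example: reduce a double-length product modulo N = 97 with R = 2^7.
N, k = 97, 7
T = 50 * 60
t = montgomery_reduce(T, N, k)
print((t << k) % N == T % N)  # True: t is T scaled by R^(-1) mod N
```

The factor R^(-1) introduced by each reduction is what costs the two extra ordinary multiplications
mentioned above, since operands must be carried in Montgomery form.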

 Step 1:
      x2 = x^2                    SQUARE
      y2 = y^2                    SQUARE
      z2 = z^2                    SQUARE
      xy = x . y                  MUL
      yz = y . z                  MUL
 Step 2:
      Az2 = A . (z2)1             MUL
      xy2z = (xy)1 . (yz)1        MUL
      2x2 = (x2)1 << 1            SHIFT
      2yz = (yz)1 << 1            SHIFT
 Step 3:
      3x2 = (x2)1 + (2x2)2        ADD
      4xy2z = (xy2z)2 << 2        SHIFT
      8xy2z = (xy2z)2 << 3        SHIFT
      4y2z2 = ((2yz)2)^2          SQUARE
 Step 4:
      m = (3x2)3 + (Az2)2         ADD
      4y4z2 = (y2)1 . (4y2z2)3    MUL
      z3 = (2yz)2 . (4y2z2)3      MUL
 Step 5:
      m2 = ((m)4)^2               SQUARE
      8y4z2 = (4y4z2)4 << 1       SHIFT
 Step 6:
      u = (m2)5 - (8xy2z)3        ADD
 Step 7:
      v = (u)6 - (4xy2z)3         ADD
      x3 = (2yz)2 . (u)6          MUL
 Step 8:
      w = (m)4 . (v)7             MUL
 Step 9:
      y3 = -(w)8 - (8y4z2)5       ADD

                       Figure 2. Arithmetic Schedule for Doubling a Point (Y-coordinate)
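As a sanity check, the doubling schedule above can be transcribed step for step and compared against
the closed-form formulas of Section 3. This is an illustrative software sketch (the variable names and
the toy prime are ours), not the hardware design:

```python
def double_schedule(x, y, z, A, p):
    """The doubling schedule, step by step; all operations mod p."""
    # Step 1
    x2, y2, z2 = x * x % p, y * y % p, z * z % p
    xy, yz = x * y % p, y * z % p
    # Step 2
    Az2    = A * z2 % p
    xy2z   = xy * yz % p
    two_x2 = (x2 << 1) % p           # 2x^2
    two_yz = (yz << 1) % p           # 2yz
    # Step 3
    three_x2 = (x2 + two_x2) % p     # 3x^2
    xy2z_4   = (xy2z << 2) % p       # 4xy^2z
    xy2z_8   = (xy2z << 3) % p       # 8xy^2z
    y2z2_4   = two_yz * two_yz % p   # 4y^2z^2
    # Step 4
    m      = (three_x2 + Az2) % p    # m = 3x^2 + Az^2
    y4z2_4 = y2 * y2z2_4 % p         # 4y^4z^2
    z3     = two_yz * y2z2_4 % p     # 8y^3z^3
    # Step 5
    m2     = m * m % p
    y4z2_8 = (y4z2_4 << 1) % p       # 8y^4z^2
    # Step 6
    u = (m2 - xy2z_8) % p            # m^2 - 8xy^2z
    # Step 7
    v  = (u - xy2z_4) % p
    x3 = two_yz * u % p              # 2yz(m^2 - 8xy^2z)
    # Step 8
    w = m * v % p
    # Step 9
    y3 = (-w - y4z2_8) % p           # -m^3 + 12mxy^2z - 8y^4z^2
    return (x3, y3, z3)

# Agrees with the closed-form doubling formulas on a toy curve over GF(23):
print(double_schedule(3, 10, 1, 1, 23))  # (18, 21, 19)
```

(In the hardware, each step's operations run concurrently and the reductions are Montgomery steps
rather than the `% p` used in this sketch.)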

4. Star Bridge Systems' HC36m Hypercomputer
We use Star Bridge Systems' (SBS) HC 36m Hypercomputer as our target reconfigurable platform. It
consists of seven Field-Programmable Gate Arrays (FPGAs) interfaced in a proprietary manner. The
primary building block of the architecture is the processing element (PE), consisting of a Xilinx
Virtex-II XC2V6000 FPGA chip coupled to a set of four DDR RAM modules, each with a storage capacity of
512MB. The FPGA chip, which forms the primary source of reconfigurable logic, is provided with a 90-bit
wide communication link to each of the memory modules. Four such PEs are arranged in a cross-point
connection with 50-bit wide communication links between each other, forming what is referred to as a
"Quad structure," shown in Figure 4 [5]. Another Virtex-II XC2V6000 FPGA chip serves as a cross-point
switch enabling inter-FPGA communication. The Hypercomputer comes with an additional two Virtex-II
XC2V4000 chips serving as a bus controller and a router respectively. The Hypercomputer runs on dual
Intel Xeon processors linked to the FPGA interface through a 64-bit bi-directional PCI-X bus running at
66MHz.
The HC 36m Hypercomputer comes with a software development tool called Viva. It has a graphical user
interface wherein you click and drag objects onto an editor referred to as a "sheet." The designs
created on the sheet are then compiled in the proprietary compiler/synthesizer before calling the
standard Xilinx tools for place and route. One innovative feature of Viva is that it can be used to
program one or all of the four FPGA chips in the quad.

 Step 1:
      x1z2 = x1 . z2                   MUL
      x2z1 = x2 . z1                   MUL
      y1z2 = y1 . z2                   MUL
      y2z1 = y2 . z1                   MUL
      z1z2 = z1 . z2                   MUL
 Step 2:
      u = (y1z2)1 - (y2z1)1            ADD
      t = (y1z2)1 + (y2z1)1            ADD
      v = (x1z2)1 - (x2z1)1            ADD
      w = (x1z2)1 + (x2z1)1            ADD
 Step 3:
      u2 = ((u)2)^2                    SQUARE
      v2 = ((v)2)^2                    SQUARE
 Step 4:
      u2z1z2 = (z1z2)1 . (u2)3         MUL
      v3 = (v)2 . (v2)3                MUL
      wv2 = (w)2 . (v2)3               MUL
 Step 5:
      a = (u2z1z2)4 - (wv2)4           ADD
      tv3 = (t)2 . (v3)4               MUL
      z1z2v3 = (z1z2)1 . (v3)4         MUL
 Step 6:
      2a = (a)5 << 1                   SHIFT
      z3 = (z1z2v3)5 << 1              SHIFT
 Step 7:
      c = (wv2)4 - (2a)6               ADD
      x3 = (v)2 . (2a)6                MUL
 Step 8:
      uc = (u)2 . (c)7                 MUL
 Step 9:
      y3 = (uc)8 - (tv3)5              ADD

                       Figure 3. Arithmetic Schedule for Adding Two Points (Y-coordinate)

                           Figure 4. HC 36m Hypercomputer’s Quad Structure
5. Implementation
In this section we describe the implementation of Elliptic Curve Point Addition (ECCAdd) on the SBS
HC 36m Hypercomputer. We first discuss the implementation of small bit-width versions of ECCAdd that
fit in a single chip, thus avoiding the complications resulting from moving the design off-chip. Later
we discuss longer bit-width implementations for which the design has to be spread across multiple
chips.

Addition of two points on our chosen elliptic curve requires 12 multiplications, 2 squarings, 7
additions/subtractions and 2 left-shift operations, all modulo a large prime. The arithmetic operations
are executed in the order of the steps given in Figure 3. The complete ECCAdd design is built using
four building blocks corresponding to modular multiplication, addition, subtraction and left-shift. As
noted earlier, the arithmetic operation of primary concern is multiplication, and this forms the major
bottleneck in the performance of ECCAdd.

5.1 Divide-and-Conquer Multiplier (D&Q)
Before discussing the implementation of elliptic point addition we look at the implementation of the
multi-precise multiplier. We use the Karatsuba-Ofman algorithm [9] for multi-precise multiplication
using a divide-and-conquer approach. Unlike the naïve divide-and-conquer multiplier, which would
require four multiplier sub-units to form the partial products, the Karatsuba-Ofman approach requires
only three. This considerably reduces the number of base multipliers used at larger bit-widths. The
base multipliers are the 18-bit hardware multipliers, of which there are only 144 available in a
Virtex-II XC2V6000 chip. These hardware multipliers take only one clock cycle and thus provide a
considerable improvement in speed.

Karatsuba-Ofman Algorithm:
    If A = Ah . 2^(n/2) + Al and B = Bh . 2^(n/2) + Bl,
    then A . B = T0 . 2^n + (T2 - (T1 + T0)) . 2^(n/2) + T1
                      where      T0 = Ah Bh
                                 T1 = Al Bl
                                 T2 = (Ah + Al) . (Bh + Bl)
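A recursive software sketch of the algorithm follows; the 18-bit base case mirrors the hardware
multipliers, and the function name is ours.

```python
def karatsuba(a, b, n):
    """n-bit Karatsuba-Ofman product: three half-width multiplications
    (T0, T1, T2) instead of the naive four."""
    if n <= 18:                                   # base case: one 18-bit hardware multiplier
        return a * b
    half = n // 2
    ah, al = a >> half, a & ((1 << half) - 1)     # A = Ah . 2^(n/2) + Al
    bh, bl = b >> half, b & ((1 << half) - 1)     # B = Bh . 2^(n/2) + Bl
    t0 = karatsuba(ah, bh, half)                  # T0 = Ah Bh
    t1 = karatsuba(al, bl, half)                  # T1 = Al Bl
    t2 = karatsuba(ah + al, bh + bl, half + 1)    # T2 = (Ah + Al)(Bh + Bl)
    # A . B = T0 . 2^n + (T2 - (T1 + T0)) . 2^(n/2) + T1
    return (t0 << (2 * half)) + ((t2 - (t1 + t0)) << half) + t1

print(karatsuba(0xDEADBEEF, 0xCAFEBABE, 32) == 0xDEADBEEF * 0xCAFEBABE)  # True
```

Each level replaces one n-bit product with three products of roughly n/2 bits, which is what gives the
3, 9, 27, 81 counts of base multipliers reported below.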

The D&Q multiplier is recursive, and we designed it bottom-up, starting at 32 bits using three hardware
multipliers for the 16-bit half-width multiplications. The multiplier as we have designed it can easily
be scaled to 256 bits and above. As expected, the 32-, 64-, 128- and 256-bit D&Q implementations take
3, 9, 27 and 81 hardware multiplier blocks, as seen in Table 1. There is an overhead of six clock
cycles, at each level of the hierarchy, for units other than multipliers. The 30ns clock period for the
128- and 256-bit implementations is due to the adder units for the large partial products. These big
adders can be split into two smaller adders that run with a 15ns period at the cost of one extra clock
cycle.

                    D&Q          Slices MULT18x18 Clock Cycles Clock Period (ns)
                    32-Bit        585          3           7          15
                    64-Bit       2485          9          13          15
                   128-Bit       8770         27          19          30
                   256-Bit       29812        81          25          30
                                   Table 1. D&Q Resource Utilization

5.2 Small Bit-width ECCAdd
In the next three sub-sections we discuss the implementation of the complete elliptic point addition.
For small bit-width implementations we use as many arithmetic blocks as required, until we run out of
physical resources. This simplifies the design considerably and also eases the process of scaling the
design to higher bit-widths.

We have seen that for our chosen curve we require 14 Montgomery multipliers for a single ECCAdd, and
each Montgomery multiplier requires 3 D&Q multipliers; the complete ECCAdd therefore requires 42 D&Q
multipliers. A 32-bit D&Q requires 3 hardware multipliers, so a complete ECCAdd for 32-bit operands
requires 126 hardware multipliers. Therefore on a single Virtex-II XC2V6000 chip, which has 144
hardware multipliers, we can fit at most a 32-bit ECCAdd module. For any bit-width greater than 32 we
run out of both slices and hardware multipliers, and the design has to be spread across multiple chips.
For the 14- and 17-bit designs the 42 hardware multipliers are used directly, as seen in Table 2. The
drastic rise in slice usage from the 17-bit to the 24-bit implementation is due to the overhead that
comes with the use of adders in the D&Q multipliers. From this we can imagine the rise in slice usage
when we move to operands above 32 bits.

                  ECCAdd         Slices MULT18x18 Clock Cycles Clock Period (ns)
                   14-Bit        2969         42          34          15
                   17-Bit        4390         42          34          15
                   24-Bit        20649       126         147          20
                   30-Bit        23599       126         147          25
                                 Table 2. ECCAdd Resource Utilization

5.3 Multi-chip Implementation
For a 64-bit D&Q we require nine hardware multipliers, so a 64-bit ECCAdd would require 42 * 9 = 378
hardware multipliers. Since two chips have only 144 * 2 = 288 hardware multipliers, and we require 378,
we have to place the 64-bit ECCAdd on three chips. This leads to the question of how to partition the
design across the three chips. If we go by the number of Montgomery multipliers that can fit on a
single chip, we see that only five such units can be placed on one FPGA chip. This places the first two
steps of Figure 3 on chip-1, the next two steps on chip-2 and the remainder of the design on chip-3.
Dataflow is from the host to chip-1, chip-1 to chip-2, chip-2 to chip-3, chip-3 to chip-1 and finally
from chip-1 back to the host.

The Hypercomputer provides only 50-bit wide communication between the FPGAs and a 32-bit PCI-X bus
between the host and the FPGA architecture. To move any data wider than 50 bits we have to multiplex
the data, and this incurs an additional cost. Suppose, to match the 32-bit PCI-X bus, we pass 32 bits
at a time between the FPGAs; then the data movement between the chips would cost 33 clock cycles.
Additionally, moving the nine operands (six co-ordinates for the two input points and three
co-ordinates for the sum point) from and to the host would cost 18 clock cycles. So a total of 51 clock
cycles is required for data movement, excluding the computation time in each of the FPGA chips. This
implementation does not require any explicit control circuitry other than that for handling the
multiplexing of data between the chips.

5.4 Practical Bit-width ECCAdd
We ultimately intend to do the complete elliptic arithmetic for the NIST bit-widths of 192, 224, 256,
384 and 521. Even a 128-bit ECCAdd would require 42 * 27 = 1134 hardware multipliers, but we have a
total of only 144 * 4 = 576 hardware multipliers on the four FPGAs. Clearly, we run out of hardware
multipliers and, possibly, of silicon resources. We are, therefore, forced to re-use the arithmetic
units for our long bit-width implementations.
This leads to a number of design-related questions: how many units of each type of arithmetic operation
are to be allocated? Where should each of the allocated units be assigned on the four FPGA chips? And
how should the execution on each of these assigned units be scheduled? We also require complex control
circuitry to choreograph the data movement between the arithmetic modules. Obviously, we have a number
of design alternatives, and we rely on known high-level design methodologies to achieve an optimal
design.

6. Summary and Future Work
We have implemented a complete elliptic point addition for up to 32-bit operands on a single Virtex-II
XC2V6000 FPGA chip. For elliptic point additions of operands greater than 32 bits and up to 64 bits the
design has to be placed on three chips. For anything greater than 64 bits, including the NIST proposed
sizes of 192, 224, 256, 384 and 521, not only do we need to place the design on all four chips but we
must also re-use the long-operand arithmetic units. We are currently working on the implementations for
operand bit-widths from 32 to 64 bits. We are also exploring the design space for the practical
bit-width implementations.

Though Karatsuba's divide-and-conquer multiplier offers good performance in speed, it uses a great deal
of silicon resources. We would like a multiplier that runs at speeds comparable to D&Q while taking
relatively fewer resources. We are looking into what we refer to as a "Hybrid" multiplier, built from
D&Q and Broadcast multipliers [1]. Hybrid multipliers are hierarchical in nature, similar to D&Q, but
can have either D&Q or Broadcast units at each level. We plan to replace the D&Q multiplier with these
hybrid units for the long bit-width implementations.

7. References
[1] Duncan A. Buell, James P. Davis, and Gang Quan, "Reconfigurable computing applied to problems in
     communication security," Proceedings, MAPLD 2002.
[2] Duncan A. Buell, "Elliptic curves and the NIST standards," Technical Report, Computer Science and
     Engineering, University of South Carolina, May 2002.
[3] M. Rosner, "Elliptic curve cryptosystems on reconfigurable hardware," Masters Thesis, ECE Dept.,
     Worcester Polytechnic Institute, Worcester, Massachusetts, 1998.
[4] M. Jung, F. Madlener, M. Ernst, and S. A. Huss, "A reconfigurable coprocessor for finite field
     multiplication in GF(2^n)."
[6] Peter L. Montgomery, "Modular multiplication without trial division," Mathematics of Computation,
     44:519-521, 1985.
[7] NIST, "Recommended elliptic curves for federal government use," July 1999.
[8] E. Brickell, D. Gordon, K. McCurley, and D. Wilson, "Fast exponentiation with precomputation,"
     Advances in Cryptology - Eurocrypt '92, LNCS 658, 1993, 200-207.
[9] James Ross Goodman, "Energy Scalable Reconfigurable Cryptographic Hardware for Portable
     Applications," PhD Dissertation, MIT.