Elliptic Curve Arithmetic on Reconfigurable Hardware

Siddaveerasharan Devarkal and Duncan A. Buell
Department of Computer Science and Engineering
University of South Carolina, Columbia
{devarkal | buell}@cse.sc.edu

Abstract

We present an implementation of point addition on elliptic curves for an Elliptic Curve Cryptosystem on a reconfigurable hardware platform. We have implemented the design for 14-, 17- and 30-bit-wide point co-ordinates, all of which fit in a single Virtex-II XC2V6000 FPGA chip. We also discuss larger bit-width implementations, for which we must partition the design across multiple FPGAs and choreograph the data movement between them. The kernel operation is multi-precise multiplication, for which we use Karatsuba's divide-and-conquer algorithm; for modular reduction of the double-length product we use the technique proposed by Peter L. Montgomery [6]. The reconfigurable hardware is the Starbridge Systems HC 36m Hypercomputer [5].

1. Introduction

In recent years the need for secure communication over computer networks has grown significantly, especially with the widespread use of a possibly transparent medium like the Internet for online banking and other forms of e-commerce. Such applications use public key cryptosystems like RSA and the Elliptic Curve Cryptosystem (ECC). Elliptic Curve Cryptosystems are emerging as a new generation of cryptosystems based on public key cryptography. They offer the smallest key size and the highest strength per bit of any public key cryptosystem, since there is currently no known sub-exponential time algorithm to solve the elliptic curve discrete logarithm problem. Smaller key sizes make them highly suitable for hardware implementation on FPGAs. The kernel operation in elliptic curve arithmetic is multi-precise multiplication of wide-length operands. Considerable research has been carried out on efficient architectures for such wide-length multiplication [1-4].
Karatsuba's divide-and-conquer multiplier was found to be feasible for up to 256-bit operands on a single Virtex XC2V6000 chip; for anything longer, hybrid multipliers are suggested [1]. The remainder of the paper is organized as follows. Section 2 deals with the NIST-proposed elliptic curves. In Section 3 we discuss the arithmetic over such curves. In Section 4 we describe the reconfigurable hardware used, and the implementation is described in Section 5. Summary and future work come in Section 6.

2. NIST recommended elliptic curves for cryptography

In February 2000 NIST published recommendations for the choice of underlying finite fields and types of curves for cryptographic systems based on elliptic curves [7]. There are two underlying finite fields: prime fields GF(p) and binary polynomial fields GF(2^m). Elliptic curves over prime fields are defined for 192, 224, 256, 384 and 521 bits, and those over binary fields for 163, 233, 283, 409 and 571 bits. NIST suggests randomly generated curves for both prime fields and binary fields. For binary fields NIST also suggests a special curve, called a Koblitz curve, in addition to a randomly generated curve. In this paper we deal only with prime fields; the curve equation over a large prime field can be represented in short Weierstrass form as

    Y^2 = X^3 + AX + B

where the constant A = -3 because this choice leads to an efficient implementation of point doubling [8].

3. Elliptic Curve Arithmetic

For pedagogical purposes we will use the canonical representation of an elliptic curve proposed by NIST, and we will deal only with prime fields defined modulo a large prime p. Such a curve can be written in homogeneous form as

    Y^2 Z = X^3 + A X Z^2 + B Z^3        (1)

for constants A and B. The co-ordinates X, Y, Z and the prime p are each of the widths mentioned above. Figure 1 gives the hierarchy of arithmetic operations involved in the elliptic curve arithmetic.
At the heart of the computations of the NIST standards are the large bit-width multi-precise arithmetic operations. The fundamental operation to be explored is multiplication of a point P = (x, y, z) in the group of the curve by a scalar M. This is done using the additive analog of the standard recursive doubling method for exponentiation. To compute M . P, we write M in binary and process the bits of M sequentially. If we process the bits from right to left, then, as we move left, we double P successively to get 2P, 4P, 8P, and so forth. If the kth bit of M is set, we add the multiple 2^k . P to a running sum initialized to the zero point of the curve. At the end of this process for a 192-bit multiplier, we will have performed 191 doublings and on average 96 additions (analogous to the squarings and multiplications needed for exponentiation) on the curve.

    [Figure 1. Elliptic Curve Arithmetic Hierarchy: multiplication of a point P = (x, y, z) on the elliptic curve by a scalar M, built from elliptic curve point additions and doublings, which in turn are built from addition, subtraction, multiplication and left-shift, all modulo a large integer prime.]

Drilling down to the next layer of the computation, each elliptic curve addition and doubling requires a fixed number of modular multiplications, additions, shifts, and similar basic arithmetic operations. The actual number depends on the way in which the curve is represented; our version uses 8 multiplications and 5 squarings for doubling a point and 12 multiplications and 2 squarings for a general point addition. Usually it is the modular multiplications that dominate the running time, and the running time scales exactly with the number of arithmetic operations needed.
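The right-to-left double-and-add loop just described can be sketched as follows. This is an illustrative software sketch, not the paper's hardware design; `point_add`, `point_double`, and `zero` are placeholders for the curve operations described in the next section.

```python
def scalar_multiply(M, P, point_add, point_double, zero):
    """Compute M * P by right-to-left double-and-add.

    Processes the bits of M from least to most significant: at step k
    the variable P holds 2^k * P, and whenever bit k of M is set that
    multiple is added into a running sum initialized to the zero point.
    """
    result = zero            # running sum, initialized to the zero point
    while M > 0:
        if M & 1:            # bit k of M is set: add in 2^k * P
            result = point_add(result, P)
        P = point_double(P)  # advance to the next power-of-two multiple
        M >>= 1
    return result
```

With integer stand-ins for the group operations (addition for `point_add`, doubling for `point_double`), `scalar_multiply(13, 5, ...)` returns 65, i.e. 13 * 5, which shows the bit-scanning logic.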
In software there is an advantage to treating squaring differently from multiplication, in that pairs of cross-product terms are identical and need only be calculated once, but in hardware that advantage often disappears, especially if one has to factor in the possible cost of implementing hardware both for a multiplication unit and for a squaring unit. With our curve as written in (1), we can add points P1 = (x1, y1, z1) and P2 = (x2, y2, z2) to get P3 = (x3, y3, z3) by computing

    m  = 3x^2 + Az^2
    x3 = 2yz(m^2 - 8xy^2 z)
    y3 = -m^3 + 12m xy^2 z - 8y^4 z^2
    z3 = 8y^3 z^3

if P1 = P2, and

    u  = y1 z2 - y2 z1
    v  = x1 z2 - x2 z1
    x3 = 2v(z1 z2 u^2 - (x1 z2 + x2 z1)v^2)
    y3 = u(3(x1 z2 + x2 z1)v^2 - 2 z1 z2 u^2) - (y1 z2 + y2 z1)v^3
    z3 = 2 z1 z2 v^3

if P1 ≠ P2. In the former case of doubling a point P1, we can compute the triple (x3, y3, z3) with five squarings, eight multiplications, five additions, and five shifts, as in Figure 2. In the latter case of distinct points P1 and P2, we can compute the triple (x3, y3, z3) with two squarings, twelve multiplications, seven additions, and two shifts, as in Figure 3. Addition of two operands results in a sum that is at most one bit longer than the operands, and this can be reduced using one subtraction; the same is true of subtraction and of the shift operation. Multiplication, however, results in a double-length product, and reducing that requires a multi-precise division. Division in hardware is expensive and has to be avoided, so we use Montgomery multiplication instead of regular multiplication and reduction; with this method the reduction is eliminated at the cost of two additional ordinary multiplications and two low-cost shift operations. We assume that squaring and multiplication are of equal cost when implemented in hardware, since if the silicon itself is a major resource constraint, then implementing separate units for squaring and for multiplication is not likely to be beneficial.
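The two sets of formulas above can be checked with a small software model. The prime p = 97 and the curve constants below are toy values of ours, chosen purely for illustration; the actual design targets NIST-size primes and performs its multiplications in Montgomery form.

```python
# Software model of the projective doubling and addition formulas above,
# for the curve y^2 z = x^3 + A x z^2 + B z^3 over GF(p).
P_MOD, A, B = 97, 2, 3  # toy curve: y^2 = x^3 + 2x + 3 (mod 97)

def double_point(pt):
    """Doubling a point P1 = P2, following the first set of formulas."""
    x, y, z = pt
    m = (3 * x * x + A * z * z) % P_MOD
    yz = y * z
    xy2z = x * y * y * z
    x3 = 2 * yz * (m * m - 8 * xy2z) % P_MOD
    y3 = (-m ** 3 + 12 * m * xy2z - 8 * y ** 4 * z * z) % P_MOD
    z3 = 8 * y ** 3 * z ** 3 % P_MOD
    return (x3, y3, z3)

def add_points(p1, p2):
    """General addition of distinct points (requires v != 0)."""
    x1, y1, z1 = p1
    x2, y2, z2 = p2
    u = (y1 * z2 - y2 * z1) % P_MOD
    v = (x1 * z2 - x2 * z1) % P_MOD
    w = (x1 * z2 + x2 * z1) % P_MOD
    t = (y1 * z2 + y2 * z1) % P_MOD
    x3 = 2 * v * (z1 * z2 * u * u - w * v * v) % P_MOD
    y3 = (u * (3 * w * v * v - 2 * z1 * z2 * u * u) - t * v ** 3) % P_MOD
    z3 = 2 * z1 * z2 * v ** 3 % P_MOD
    return (x3, y3, z3)

def on_curve(pt):
    """Check the homogeneous curve equation (1)."""
    x, y, z = pt
    return (y * y * z - x ** 3 - A * x * z * z - B * z ** 3) % P_MOD == 0
```

Starting from the point (3, 6, 1) on the toy curve, doubling yields (15, 14, 79), and adding the two points yields a triple that again satisfies equation (1); since the equation is homogeneous of degree 3, any scalar multiple of a valid projective triple remains on the curve.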
Montgomery multiplication: for modulus N, choose R = 2^k > N and compute R' and N' satisfying

    R R' - N N' = 1.

Given a double-length product T, compute

    m = (T mod R) N' (mod R)
    t = (T + m N) / R.

The mod-R operation is simply choosing the rightmost k bits, and division by R is a right shift by k bits.

    Step 1:  x2 = x^2                        SQUARE
             y2 = y^2                        SQUARE
             z2 = z^2                        SQUARE
             xy = x . y                      MUL
             yz = y . z                      MUL
    Step 2:  Az2 = A . (z2)_1               MUL
             xy2z = (xy)_1 . (yz)_1         MUL
             2x2 = (x2)_1 << 1              SHIFT
             2yz = (yz)_1 << 1              SHIFT
    Step 3:  3x2 = (x2)_1 + (2x2)_2         ADD
             4xy2z = (xy2z)_2 << 2          SHIFT
             8xy2z = (xy2z)_2 << 3          SHIFT
             4y2z2 = ((2yz)_2)^2            SQUARE
    Step 4:  m = (3x2)_3 + (Az2)_2          ADD
             4y4z2 = (y2)_1 . (4y2z2)_3     MUL
             z3 = (2yz)_2 . (4y2z2)_3       MUL
    Step 5:  m2 = ((m)_4)^2                 SQUARE
             8y4z2 = (4y4z2)_4 << 1         SHIFT
    Step 6:  u = (m2)_5 - (8xy2z)_3         ADD
    Step 7:  v = (u)_6 - (4xy2z)_3          ADD
             x3 = (2yz)_2 . (u)_6           MUL
    Step 8:  w = (m)_4 . (v)_7              MUL
    Step 9:  y3 = -(w)_8 - (8y4z2)_5        ADD

    Figure 2. Arithmetic Schedule for Doubling a Point (a subscript gives the step in which an operand was computed)

4. Star Bridge Systems' HC 36m Hypercomputer

We use Star Bridge Systems' (SBS) HC 36m Hypercomputer as our target reconfigurable platform. It consists of seven Field-Programmable Gate Arrays (FPGAs) interfaced in a proprietary manner. The primary building block of the architecture is the processing element (PE), consisting of a Xilinx Virtex XC2V6000 FPGA chip coupled to a set of four DDR RAM modules, each with a storage capacity of 512 MB. The FPGA chip, which forms the primary source of reconfigurable logic, is provided with a 90-bit-wide communication link to each of the memory modules. Four such PEs are arranged in a cross-point connection with 50-bit-wide communication links between each other, forming what is referred to as a "Quad structure", shown in Figure 4 [5]. Another Virtex XC2V6000 FPGA chip serves as a cross-point switch enabling inter-FPGA communication. The Hypercomputer comes with an additional two Virtex XC2V4000 chips serving as a bus controller and a router, respectively.
The Hypercomputer runs on dual Intel Xeon processors linked to the FPGA interface through a 64-bit bi-directional PCI-X bus running at 66 MHz. The HC 36m Hypercomputer comes with a software development tool called Viva. It has a graphical user interface in which you click and drag objects onto an editor referred to as a "sheet". The designs created on the sheet are compiled in the proprietary compiler/synthesizer before the standard Xilinx tools are called for place and route. One innovative feature of Viva is that it can be used to program one or all of the four FPGA chips in the quad.

    Step 1:  x1z2 = x1 . z2                 MUL
             x2z1 = x2 . z1                 MUL
             y1z2 = y1 . z2                 MUL
             y2z1 = y2 . z1                 MUL
             z1z2 = z1 . z2                 MUL
    Step 2:  u = (y1z2)_1 - (y2z1)_1        ADD
             t = (y1z2)_1 + (y2z1)_1        ADD
             v = (x1z2)_1 - (x2z1)_1        ADD
             w = (x1z2)_1 + (x2z1)_1        ADD
    Step 3:  u2 = ((u)_2)^2                 SQUARE
             v2 = ((v)_2)^2                 SQUARE
    Step 4:  u2z1z2 = (z1z2)_1 . (u2)_3     MUL
             v3 = (v)_2 . (v2)_3            MUL
             wv2 = (w)_2 . (v2)_3           MUL
    Step 5:  a = (u2z1z2)_4 - (wv2)_4       ADD
             tv3 = (t)_2 . (v3)_4           MUL
             z1z2v3 = (z1z2)_1 . (v3)_4     MUL
    Step 6:  2a = (a)_5 << 1                SHIFT
             z3 = (z1z2v3)_5 << 1           SHIFT
    Step 7:  c = (wv2)_4 - (2a)_6           ADD
             x3 = (v)_2 . (2a)_6            MUL
    Step 8:  uc = (u)_2 . (c)_7             MUL
    Step 9:  y3 = (uc)_8 - (tv3)_5          ADD

    Figure 3. Arithmetic Schedule for Adding Two Points (a subscript gives the step in which an operand was computed)

    [Figure 4. HC 36m Hypercomputer's Quad Structure]

5. Implementation

In this section we describe the implementation of Elliptic Curve Point Addition (ECCAdd) on SBS's HC 36m Hypercomputer. We first discuss the implementation of small bit-width versions of ECCAdd that fit in a single chip, thus avoiding the complications resulting from moving the design off-chip. Later we discuss longer bit-width implementations, for which the design has to be spread across multiple chips. Addition of two points on our chosen elliptic curve requires 12 multiplications, 2 squarings, 7 additions/subtractions and 2 left-shift operations, all modulo a large prime.
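As a sanity check, the nine-step point-addition schedule above can be transcribed line-for-line into straight-line code. This is a functional sketch of the dataflow only, with one named intermediate per scheduled operation; the modulus is an ordinary parameter here, whereas the hardware performs these multiplications with Montgomery reduction.

```python
def ecc_add_schedule(p1, p2, p):
    """Point addition following the step order of the addition schedule.

    Each assignment below corresponds to one ADD/MUL/SQUARE/SHIFT from
    the schedule; operations within a step are independent and would run
    in parallel in hardware.
    """
    x1, y1, z1 = p1
    x2, y2, z2 = p2
    # Step 1: five multiplications
    x1z2 = x1 * z2 % p
    x2z1 = x2 * z1 % p
    y1z2 = y1 * z2 % p
    y2z1 = y2 * z1 % p
    z1z2 = z1 * z2 % p
    # Step 2: additions/subtractions
    u = (y1z2 - y2z1) % p
    t = (y1z2 + y2z1) % p
    v = (x1z2 - x2z1) % p
    w = (x1z2 + x2z1) % p
    # Step 3: squarings
    u2 = u * u % p
    v2 = v * v % p
    # Step 4: multiplications
    u2z1z2 = z1z2 * u2 % p
    v3 = v * v2 % p
    wv2 = w * v2 % p
    # Step 5
    a = (u2z1z2 - wv2) % p
    tv3 = t * v3 % p
    z1z2v3 = z1z2 * v3 % p
    # Step 6: left shifts (doublings)
    a2 = 2 * a % p          # "2a" in the schedule
    z3 = 2 * z1z2v3 % p
    # Step 7
    c = (wv2 - a2) % p
    x3 = v * a2 % p
    # Step 8
    uc = u * c % p
    # Step 9
    y3 = (uc - tv3) % p
    return (x3, y3, z3)
```

On the toy curve y^2 = x^3 + 2x + 3 (mod 97) used for illustration earlier, adding the points (3, 6, 1) and (15, 14, 79) with this schedule produces a triple that satisfies the curve equation, confirming the schedule agrees with the closed-form addition formulas.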
The arithmetic operations are executed in the order of the steps given in Figure 3. The complete ECCAdd design is built from four building blocks corresponding to modular multiplication, addition, subtraction and left-shift. As said earlier, the arithmetic operation of primary concern is multiplication, and this forms the major bottleneck in the performance of ECCAdd.

5.1 Divide-and-Conquer Multiplier (D&Q)

Before discussing the implementation of elliptic point addition, we look at the implementation of the multi-precise multiplier. We use the Karatsuba-Ofman algorithm [9] for multi-precision multiplication using a divide-and-conquer approach. Unlike the naive divide-and-conquer multiplier, which would require four multiplier sub-units to form the partial products, the Karatsuba-Ofman approach requires only three multipliers. This considerably reduces the number of base multipliers used at larger bit-widths. The base multipliers are the 18-bit hardware multipliers, of which only 144 are available in a Virtex XC2V6000 chip. These hardware multipliers take only one clock cycle and thus provide a considerable improvement in speed.

Karatsuba-Ofman algorithm: if

    A = Ah . 2^(n/2) + Al   and   B = Bh . 2^(n/2) + Bl,

then

    A . B = T0 . 2^n + (T2 - (T1 + T0)) . 2^(n/2) + T1

where

    T0 = Ah Bh
    T1 = Al Bl
    T2 = (Ah + Al)(Bh + Bl).

The D&Q multiplier is recursive, and we designed it bottom-up, starting at 32 bits using three hardware multipliers for the 16-bit multiplications. The multiplier as we have designed it can easily be scaled to 256 bits and above. As expected, the 32-, 64-, 128- and 256-bit D&Q implementations take 3, 9, 27 and 81 hardware multiplier blocks, as seen in Table 1. There is an overhead of six clock cycles, at each level of the hierarchy, for the units other than the multipliers. The 30 ns clock period for the 128- and 256-bit implementations is due to the adder units for the large partial products.
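In software the same recursion can be sketched as follows. This is an illustrative sketch only: the 16-bit software base case and the function name are ours, standing in for the FPGA's one-cycle 18x18 hardware multipliers.

```python
def karatsuba(a, b, n):
    """Multiply integers a and b (each at most n bits wide) using
    three half-size multiplications per level instead of four.

    BASE_BITS plays the role of the hardware-multiplier width at which
    the recursion bottoms out (an assumption for this sketch).
    """
    BASE_BITS = 16
    if n <= BASE_BITS:
        return a * b                 # base case: one hardware multiplier
    half = n // 2
    mask = (1 << half) - 1
    ah, al = a >> half, a & mask     # A = Ah * 2^(n/2) + Al
    bh, bl = b >> half, b & mask     # B = Bh * 2^(n/2) + Bl
    t0 = karatsuba(ah, bh, half)                 # T0 = Ah * Bh
    t1 = karatsuba(al, bl, half)                 # T1 = Al * Bl
    t2 = karatsuba(ah + al, bh + bl, half + 1)   # T2 = (Ah+Al)(Bh+Bl)
    # A * B = T0 * 2^(2*half) + (T2 - T1 - T0) * 2^half + T1
    return (t0 << (2 * half)) + ((t2 - t1 - t0) << half) + t1
```

Note the additions and subtractions of large partial products at each level of the recursion; in the hardware design these become the wide adder units discussed next.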
These big adders can be split into two smaller adders that work under a 15 ns period at the cost of one extra clock cycle.

    D&Q       Slices   MULT18x18   Clock Cycles   Clock Period (ns)
    32-Bit      585        3             7               15
    64-Bit     2485        9            13               15
    128-Bit    8770       27            19               30
    256-Bit   29812       81            25               30

    Table 1. D&Q Resource Utilization

5.2 Small Bit-width ECCAdd

In the next three sub-sections we discuss the implementation of the complete elliptic point addition. For the small bit-width implementations we use as many arithmetic blocks as required, until we run out of physical resources. This simplifies the design considerably and also eases the process of scaling the design to higher bit-widths. We have seen that for our chosen curve a single ECCAdd requires 14 Montgomery multipliers, and each Montgomery multiplier requires 3 D&Q multipliers, so a complete ECCAdd requires 42 D&Q multipliers. A 32-bit D&Q requires 3 hardware multipliers, so a complete ECCAdd on 32-bit operands requires 126 hardware multipliers. Therefore, on a single Virtex XC2V6000 chip, which has 144 hardware multipliers, we can fit at most a 32-bit ECCAdd module. For any bit-width greater than 32 we run out of both slices and hardware multipliers, and the design has to be spread across multiple chips. For the 14- and 17-bit designs, 42 hardware multipliers are used directly, as seen in Table 2. The drastic rise in slice usage from the 17-bit to the 24-bit implementation is due to the overhead that comes with the use of adders in the D&Q multipliers. From this one can imagine the rise in slice usage when we move to operands above 32 bits.

    ECCAdd    Slices   MULT18x18   Clock Cycles   Clock Period (ns)
    14-Bit     2969       42            34               15
    17-Bit     4390       42            34               15
    24-Bit    20649      126           147               20
    30-Bit    23599      126           147               25

    Table 2. ECCAdd Resource Utilization

5.3 Multi-chip Implementation

For a 64-bit D&Q we require nine hardware multipliers, so a 64-bit ECCAdd would require 42 * 9 = 378 hardware multipliers.
Since two chips have only 144 * 2 = 288 hardware multipliers, and we require 378, we have to place the 64-bit ECCAdd on three chips. This leads to the question of how to partition the design across the three chips. If we go by the number of Montgomery multipliers that can fit on a single chip, we see that only five such units can be placed on one FPGA chip. This places the first two steps of Figure 3 on chip 1, the next two steps on chip 2, and the remainder of the design on chip 3. Dataflow is from the host to chip 1, chip 1 to chip 2, chip 2 to chip 3, chip 3 to chip 1, and finally from chip 1 back to the host. The Hypercomputer provides only 50-bit-wide communication between the FPGAs and a 32-bit PCI-X bus between the host and the FPGA architecture. To move any data wider than 50 bits we have to multiplex the data, and this incurs an additional cost. Suppose that, to match the 32-bit PCI-X bus, we pass 32 bits at a time between the FPGAs; the data movement between the chips would then cost 33 clock cycles. Additionally, moving the nine operands (six co-ordinates for the two input points and three co-ordinates for the sum point) from and to the host would cost 18 clock cycles. So a total of 51 clock cycles is required for data movement, excluding the computation time in each of the FPGA chips. This implementation will not require any explicit control circuitry other than that for handling the multiplexing of data between chips.

5.4 Practical Bit-width ECCAdd

We ultimately intend to do the complete elliptic curve arithmetic at the NIST bit-widths of 192, 224, 256, 384 and 521. For a 128-bit ECCAdd we would require 42 * 27 = 1134 hardware multipliers, but we have a total of only 144 * 4 = 576 hardware multipliers across the four FPGAs. Clearly, we run out of hardware multipliers and, possibly, silicon resources. We are therefore forced to re-use the arithmetic units in our long bit-width implementations.
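The multiplier-count arithmetic used throughout this section can be collected into a few lines. This is a back-of-the-envelope sketch using only the counts quoted in the text (14 Montgomery multipliers per ECCAdd, 3 D&Q units per Montgomery multiplier, and a factor-of-3 D&Q recursion starting at 3 multipliers for 32 bits); the function names are ours.

```python
MONTGOMERY_UNITS = 14   # Montgomery multipliers per ECCAdd
DANDQ_PER_MONT = 3      # D&Q multipliers per Montgomery multiplier
MULT_PER_CHIP = 144     # MULT18x18 blocks per Virtex XC2V6000

def mult18x18_for_dandq(bits):
    """Hardware multipliers in one D&Q unit: 3 at 32 bits, tripling
    with each doubling of the operand width (3, 9, 27, 81, ...)."""
    count, width = 3, 32
    while width < bits:
        count *= 3
        width *= 2
    return count

def mult18x18_for_eccadd(bits):
    """Hardware multipliers for a complete ECCAdd at the given width."""
    return MONTGOMERY_UNITS * DANDQ_PER_MONT * mult18x18_for_dandq(bits)
```

This reproduces the counts in the text: 126 multipliers for 32-bit ECCAdd (under one chip's 144), 378 for 64-bit (more than two chips' 288, hence three chips), and 1134 for 128-bit (more than all four chips' 576, hence unit re-use).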
This leads to a number of design questions: how many units of each type of arithmetic operation should be allocated? Where should each of the allocated units be assigned among the four FPGA chips? And how should execution be scheduled on each of the assigned units? We also require complex control circuitry to choreograph the data movement between the arithmetic modules. Clearly we have a number of design alternatives, and we rely on known high-level design methodologies to achieve an optimal architecture.

6. Summary and Future Work

We have implemented a complete elliptic point addition for up to 32-bit operands on a single Virtex-II 6000 FPGA chip. For elliptic point additions on operands greater than 32 bits and up to 64 bits, the design has to be placed on three chips. For anything greater than 64 bits, including the NIST-proposed sizes of 192, 224, 256, 384 and 521, not only do we need to place the design on all four chips but we must also re-use the long-operand arithmetic units. We are currently working on the implementations for operand bit-widths from 32 to 64 bits. We are also exploring the design space for the practical bit-width implementations. Though Karatsuba's divide-and-conquer multiplier offers good performance in speed, it uses a great deal of silicon. We would like a multiplier that runs at speeds comparable to D&Q while taking relatively fewer resources. We are looking into what we refer to as a "hybrid" multiplier, built from D&Q and Broadcast multipliers [1]. Hybrid multipliers are hierarchical in nature, similar to D&Q, but can have either D&Q or Broadcast units at each level. We plan to replace the D&Q multiplier with these hybrid units for the long bit-width implementations.

7. References

[1] Duncan A. Buell, James P. Davis, and Gang Quan, "Reconfigurable computing applied to problems in communication security," Proceedings, MAPLD 2002.
[2] Duncan A. Buell, "Elliptic curves and the NIST standards," Technical Report, Computer Science and Engineering, University of South Carolina, May 2002.
[3] M. Rosner, "Elliptic curve cryptosystems on reconfigurable hardware," Master's Thesis, ECE Dept., Worcester Polytechnic Institute, Worcester, Massachusetts, 1998.
[4] M. Jung, F. Madlener, M. Ernst and S. A. Huss, "A reconfigurable coprocessor for finite field multiplication in GF(2^n)."
[5] http://www.starbridgesystems.com
[6] Peter L. Montgomery, "Modular multiplication without trial division," Mathematics of Computation, 44:519-521, 1985.
[7] NIST, "Recommended elliptic curves for federal government use," July 1999. csrc.nist.gov/csrc/fedstandards.html, csrc.nist.gov/encryption/dss/ecdsa/NISTReCur.pdf, csrc.nist.gov/publications/fips/fips186-2.pdf.
[8] E. Brickell, D. Gordon, K. McCurley, and D. Wilson, "Fast exponentiation with precomputation," Advances in Cryptology - Eurocrypt '92, LNCS 658, 1993, 200-207.
[9] James Ross Goodman, "Energy Scalable Reconfigurable Cryptographic Hardware for Portable Applications," PhD Dissertation, MIT.
