Design Exploration of 192-bit Elliptic Curve Adder on StarBridge HC-36 System

Gang Quan, Duncan A. Buell, James P. Davis, Siddaveerasharan Devarkal
Department of Computer Science & Engineering, University of South Carolina

Elliptic Curve Cryptography
• Emerging as a new generation of cryptosystems based on public key cryptography
• No known sub-exponential algorithm solves the elliptic curve discrete logarithm problem
• Smallest key size & highest strength per bit compared to other public key cryptosystems
• Smaller key sizes are suitable for hardware implementation

By Quan, Buell, Davis, and Devarkal 2 P60

NIST standards
• NIST has proposed a specific set of elliptic curves for cryptographic purposes
• Elliptic curves are defined for prime fields GF(p) and binary polynomial fields GF(2^m)
• Prime fields for 192, 224, 256, 384 and 521 bits
• Binary fields for 163, 233, 283, 409 and 571 bits
• Multi-precision arithmetic is required for such long bit-widths

Elliptic Curve Arithmetic
• For a 192-bit operand, a naïve M * P operation involves 191 elliptic curve doublings and 96 elliptic curve additions (ECC Adder)
• ECC Addition
  – Given P1=(x1, y1, z1), P2=(x2, y2, z2), compute P3=(x3, y3, z3) such that P3 = P1 + P2
  – 14 high bit-width modular multiplications
    • 42 high bit-width multi-precision multiplications if using the Montgomery multiplication method

ECC Adder Data Flow Graph

StarBridge High Performance Computing Platform

StarBridge HC-36 System
• 4 Processing Elements (PEs) – Virtex II 6000
• 66 MHz PCI Bus
• PE-PE communication rate – 50 bits/cycle
• Development Environment – Viva

Challenges
Search for the optimal or near-optimal design solution that maximizes the ECC Adder performance under the resource constraints (slices, number of built-in hardware multipliers, communication rate, etc.) of the target architecture (SBS HC36).
The size of the design space can easily exceed 2^120 even with a conservative estimate.

Hierarchical Design Methodology
Rapid and accurate performance/cost evaluation is the key to effective and efficient design space exploration, and the performance/cost of the multipliers is critical to the performance/cost of the ECC Adder.

Evaluation of Timing and Resource Usage of a Multiplier
• Different multiplier implementations
  – Shift-and-Add, Divide-and-Conquer (D&Q), “Broadcast” (BC), etc.
• Performance/cost trade-off
  – Hybrid multiplier
    • A multiplier combining different implementation strategies

Divide & Conquer Multiplier
• Karatsuba-Ofman Algorithm (1962)
  Let $A = A_h 2^{n/2} + A_l$ and $B = B_h 2^{n/2} + B_l$. Then
  $A \times B = t_0 2^{n} + (t_2 - t_1 - t_0)\, 2^{n/2} + t_1$,
  where $t_0 = A_h \times B_h$, $t_1 = A_l \times B_l$, and $t_2 = (A_h + A_l) \times (B_h + B_l)$.

“Broadcast” Multiplier
• Algorithm
  Let $A = a_{k-1} \ldots a_1 a_0$ and $B = b_{k-1} \ldots b_1 b_0$, with $|a_i| = |b_i| = N/k$. Then
  $A \times B = \sum_{i=0}^{k-1} \left( a_i \times (b_{k-1} \ldots b_1 b_0) \right) \ll (i \cdot N/k)$.
• Features
  – Shuffling the partial products for a fully pipelined implementation
  – Given k functional units, each “loop body” can be computed in parallel
  – Easy trade-off of resource usage/speed by selecting k
    • k = N: Shift-and-add (low degree of parallelism, low speed, low resource usage)
    • k = 1, 2, 3, … (small integer): Conventional “block” multiplications (high degree of parallelism, high speed, high resource usage)
  – Good scalability

Example: 192-bit “Broadcast” Multiplier
[Figure: block diagram of the 192-bit broadcast multiplier. Labels: operand words a0 … a5 and b0 … b5 (32 bits each), MUX, six multipliers (X), Operand Select, 64-bit products, Shift_32, registers Ra … Rl, adders (+), MUXes, Ripple Carry, Final Product.]

The Hybrid Multiplier
• A hybrid multiplier is denoted by an integer string, M(N) = {m1, m2, …, mn}
  – mi: the multiplier scheme at the ith level
  – mi = 1: the D&Q scheme is used
  – mi = k (k > 1): the BC scheme with k sub-multipliers is used
  – for multiplications with a bit width of less than 18 bits, the built-in hardware multiplier (18x18) is used

An Example of Hybrid Multiplier
• A 192-bit hybrid multiplier M(192) = {1, 1, 3}
  – At the first level, the D&Q scheme is adopted, which requires three 96-bit multipliers
  – For each of the 96-bit multipliers (the 2nd level), the D&Q scheme is adopted again
  – For each of the 48-bit multipliers (the 3rd level), the BC scheme with three 16-bit multipliers is used
  – The hardware multipliers (18 bit) built into the Virtex II 6000 are used for the 16-bit multiplications

The First Level of M(192)={1,1,3}
D&Q is used, which requires three 96-bit multipliers

The Second Level of M(192)={1,1,3}
D&Q is used again, which requires three 48-bit multipliers

The Third Level of M(192)={1,1,3}
BC scheme with three 16-bit multipliers is used

One “loop” for 48-bit BC

Analytical Cost Estimation for the Hybrid Multiplier
• Area Estimation
  $$S_i(N) = \begin{cases} 3\, S_{i+1}(N/2) + SO_{D\&Q}(N) & \text{if } m_i = 1 \\ k\, S_{i+1}(N/k) + SO_{BC}(N) & \text{if } m_i = k > 1 \end{cases}$$
  – Si(N): area cost for an N-bit multiplier
  – SOD&Q(N): area cost for the overhead in the D&Q implementation of an N-bit multiplier (for control and other units such as adders)
  – SOBC(N): area cost for the overhead in the BC implementation of an N-bit multiplier

It is reasonable to assume that SOD&Q(N) and SOBC(N) are linear in N. Therefore,
  SOD&Q(N) = a × N + const1
  SOBC(N) = b × N + const2
Empirically, we have a = 15, b = 11, and const1 = const2 = 0.
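
The area recurrence can be evaluated mechanically for any scheme string. The following is a minimal Python sketch, not the authors' tool: the function name `estimate_area` and the parameter `base_area` are hypothetical, and because the slides do not state the slice cost of a built-in 18x18 hardware multiplier, `base_area` defaults to 0 as an assumption. The coefficients a = 15 and b = 11 are the empirical values quoted above.

```python
def estimate_area(n_bits, scheme, a=15, b=11, base_area=0):
    """Evaluate the area recurrence S_i(N) for a hybrid multiplier.

    scheme    -- sequence (m1, m2, ...): 1 selects D&Q, k > 1 selects BC
                 with k sub-multipliers (notation from the slides).
    a, b      -- slopes of the overheads SO_D&Q(N) = a*N and SO_BC(N) = b*N,
                 with const1 = const2 = 0 as stated on the slides.
    base_area -- ASSUMPTION: slice cost of one built-in hardware multiplier;
                 the slides do not give it, so it defaults to 0.
    """
    if not scheme:
        # Leaf of the recursion: must fit the 18x18 hardware multiplier.
        if n_bits > 18:
            raise ValueError(f"{n_bits}-bit leaf exceeds the 18x18 multiplier")
        return base_area

    m, rest = scheme[0], tuple(scheme[1:])
    if m == 1:
        # D&Q: three sub-multipliers of half the width.
        parts, count, slope = 2, 3, a
    else:
        # BC: k sub-multipliers of width N/k.
        parts, count, slope = m, m, b

    if n_bits % parts:
        raise ValueError(f"{n_bits} bits cannot be split into {parts} equal parts")
    sub_area = estimate_area(n_bits // parts, rest, a, b, base_area)
    return count * sub_area + slope * n_bits
```

Under these assumptions, `estimate_area(192, (1, 1, 3))` evaluates to 11952 slices. The comparison table reports 12630 as the estimate for the same scheme, so the authors' model presumably includes contributions, such as a non-zero base multiplier cost, that the slides do not spell out; `base_area` is left as a parameter for that reason.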
Analytical Cost Estimation for the Hybrid Multiplier
• Timing estimation
  $$T_i(N) = \begin{cases} T_{i+1}(N/2) + 4\, T_{add}(N + 2) + TO_{D\&Q}(N) & \text{if } m_i = 1 \\ k\, T_{i+1}(N/k) + 2\, T_{add}(N + N/k) + TC_{BC}(k) + T_{BC}(k) & \text{if } m_i = k > 1 \end{cases}$$
  – Ti(N): timing cost for an N-bit multiplier
  – Tadd(N): timing cost for an N-bit addition
  – TOD&Q(N): timing cost for the control in an N-bit D&Q multiplier
  – TCBC(k): timing cost for the control with k base units in the BC implementation
  – TBC(k): timing cost for “loop” overhead with k base units in the BC implementation

• With the given Viva design library, we have, for N < 192, k > 1,
  TOD&Q(N) = 3
  TCBC(k) = 2
  TBC(k) = k

Comparison of Analytical and Actual Results

Hybrid Multiplier    Area (number of slices)    Timing (no. of clock cycles)
(m1 m2 …)            Estimated    Actual        Estimated    Actual
{1 6}                6564         6587          43           43
{6 1}                5508         5714          78           78
{1 1 3}              12630        12140         32           32
{1 3 1}              11046        11106         46           46
{3 1 1}              9990         10257         60           60

Summary
• Rapid estimation of the design cost for the hybrid multiplier architecture
• With the given Viva library, we are able to estimate the cycle count of a hybrid multiplier accurately
• The relative error of the area estimation is within 5%
• Future work
  – Estimation of communication cost
  – Investigation of efficient hierarchical allocation/partition/mapping/scheduling techniques
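
The 5% claim in the summary can be checked directly against the area figures in the comparison table. The short Python sketch below does that arithmetic; the data are copied from the table, while the names `AREA_RESULTS`, `relative_error`, and `area_errors` are hypothetical, and the choice to measure the error relative to the actual slice count is an assumption, since the slides do not say which value is the reference.

```python
# (scheme, estimated slices, actual slices), copied from the comparison table.
AREA_RESULTS = [
    ("{1 6}", 6564, 6587),
    ("{6 1}", 5508, 5714),
    ("{1 1 3}", 12630, 12140),
    ("{1 3 1}", 11046, 11106),
    ("{3 1 1}", 9990, 10257),
]


def relative_error(estimated, actual):
    """Signed relative error of the estimate, measured against the actual value."""
    return (estimated - actual) / actual


def area_errors(results=AREA_RESULTS):
    """Map each scheme string to the relative error of its area estimate."""
    return {scheme: relative_error(est, act) for scheme, est, act in results}


def worst_case(results=AREA_RESULTS):
    """Return (scheme, error) for the largest absolute relative error."""
    errors = area_errors(results)
    scheme = max(errors, key=lambda s: abs(errors[s]))
    return scheme, errors[scheme]
```

With the table's data, the largest deviation is for {1 1 3}: (12630 − 12140) / 12140 ≈ 4.0%, which is consistent with the stated 5% bound. The cycle counts match exactly for all five schemes, so no error computation is needed for timing.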