
Design Exploration of 192-bit Elliptic Curve Adder on StarBridge HC-36 System
Gang Quan, Duncan A. Buell, James P. Davis, Siddaveerasharan Devarkal




 Department of Computer Science & Engineering, University of South Carolina
              Elliptic Curve Cryptography

   • Emerging as a new generation of public-key cryptosystems
   • No known sub-exponential algorithm solves the elliptic curve
     discrete logarithm problem
   • Smallest key size & highest strength per bit compared to
     other public key cryptosystems
   • Smaller key sizes are well suited to hardware implementation




By Quan, Buell, Davis, and Devarkal   2                         P60
                              NIST standards
   • NIST has proposed a specific set of elliptic curves for
     cryptographic use
   • Elliptic curves are defined for prime fields GF(p) and binary
     polynomial fields GF(2^m)
   • Prime fields for 192, 224, 256, 384 and 521 bits
   • Binary fields for 163, 233, 283, 409 and 571 bits
   • Multi-precision arithmetic is required for such long bit-widths




                    Elliptic Curve Arithmetic
   •   For a 192-bit operand, a naïve M * P operation (double-and-add) involves 191
       elliptic curve doublings and, on average, about 96 elliptic curve additions (ECC
       Adder)
   •   ECC Addition
         –   Given P1=(x1, y1, z1) and P2=(x2, y2, z2), compute P3=(x3, y3, z3) such that
             P3 = P1 + P2
         –   14 high bit-width modular multiplications
               •   42 high bit-width multi-precision multiplications if using the Montgomery
                   multiplication method (three per modular multiplication)
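The factor of three between modular and multi-precision multiplications can be illustrated with a minimal Python sketch of Montgomery multiplication. The function names, the choice R = 2^192, and the use of the NIST P-192 prime as the modulus are assumptions made for illustration; the slides do not describe the hardware datapath at this level.

```python
# NIST P-192 prime, used here only as a convenient 192-bit odd modulus (assumption).
P192 = 2**192 - 2**64 - 1
NBITS = 192  # R = 2**NBITS

def montgomery_constant(p, nbits):
    """Return -p^(-1) mod R, precomputed once per modulus."""
    r = 1 << nbits
    return (-pow(p, -1, r)) % r

def to_mont(x, p, nbits):
    """Map x to its Montgomery form x * R mod p."""
    return (x << nbits) % p

def mont_mul(x, y, p, nbits, p_neg_inv):
    """Return x * y * R^(-1) mod p using three wide multiplications."""
    mask = (1 << nbits) - 1
    t = x * y                              # multiplication 1: full product
    m = ((t & mask) * p_neg_inv) & mask    # multiplication 2: low half only
    u = (t + m * p) >> nbits               # multiplication 3: m * p, exact shift
    return u - p if u >= p else u          # single conditional subtraction
```

Multiplying two values in Montgomery form and then calling `mont_mul(result, 1, ...)` converts the product back to an ordinary residue.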




ECC Adder Data Flow Graph
StarBridge High Performance Computing Platform
                StarBridge HC-36 System
   • 4 Processing Elements
         – Each is a Virtex II 6000 FPGA
   • 66 MHz PCI Bus
   • PE-PE communication rates
         – 50 bits/cycle
   • Development Environment
         – Viva




                                      Challenges

            Find an optimal or near-optimal design that
         maximizes ECC Adder performance under the
         resource constraints (slices, number of built-in
         hardware multipliers, communication rate, etc.) of
         the target architecture (SBS HC-36).


         The size of the design space can easily exceed 2^120 even
         with a conservative estimate.

Hierarchical Design Methodology
  Rapid and accurate performance/cost
  evaluation is the key to effective and
  efficient design space exploration, and
  the performance/cost of the multipliers
  is critical to the performance/cost of the
  ECC Adder.


           Evaluation of Timing and Resource
                 Usage of a Multiplier
   • Different multiplier implementations
         – Shift-and-Add, Divide-and-Conquer (D&Q),
           “Broadcast” (BC), etc.
   • Performance/cost trade-off
         – Hybrid multiplier
              • A multiplier combining different implementation
                strategies




               Divide & Conquer Multiplier
   • Karatsuba-Ofman Algorithm (1962)
           Let $A = A_h \cdot 2^{n/2} + A_l$, $B = B_h \cdot 2^{n/2} + B_l$, then

           $A \times B = t_0 \cdot 2^{n} + (t_2 - t_1 - t_0) \cdot 2^{n/2} + t_1$, where

           $t_0 = A_h \times B_h$, $t_1 = A_l \times B_l$, and $t_2 = (A_h + A_l) \times (B_h + B_l)$



                      “Broadcast” Multiplier
• Algorithm
       Let $A = a_{k-1} \ldots a_1 a_0$, $B = b_{k-1} \ldots b_1 b_0$, $|a_i| = |b_i| = \lceil N/k \rceil$. Then
       $A \times B = \sum_i \left( a_i \times (b_{k-1} \ldots b_1 b_0) \ll (i \times \lceil N/k \rceil) \right)$.
• Features
     – Shuffling the partial product for fully pipelined implementation
     – Given k functional units, each “loop body” can be computed in
       parallel
     – Easy tradeoff of resource usage/speed by selecting k
           • k=N: Shift-and-add (low degree of parallelism, low speed, low resource
             usage)
           • k = 1, 2, 3, … (small integer) : Conventional “block” multiplications (high
             degree of parallelism, high speed, high resource usage)
     – Good scalability
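The “broadcast” scheme can be modeled in software as follows. This is a behavioral sketch only: the function name is hypothetical, the k sub-products of each loop body are computed sequentially here although the slides describe them as running on k functional units in parallel, and pipelining is not modeled.

```python
def broadcast_multiply(a, b, n_bits, k):
    """Multiply two n_bits-wide non-negative integers split into k words each."""
    w = -(-n_bits // k)                    # word width, ceil(N / k)
    mask = (1 << w) - 1
    b_words = [(b >> (j * w)) & mask for j in range(k)]
    product = 0
    for i in range(k):                     # one "loop body" per word a_i
        a_i = (a >> (i * w)) & mask        # operand select: a_i is broadcast
        partials = [a_i * b_j for b_j in b_words]   # k word-sized products
        row = sum(p << (j * w) for j, p in enumerate(partials))
        product += row << (i * w)          # shift by i * ceil(N / k)
    return product
```

With `n_bits=192` and `k=6` the words are 32 bits wide; with `k` equal to `n_bits` the words are single bits and the loop degenerates to shift-and-add.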
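              Example: 192-bit “Broadcast” Multiplier
[Figure: datapath of the 192-bit BC multiplier. An operand-select MUX picks one 32-bit word from a0 ... a5 and feeds it to six multipliers (X), one per word b5 ... b0. The 64-bit partial products are held in registers Ra ... Rf and combined through adders, MUXes, Shift_32 units and ripple-carry stages (registers Rg ... Rl) into the final product.]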




                      The Hybrid Multiplier
• A hybrid multiplier is denoted by an integer
  string, M(N) = {m1, m2, …, mn}
      – mi: the multiplier scheme at the i-th level
      – mi = 1: the D&Q scheme is used
      – mi = k (k>1): the BC scheme with k
        sub-multipliers is used
      – for multiplications with bit width less than 18 bits,
        the built-in hardware multiplier (18x18) is used



                      An Example of Hybrid
                           Multiplier
   • A 192-bit hybrid multiplier M(192)={1, 1, 3}
         – At the first level, the D&Q scheme is adopted,
           which requires three 96-bit multipliers
         – For each of the 96-bit multipliers (the 2nd
           level), the D&Q scheme is adopted again
         – For each of the 48-bit multipliers (the 3rd
           level), the BC scheme with three 16-bit
           multipliers is used
         – The hardware multipliers (18-bit) built into the
           Virtex II 6000 are used for the 16-bit
           multiplications
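The decomposition described above can be traced mechanically. The sketch below is an illustration with hypothetical function names; it follows the slide's convention that a D&Q level halves the operand width and instantiates three sub-multipliers, and that a BC level with k units divides the width by k and instantiates k sub-multipliers.

```python
def level_widths(n_bits, scheme):
    """Operand width at each level of a hybrid multiplier M(n_bits) = scheme."""
    widths = [n_bits]
    for m in scheme:
        parts = 2 if m == 1 else m         # D&Q halves, BC splits into k words
        n_bits = -(-n_bits // parts)       # ceiling division
        widths.append(n_bits)
    return widths

def leaf_multiplier_count(scheme):
    """Number of built-in multipliers instantiated at the last level."""
    count = 1
    for m in scheme:
        count *= 3 if m == 1 else m        # D&Q needs three sub-multipliers
    return count
```

For M(192) = {1, 1, 3}, `level_widths(192, [1, 1, 3])` walks through widths 192, 96, 48 and 16, and `leaf_multiplier_count([1, 1, 3])` gives 3 x 3 x 3 = 27 leaf multipliers.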


    The First Level of M(192)={1,1,3}




D&Q is used which requires three 96-bit multipliers
    The Second Level of M(192)={1,1,3}




D&Q is used again which requires three 48-bit multipliers
   The Third Level of M(192)={1,1,3}




BC scheme with three 16-bit multipliers is used
One “loop” for 48-bit BC
          Analytical Cost Estimation for the
                  Hybrid Multiplier
   • Area Estimation

     $$S_i(N) = \begin{cases} 3\,S_{i+1}(\lceil N/2 \rceil) + SO_{D\&Q}(N) & \text{if } m_i = 1 \\ k\,S_{i+1}(\lceil N/k \rceil) + SO_{BC}(N) & \text{if } m_i = k > 1 \end{cases}$$

         – Si(N): area cost for N-bit multiplier
         – SOD&Q(N): area cost for the overhead in D&Q
           implementation of N-bit multiplier (for control
           and other units such as adders)
         – SOBC(N): area cost for the overhead in BC
           implementation of N-bit multiplier
    It is reasonable to assume that SOD&Q(N) and SOBC(N)
    are linear in N. Therefore,

    SOD&Q(N) = a × N + const1
    SOBC(N) = b × N + const2

    Empirically, we have a = 15, b = 11, and const1 =
    const2 = 0.
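A minimal sketch of the recursive area estimate, using the empirical overhead coefficients a = 15 and b = 11. The function names are hypothetical, and `base_area` (the slice cost charged for one built-in multiplier at the leaf) is a placeholder parameter: the slides do not state that value, so the absolute numbers are not expected to match the published estimates.

```python
def ceil_div(n, d):
    return -(-n // d)

def area_estimate(n_bits, scheme, base_area, a=15, b=11):
    """Estimated slices for an n_bits-wide hybrid multiplier M(n_bits) = scheme."""
    if not scheme:
        return base_area                   # leaf: built-in multiplier (assumed cost)
    m, rest = scheme[0], scheme[1:]
    if m == 1:
        # D&Q: three half-width sub-multipliers plus overhead a * N
        sub = area_estimate(ceil_div(n_bits, 2), rest, base_area, a, b)
        return 3 * sub + a * n_bits
    # BC: k sub-multipliers of width ceil(N / k) plus overhead b * N
    sub = area_estimate(ceil_div(n_bits, m), rest, base_area, a, b)
    return m * sub + b * n_bits
```

Calling it with `base_area=0` isolates the overhead terms, which makes it easy to see how much of an estimate comes from control and adders rather than from the multipliers themselves.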




               Analytical Cost Estimation for the
                       Hybrid Multiplier
   • Timing estimation

     $$T_i(N) = \begin{cases} T_{i+1}(\lceil N/2 \rceil) + 4\,T_{add}(N+2) + TO_{D\&Q}(N) & \text{if } m_i = 1 \\ k\,T_{i+1}(\lceil N/k \rceil) + 2\,T_{add}(N + \lceil N/k \rceil) + TC_{BC}(k) + T_{BC}(k) & \text{if } m_i = k > 1 \end{cases}$$

         – Ti(N): timing cost for N-bit multiplier
         – Tadd(N): timing cost for N-bit addition
         – TOD&Q(N): timing cost for the control in N-bit D&Q multiplier
         – TCBC(k): timing cost for the control with k base units in BC
           implementation
         – TBC(k): timing cost for “loop” overhead with k base units in BC
           implementation

   • With the given Viva design library, we have, for
     N < 192, k > 1,

             TOD&Q (N) = 3
             TCBC (k) = 2
             TBC(k) = k
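A minimal sketch of the recursive timing estimate with the Viva-library constants TOD&Q(N) = 3, TCBC(k) = 2 and TBC(k) = k. Several pieces are assumptions rather than values from the slides: the function names, the leaf latency `base_cycles`, the adder latency `t_add` (defaulting to one cycle regardless of width), and the adder widths N + 2 and N + ceil(N/k), which are one reading of the extracted formula. Absolute cycle counts are therefore not expected to match the published estimates.

```python
def ceil_div(n, d):
    return -(-n // d)

def timing_estimate(n_bits, scheme, base_cycles, t_add=lambda width: 1):
    """Estimated clock cycles for a hybrid multiplier M(n_bits) = scheme."""
    if not scheme:
        return base_cycles                 # leaf: built-in multiplier (assumed latency)
    m, rest = scheme[0], scheme[1:]
    if m == 1:
        # D&Q: sub-multipliers run in parallel, so one sub-latency is charged
        sub = timing_estimate(ceil_div(n_bits, 2), rest, base_cycles, t_add)
        return sub + 4 * t_add(n_bits + 2) + 3           # TOD&Q(N) = 3
    # BC: the loop body runs k times on k parallel sub-multipliers
    w = ceil_div(n_bits, m)
    sub = timing_estimate(w, rest, base_cycles, t_add)
    return m * sub + 2 * t_add(n_bits + w) + 2 + m       # TCBC(k) = 2, TBC(k) = k
```

The structure shows why placing the BC level deeper in the hierarchy is faster: the factor k multiplies only the latency of the levels below it.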




                 Comparison of Analytical and
                       Actual Results

       Hybrid Multiplier      Area (number of slices)      Timing (clock cycles)
          (m1 m2 …)           Estimated       Actual       Estimated      Actual
            {1 6}                6564           6587           43            43
            {6 1}                5508           5714           78            78
           {1 1 3}              12630          12140           32            32
           {1 3 1}              11046          11106           46            46
           {3 1 1}               9990          10257           60            60
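The agreement between the estimated and actual values can be checked directly from the table. The short sketch below copies the figures from the table; the variable and function names are illustrative only.

```python
# (estimated slices, actual slices, estimated cycles, actual cycles) per scheme
RESULTS = {
    "{1 6}":   (6564, 6587, 43, 43),
    "{6 1}":   (5508, 5714, 78, 78),
    "{1 1 3}": (12630, 12140, 32, 32),
    "{1 3 1}": (11046, 11106, 46, 46),
    "{3 1 1}": (9990, 10257, 60, 60),
}

def relative_error(estimated, actual):
    """Relative error of an estimate with respect to the measured value."""
    return abs(estimated - actual) / actual

area_errors = {
    scheme: relative_error(est_area, act_area)
    for scheme, (est_area, act_area, _, _) in RESULTS.items()
}
timing_exact = all(est == act for _, _, est, act in RESULTS.values())

for scheme, err in area_errors.items():
    print(f"{scheme:8s} area error = {100 * err:.2f}%")
print("timing estimates exact:", timing_exact)
```

The largest area error belongs to {1 1 3} and is a little over 4%, while every timing estimate equals the measured cycle count.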




                                      Summary
• Rapid estimation of the design cost for the hybrid
  multiplier architecture
• With the given Viva library, we are able to estimate the
  cycle count of a hybrid multiplier accurately
• The relative error of the area estimation is within 5%
• Future work
      – Estimation of communication cost
      – Investigation of efficient hierarchical
        allocation/partition/mapping/scheduling techniques



