

AES implementation on 8-bit microcontroller

Sungha Kim, Ingrid Verbauwhede

Department of Electrical Engineering
University of California, Los Angeles
Los Angeles, CA 90024

Abstract
The security of sensor networks is increasingly important. However, the physical and computational limitations of sensor nodes make it challenging to achieve the required security level. In this project, I propose a highly optimized Rijndael implementation on Atmel's AVR™ microcontroller that uses 16 registers to store the state. This implementation is 40% faster and smaller than the other implementations reported in the AES proposal [1].

1 Introduction
8-bit microcontrollers are used in a wide range of applications, such as wireless sensor networks for environmental monitoring and ad-hoc networks on the battlefield. Smart cards are also equipped with 8-bit microcontrollers. Despite their limited computation and power, security is one of the most important issues for these applications. Until recently, most efforts to provide reliable security assumed that sufficient computation and power were available. Applying these approaches to 8-bit microcontrollers raises new issues because of their limitations. An 8-bit assembly environment is severely limited in both computational functionality and data management: a single data transfer moves at most 8 bits (values below 2^8), and the fundamentally limited functionality of assembly language makes every transformation more complicated. Moreover, the small number of registers (32) barely satisfies the requirements of arithmetic and logic operations, so an additional register assignment schedule is needed.

1.1 Notation
The following conventions are used to describe the operations in this paper.

    Nb: input block length divided by 32
    Nk: key length divided by 32
    Nr: number of rounds
    State: the intermediate cipher result
    Sub state: an 8-bit unit of the state; a 128-bit block has 16 sub states of 8 bits each
    GF: finite field (Galois field)
    Indirect address registers: register pairs that hold the address for loads from and stores to memory,
        X = R27|R26, Y = R29|R28, Z = R31|R30

2 Related Work
This section briefly introduces other implementations of Rijndael. Rijndael can be implemented very efficiently on a wide range of processors and in hardware. Rafael R. Sevilla implemented it in 80186 assembly, and Geoffrey Keating's Motorola 6805 implementation is also available on the Rijndael site [3].

3 What is Rijndael?
In early August 1999, NIST selected MARS, RC6, Rijndael, Serpent and
Twofish as finalist candidates for the AES (Advanced Encryption Standard). The Rijndael block cipher was finally announced as the AES in FIPS-197 [2] in 2001. Rijndael has a variable block length and key length: the currently specified keys and blocks can each be 128, 192, or 256 bits long (all nine combinations of key length and block length are possible), and both lengths can be extended in multiples of 32 bits. Moreover, its operations are based on 8-bit sub states, which makes it particularly well suited to 8-bit processors.

 Rijndael(State, CipherKey)
 {
     KeyExpansion(CipherKey, ExpandedKey);
     AddRoundKey(State, ExpandedKey);
     for (i = 1; i < Nr; i++)
         Round(State, ExpandedKey(i));
     FinalRound(State, ExpandedKey(Nr));
 }
           Fig1. Rijndael algorithm

Fig1 shows the whole Rijndael procedure. Like DES and other block ciphers, Rijndael consists of a number of rounds, Nr, which is determined by the input block length (Nb) and the key length (Nk).

 Round(State, RoundKey)
 {
     ByteSub(State);
     ShiftRow(State);
     MixColumn(State);
     AddRoundKey(State, RoundKey);
 }
           Fig2. Round of Rijndael

Four different transformations constitute each round. The final round is the same as a normal round without MixColumn.

3.1 ByteSub operation
The ByteSub transformation is a non-linear byte substitution that takes an 8-bit sub state as input and produces a new sub state of the same size. The output is defined by the S-box, a 16 by 16 table of bytes (256 bytes of memory).

           Fig3. SubByte operation

3.2 ShiftRow operation
The ShiftRow transformation is applied individually to each of the last three rows of the state. Each of these rows is cyclically shifted by a different number of bytes, determined by the block length; for a 128-bit block the offsets are 1, 2 and 3 bytes.

           Fig4. ShiftRow operation

3.3 MixColumn operation
The MixColumn transformation operates independently on every column of the state. It treats the sub states of a column as the coefficients of a polynomial a(x) over GF(2^8) and computes b(x) = c(x) ⊗ a(x) modulo x^4 + 1, where c(x) = '03'x^3 + '01'x^2 + '01'x + '02'. For example, in Fig5 column j gives a(x) = a3,j x^3 + a2,j x^2 + a1,j x + a0,j, which is used as the multiplicand of the operation.

           Fig5. MixColumn operation

3.4 AddRoundKey operation
The AddRoundKey operation is simply a bitwise EXOR of the round key and the state.

           Fig6. AddRoundKey operation
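
The round structure of Fig1 and Fig2 and the four transformations can be illustrated with a small Python reference model. This is not the authors' AVR assembly; it is a minimal sketch assuming a 128-bit block (Nb = 4) and the FIPS-197 byte order, in which the 16 sub states are stored column by column. All function names are illustrative. The S-box is computed here with the inverse-plus-affine construction (the slower alternative discussed for ByteSub in Section 5.1) instead of being stored as a constant table, the '03' products are formed as '02'·a ⊕ a, and key expansion is omitted, so round keys are passed in directly.

```python
from __future__ import annotations

# Reference model of one Rijndael round for a 128-bit block (Nb = 4).
# The state is a list of 16 sub states in column order: state[r + 4*c]
# holds row r, column c.


def num_rounds(nb: int, nk: int) -> int:
    """Number of rounds Nr = max(Nb, Nk) + 6 (10 for 128-bit block and key)."""
    return max(nb, nk) + 6


def xtime(a: int) -> int:
    """Multiply a sub state by '02' in GF(2^8), reducing by x^8+x^4+x^3+x+1."""
    a <<= 1
    return a ^ 0x11B if a & 0x100 else a


def gf_mul(a: int, b: int) -> int:
    """General GF(2^8) multiplication by repeated xtime (shift-and-add)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse computed as a^254; '00' is mapped onto itself."""
    if a == 0:
        return 0
    result, power, e = 1, a, 254
    while e:  # square-and-multiply
        if e & 1:
            result = gf_mul(result, power)
        power = gf_mul(power, power)
        e >>= 1
    return result


def rotl8(v: int, n: int) -> int:
    """Rotate an 8-bit value left by n bits."""
    return ((v << n) | (v >> (8 - n))) & 0xFF


def build_sbox() -> list[int]:
    """ByteSub table: GF(2^8) inverse followed by the affine map over GF(2)."""
    sbox = []
    for x in range(256):
        b = gf_inv(x)
        sbox.append(b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63)
    return sbox


def sub_bytes(state: list[int], sbox: list[int]) -> list[int]:
    return [sbox[s] for s in state]


def shift_rows(state: list[int]) -> list[int]:
    """Row r is rotated left by r bytes (offsets 0, 1, 2, 3 for Nb = 4)."""
    return [state[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4)]


def mix_column(col: list[int]) -> list[int]:
    """One column times the MixColumn matrix; '03'*a is xtime(a) ^ a."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ xtime(a1) ^ a1 ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ xtime(a2) ^ a2 ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ xtime(a3) ^ a3,
        xtime(a0) ^ a0 ^ a1 ^ a2 ^ xtime(a3),
    ]


def mix_columns(state: list[int]) -> list[int]:
    return [b for c in range(4) for b in mix_column(state[4 * c:4 * c + 4])]


def add_round_key(state: list[int], round_key: list[int]) -> list[int]:
    return [s ^ k for s, k in zip(state, round_key)]


def rijndael_round(state, round_key, sbox, final=False):
    """Fig2 order: ByteSub, ShiftRow, MixColumn (skipped in FinalRound), AddRoundKey."""
    state = shift_rows(sub_bytes(state, sbox))
    if not final:
        state = mix_columns(state)
    return add_round_key(state, round_key)
```

For instance, build_sbox()[0x53] is 0xED, and mix_column([0xDB, 0x13, 0x53, 0x45]) returns [0x8E, 0x4D, 0xA1, 0xBC], a commonly used MixColumn test column.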

4 Advantages of Rijndael for 8-bit implementation
Because no high-level compiler, only an assembler, was available for the 8-bit microcontroller targeted here, the implementation has to follow the characteristics of assembly language. However, every Rijndael transformation ultimately decomposes into operations on minimal 8-bit sub states. Therefore, as long as each 8-bit sub state is loaded and stored at the right time and in the right place, working with 8-bit operations is optimal.

4.1 SubByte operation
SubByte is a byte substitution performed on each sub state independently, so its operand is exactly 8 bits.

4.2 MixColumn operation
The matrix in Fig7 describes the MixColumn operation. As the matrix shows, this operation can also be computed byte by byte: each output sub state b_i is a GF(2^8) combination of the four input sub states a_0 to a_3 of the same column.

$$
\begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{pmatrix}
=
\begin{pmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{pmatrix}
\begin{pmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{pmatrix}
$$
           Fig7. Matrix for MixColumn operation

4.3 AddRoundKey operation
Round key addition is a bitwise EXOR applied between the round key and the state. Each 8-bit sub state of the state maps one-to-one onto a byte of the round key. Therefore this operation is also byte oriented.

5 Implementation
For the implementation on Atmel's AVR microcontroller, AVR Studio v3.55, provided by Atmel, was chosen for convenience. It provides an integrated development environment that includes both an assembler and a simulator. All cycle counts and code sizes reported here were obtained with AVR Studio. Every design decision is a tradeoff between speed and code size.

5.1 ByteSub operation
Byte substitution can be done in two different ways. The first computes the multiplicative inverse in GF(2^8), with '00' mapped onto itself, and then applies an affine transformation over GF(2), defined by an 8 by 8 matrix, to produce the output. This approach sacrifices speed because of its many extra operations. The second uses an S-box lookup, which gains speed but costs code size, because the S-box must be stored in memory. This implementation uses the S-box approach.

5.2 MixColumn operation
The MixColumn operation requires a number of multiplications in GF(2^8). Every finite field multiplication can be done with tables, namely Log and Antilog tables. As with ByteSub, tables need more memory space while increasing speed. In MixColumn, however, the multiplier is restricted to '01', '02' and '03', and the '03' product can be derived from the '01' and '02' products ('03'·a = '02'·a ⊕ a). To exploit this property, direct multiplication was chosen instead of tables.

5.3 Storing state
Each repeated round comprises four different transformations. Each transformation takes the state as input and produces a new state as output, so data transfers between the state and the registers happen very frequently. In addition, each sub state must be in a register to serve as an operand of arithmetic and logic operations. Where the state is stored is therefore a critical issue. If the state stays in memory, it is easy to fetch each 8-bit sub state into a register through the indirect address registers X, Y and Z. However, this approach consumes more cycles, because each memory-to-register or register-to-memory transfer takes 2 cycles, twice as many as a register-to-register transfer. If all sub states can be kept in registers, memory is referenced only for the round keys and the S-box, which cannot be stored in registers.
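
To make the 2-cycle versus 1-cycle argument concrete, here is a minimal Python cost model. It is a hypothetical sketch, not a measurement: it assumes only the per-transfer costs stated above (2 cycles for a memory access through X, Y or Z, 1 cycle for a register-to-register move) and an illustrative workload in which each transformation reads and writes back all 16 sub states once. Arithmetic cycles are ignored, and the names are illustrative.

```python
# Hypothetical transfer-cost model for the two state-storage schemes.
# Only data movement is counted; the arithmetic of each transformation is ignored.
MEMORY_ACCESS_CYCLES = 2   # ld/st through X, Y or Z (cost stated above)
REGISTER_MOVE_CYCLES = 1   # mov between two registers
STATE_BYTES = 16           # a 128-bit state holds 16 sub states


def transfer_cycles(transformations: int, state_in_registers: bool) -> int:
    """Cycles to read and write back every sub state once per transformation."""
    per_access = REGISTER_MOVE_CYCLES if state_in_registers else MEMORY_ACCESS_CYCLES
    return transformations * STATE_BYTES * 2 * per_access  # 2 = one read + one write


# One full round applies four transformations.
memory_scheme = transfer_cycles(4, state_in_registers=False)
register_scheme = transfer_cycles(4, state_in_registers=True)
print(memory_scheme, register_scheme)  # 256 128
```

Under these assumptions the memory scheme spends twice the transfer cycles of the register scheme. The model overstates the register scheme's cost, since a sub state that is operated on in place in the register holding it needs no move at all.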

    1) memory-register scheme
       st   Y+, Rr      ; store a sub state, then increment Y (2 cycles)
       ld   Rd, Y+      ; load a sub state, then increment Y (2 cycles)
    2) register-register scheme
       mov  Rd, Rr      ; copy a sub state between registers (1 cycle)
    Fig8. Storing state issue (Rd/Rr: destination/source register)
          1) memory-register scheme
          2) register-register scheme

The register-register scheme requires the whole state to stay in registers, which means only 16 of the 32 registers are available for the operations. In this implementation, these 16 registers barely meet the register demand of the MixColumn operation. Even the high bytes of the indirect address registers, R27, R29 and R31, are used as general-purpose registers while these pairs are still used for storing addresses. The state is stored in registers R0 to R15.

6 Simulation result
The code was run on AVR Studio v3.55 with the test vectors from Brian Gladman's technical paper [5]. The implementation was optimized many times. Of the many versions, two simulation results are presented, which differ in how the state is stored: one stores the state in memory and the other in registers.

    Module            Cycle    Code (Byte)
    Precomputation     2648       1848
    ByteSub             161         32
    ShiftRow             64        256
    MixColumn           288        134
    AddRoundKey         163         48
    branch               23         12
    Total              9306       2714
    (round0-10)        6658        866
    Table1. Simulation result of memory version

The totals take the number of rounds into account. In this implementation, the block and key lengths are both 128 bits, so one encryption consists of 10 rounds. Over the 10 rounds of encryption, precomputation is executed just once, which gives it weight factor 1, while the other modules are executed different numbers of times, as shown in the Cycle column of Table2.

    Module            Cycle weight    Code weight
    Precomputation          1               1
    ByteSub                10               2
    ShiftRow               10               2
    MixColumn               9               1
    AddRoundKey            11               3
    branch                  1               1
    Table2. Weight factor of each module

The code-size weight factors, however, are independent of the cycle weight factors, because the code of some modules is reused every time while that of other modules is not. For example, the AddRoundKey module is executed in round 0, in rounds 1-9 and in round 10, whereas MixColumn is executed only in rounds 1-9, which gives code weight factors of 3 and 1, respectively. With these weight factors, the total number of cycles and the total code size are computed as follows:

$$\text{Total cycle number} = \sum_i \text{Cycle number}(i) \times \text{weight factor}(i)$$
$$\text{Total code size} = \sum_i \text{Code size}(i) \times \text{weight factor}(i)$$

where i ranges over the modules, using the cycle weight factor for the cycle total and the code weight factor for the code total.

The (round0-10) row gives the pure execution cost of the encryption rounds, excluding precomputation. Precomputation consists of writing the S-box to memory, writing the key to memory, key expansion and data block input; these phases happen just once in the whole lifetime of a sensor. Once done, the precomputation can be reused afterwards as long as neither the key nor the S-box is updated. The data block can be assumed to be already in registers R0 to R15 beforehand; this can be achieved simply by interconnecting the radio with the microcontroller.

Table3 shows a much improved result after storing the state in registers. The improvement is 43% in cycle count and 42% in code size.
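
The two weighted-sum formulas can be checked against the tables with a short Python sketch. It uses only the figures printed in Table1 and Table2 and the (round0-10) totals reported for the memory and register versions (6658 and 3815 cycles, 866 and 504 bytes); the dictionary and function names are illustrative.

```python
# Figures copied from Table1 (memory version): module -> (cycles, code bytes).
TABLE1 = {
    "Precomputation": (2648, 1848),
    "ByteSub": (161, 32),
    "ShiftRow": (64, 256),
    "MixColumn": (288, 134),
    "AddRoundKey": (163, 48),
    "branch": (23, 12),
}

# Weight factors copied from Table2: module -> (cycle weight, code weight).
WEIGHTS = {
    "Precomputation": (1, 1),
    "ByteSub": (10, 2),
    "ShiftRow": (10, 2),
    "MixColumn": (9, 1),
    "AddRoundKey": (11, 3),
    "branch": (1, 1),
}


def weighted_totals(table, weights, skip=()):
    """Total cycle number and total code size as weighted sums over the modules."""
    cycles = sum(c * weights[m][0] for m, (c, _) in table.items() if m not in skip)
    code = sum(b * weights[m][1] for m, (_, b) in table.items() if m not in skip)
    return cycles, code


def improvement(before, after):
    """Relative reduction in percent."""
    return 100.0 * (before - after) / before


total = weighted_totals(TABLE1, WEIGHTS)                                   # Total row
rounds_only = weighted_totals(TABLE1, WEIGHTS, skip=("Precomputation",))   # (round0-10) row
# (round0-10) totals of the memory and register versions.
speed_gain = improvement(6658, 3815)
size_gain = improvement(866, 504)
print(total, rounds_only, round(speed_gain), round(size_gain))
# (9306, 2714) (6658, 866) 43 42
```

The output reproduces the Total and (round0-10) rows of Table1 and the 43% and 42% improvements quoted above.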

    Module            Cycle    Code (Byte)
    Precomputation     2648       1848
    ByteSub              49         66
    ShiftRow             16         32
    MixColumn           231        104
    AddRoundKey          48         64
    branch               45         12
    Total              6463       2352
    (round0-10)        3815        504
    Table3. Simulation result of register version

7 Evaluation

    Table4. Speed improvement by register (bar chart: number of cycles per module, ByteSub, ShiftRow, MixColumn, AddKey and Branch, for the memory and register versions)

The fully register-based implementation produced remarkably improved results, as shown in Table4 and Table5. The slight increase in time consumption in the branch module stems from the complicated register assignment scheduling. This sacrifice brings register utilization close to full, which means that most data transfers happen between registers. The MixColumn module still accounts for a large share of the cycles. It is also the critical part for register scheduling, because it needs four multiplications on four different sub states at the same time while keeping their initial values. Therefore the indirect address registers were temporarily used for normal operations. For the other modules, register utilization is slightly over 50%, which leaves room for further efficiency improvements. Table5 shows the size improvement after storing the state in registers.

    Table5. Size improvement by register (bar chart: code size in bytes per module, ByteSub, ShiftRow, MixColumn, AddKey and Branch, for the memory and register versions)

Compared with other implementations, the result is better, most clearly in code size. Table6 and Table7 are taken from the Rijndael proposal [1] and show the execution time and code size of other implementations for different key and block lengths.

    Key/Block length    Number of cycles    Code length
    (128, 128) a)        4065 cycles         768 bytes
    (128, 128) b)        3744 cycles         826 bytes
    (128, 128) c)        3168 cycles        1016 bytes
    (192, 128)           4512 cycles        1125 bytes
    (256, 128)           5221 cycles        1041 bytes
    Table6. Execution time and code size of Rijndael in Intel 8051 assembler

In the case of the Intel 8051 assembler, the number of cycles decreases as the code size increases. The same trade of code size for speed could also be made on the AVR microcontroller by performing the multiplications with tables.

    Key/Block length    Number of cycles    Code length
    (128, 128) a)        8390 cycles         919 bytes
    (192, 128)          10780 cycles        1170 bytes
    (256, 128)          12490 cycles        1135 bytes
    Table7. Execution time and code size of Rijndael in Motorola 68HC08 assembler

8 Summary and future work
The Rijndael implementation that uses 16 registers to store the state improves efficiency by over 40% in both speed and code size. The other 16 registers barely meet the needs of the logical and arithmetic operations. Therefore, an input block larger than 128 bits would need a different register assignment scheme. So far, a 128-bit input block is the best fit for a Rijndael implementation on an 8-bit microcontroller.

[1]  AES proposal: Rijndael, Joan Daemen, Vincent
[2]  FIPS-197, Advanced Encryption Standard (AES), NIST, November 2001
[3]  Rijndael home site
[4]  SPINS: Security Protocols for Sensor Networks, Adrian Perrig, Robert Szewczyk, Victor Wen, David Culler and J. D. Tygar, MobiCom, July 2001
[5]  A Specification for Rijndael, the AES Algorithm v3.3, Brian Gladman, May 2002
[6]  A Communications Security Architecture and Cryptographic Mechanisms for Distributed Sensor, DARPA SensIT Workshop, Oct 8, 1999


