					     Chapter 12
Software Optimisation
                      Software Optimisation Chapter
           This chapter consists of three parts:

                     - Part 1: Optimisation Methods.
                     - Part 2: Software Pipelining.
                     - Part 3: Multi-cycle Loop Pipelining.




Chapter 12, Slide 2                        Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2004
          Chapter 12
   Software Optimisation
Part 1 - Optimisation Methods
                                  Objectives
                     - Introduction to optimisation and the
                       optimisation procedure.
                     - Optimisation of C code using the code
                       generation tools.
                     - Optimisation of assembly code.

                                    Introduction
                     - Software optimisation is the process of
                       manipulating software code to achieve
                       two main goals:
                         - Faster execution time.
                         - Smaller code size.
           Note: It will be shown that in general there
                 is a trade-off between faster
                 execution time and smaller code size.

                                   Introduction
                     - To implement efficient software, the
                       programmer must be familiar with:
                         - The processor architecture.
                         - The programming language (C, assembly or
                           linear assembly).
                         - The code generation tools (compiler,
                           assembler and linker).

                      Code Optimisation Procedure

   (Flowchart, summarised as steps.)

    1. Optimise the algorithm.
    2. Program in C and compile without any optimisation.
    3. Code functioning? If not, make the necessary corrections.
    4. Profile the code. If the result is satisfactory, no further
       optimisation is required.
    5. Otherwise use intrinsics and profile the code again. If the
       result is satisfactory, no further optimisation is required.
    6. Otherwise set n = 0 and compile the code with the -On option.
    7. Code functioning? If not, make the necessary corrections.
    8. Profile the code. If the result is satisfactory, no further
       optimisation is required.
    9. Otherwise, if n < 3, pass to the next step of optimisation
       (n = n + 1) and compile again with the -On option.
   10. Once the -On levels are exhausted, identify the functions to be
       further optimised from the profiling result and convert the code
       needing optimisation to linear assembly.
   11. Code functioning? If not, make the necessary corrections.
   12. If the result is satisfactory, no further optimisation is
       required. Otherwise write the code in hand assembly.

                      Code Optimisation Procedure

   Stages of the optimising compiler:

       C source file (.c) -> Parser -> (.if) -> Optimiser -> (.opt) -> Code generator -> (.asm)

                      Optimising C Compiler Options
                     - The 'C6x optimising C compiler takes ANSI C
                       source code and can currently achieve up to
                       about 80% of the performance of hand-scheduled
                       assembly.
                     - However, to achieve this level of
                       optimisation, knowledge of the different levels
                       of optimisation is essential. Optimisation is
                       performed at different stages and levels.

                            Assembly Optimisation
                      - To develop an appreciation of how to
                        optimise code, let us optimise an FIR
                        filter:

                            y[n] = \sum_{k=0}^{N-1} h[k] \, x[n-k]

                      - For simplicity we write:

                            y[n] = \sum_{i=0}^{N-1} h[i] \, x[i]          [1]

                             Assembly Optimisation
                      - To implement Equation 1, we need to
                        perform the following steps:
                       (1) Load the sample x[i].
                       (2) Load the coefficients h[i].
                       (3) Multiply x[i] and h[i].
                       (4) Add (x[i] * h[i]) to the content of an
                           accumulator.
                       (5) Repeat steps 1 to 4 N-1 times.
                       (6) Store the value in the accumulator to y.
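
The six steps above can be sketched in C before moving to assembly. This is a minimal illustration only: the function name fir_dot and the 32-bit accumulator are assumptions made here, not part of the original slides.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of products of n samples and n coefficients (Equation 1).
   fir_dot is a hypothetical name used for illustration. */
int32_t fir_dot(const int16_t *x, const int16_t *h, size_t n)
{
    int32_t acc = 0;                        /* accumulator starts at zero */
    for (size_t i = 0; i < n; i++) {        /* (5) repeat for every tap   */
        int16_t xi = x[i];                  /* (1) load the sample        */
        int16_t hi = h[i];                  /* (2) load the coefficient   */
        int32_t prod = (int32_t)xi * hi;    /* (3) multiply               */
        acc += prod;                        /* (4) accumulate             */
    }
    return acc;                             /* (6) caller stores this in y */
}
```

Calling fir_dot(x, h, N) and storing the return value in y corresponds to step 6.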



                               Assembly Optimisation
                      - Steps 1 to 6 can be translated into the
                        following 'C6x assembly code:
                       MVK   .S2   N,B0       ;   Initialise the loop counter to N (number of taps)
                       MVK   .S1   0,A5       ;   Initialise the accumulator
             loop      LDH   .D1   *A8++,A2   ;   Load the sample x[i]
                       LDH   .D1   *A9++,A3   ;   Load the coefficient h[i]
                       NOP         4          ;   LDH has a latency of 5 cycles
                       MPY   .M1   A2,A3,A4   ;   Multiply x[i] and h[i]
                       NOP                    ;   MPY has a latency of 2 cycles
                       ADD   .L1   A4,A5,A5   ;   Add x[i] * h[i] to the accumulator
             [B0]      SUB   .L2   B0,1,B0    ;   Decrement the loop counter (loop overhead)
             [B0]      B     .S1   loop       ;   Branch back to loop (loop overhead)
                       NOP         5          ;   The branch has a latency of 6 cycles

                            Assembly Optimisation
                      - In order to optimise the code, we need
                        to:
                       (1) Use instructions in parallel.
                       (2) Remove the NOPs.
                       (3) Remove the loop overhead (remove SUB
                           and B: loop unrolling).
                       (4) Use word access or double-word access
                           instead of byte or half-word access.

                Step 1 - Using Parallel Instructions
 Cycle
                   .D1   .D2   .M1   .M2   .L1        .L2           .S1           .S2           NOP
    1              ldh
    2                    ldh
    3                                                                                           nop
    4                                                                                           nop
    5                                                                                           nop
    6                                                                                           nop
    7                          mpy
    8                                                                                           nop
    9                                      add
    10                                                sub
    11                                                                b
    12                                                                                          nop
    13                                                                                          nop
    14                                                                                          nop
    15                                                                                          nop
    16                                                                                          nop
                Step 1 - Using Parallel Instructions
 Cycle   .D1   .D2   .M1   .M2   .L1   .L2   .S1   .S2   NOP
    1    ldh   ldh
    2                                                    nop
    3                                                    nop
    4                                                    nop
    5                                                    nop
    6                mpy
    7                                                    nop
    8                            add
    9                                  sub
   10                                        b
   11                                                    nop
   12                                                    nop
   13                                                    nop
   14                                                    nop
   15                                                    nop
   16
     Note: Not all instructions can be put in parallel since the
           result of one unit is used as an input to the following
           unit.

                         Step 2 - Removing the NOPs
 Cycle   .D1   .D2   .M1   .M2   .L1   .L2   .S1   .S2   NOP
    1    ldh   ldh
    2                                  sub
    3                                        b
    4                                                    nop
    5                                                    nop
    6                mpy
    7                                                    nop
    8                            add

             loop      LDH   .D1   *A8++,A2
                       LDH   .D1   *A9++,A3
             [B0]      SUB   .L2   B0,1,B0
             [B0]      B     .S1   loop
                       NOP         2
                       MPY   .M1   A2,A3,A4
                       NOP
                       ADD   .L1   A4,A5,A5

                                Step 3 - Loop Unrolling
                      - The SUB and B instructions consume at
                        least two extra cycles per iteration (this
                        is known as branch overhead).

     Loop with branch overhead:

     loop        LDH      .D1    *A8++,A2
                 LDH      .D1    *A9++,A3
     [B0]        SUB      .L2    B0,1,B0
     [B0]        B        .S1    loop
                 NOP             2
                 MPY      .M1    A2,A3,A4
                 NOP
                 ADD      .L1    A4,A5,A5

     Unrolled code (SUB and B removed):

                 LDH      .D1    *A8++,A2       ; Start of iteration 1
            ||   LDH      .D2    *B9++,B3
                 NOP             4
                 MPY      .M1X   A2,B3,A4       ; Use of cross path
                 NOP
                 ADD      .L1    A4,A5,A5

                 LDH      .D1    *A8++,A2       ; Start of iteration 2
            ||   LDH      .D2    *B9++,B3
                 NOP             4
                 MPY      .M1X   A2,B3,A4
                 NOP
                 ADD      .L1    A4,A5,A5
     ;               :
     ;               :
     ;               :
                 LDH      .D1    *A8++,A2       ; Start of iteration n
            ||   LDH      .D2    *B9++,B3
                 NOP             4
                 MPY      .M1X   A2,B3,A4
                 NOP
                 ADD      .L1    A4,A5,A5
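
The same idea can be sketched in C. Note that the assembly above is fully unrolled (no loop at all), whereas this sketch unrolls by a factor of 4 and keeps a short clean-up loop. The function names and the unroll factor are assumptions chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Rolled loop: one counter update and one branch per product. */
int32_t dot_rolled(const int16_t *x, const int16_t *h, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)x[i] * h[i];
    return acc;
}

/* Unrolled by 4: the loop overhead is paid once per four products,
   at the cost of a larger loop body (more code). */
int32_t dot_unrolled4(const int16_t *x, const int16_t *h, size_t n)
{
    int32_t acc = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc += (int32_t)x[i]     * h[i];
        acc += (int32_t)x[i + 1] * h[i + 1];
        acc += (int32_t)x[i + 2] * h[i + 2];
        acc += (int32_t)x[i + 3] * h[i + 3];
    }
    for (; i < n; i++)                      /* remaining 0 to 3 products */
        acc += (int32_t)x[i] * h[i];
    return acc;
}
```

Both functions return the same result; only the number of branches and the code size differ.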

           Step 4 - Word or Double Word Access
            - The 'C6711 has two 64-bit data buses
              for data memory access, so up to two
              64-bit values can be loaded into the
              registers at a time (see Chapter 2).
            - In addition, the 'C6711 devices have
              variants of the multiplication
              instruction to support different
              operations (see Chapter 2).
           Note: A store can only be up to 32 bits wide.

            Step 4 - Word or Double Word Access
                      - Using word access, MPY and MPYH, the
                        previous code can be written as:
       loop
                        LDW    .D1   *A9++,A3 ; 32-bit word is loaded in a single cycle
       ||               LDW    .D2   *B6++,B1
                        NOP          4
       [B0]             SUB    .L2   B0,1,B0
       [B0]             B      .S1   loop
                        NOP          2
                        MPY    .M1X  A3,B1,A4 ; Product of the lower 16-bit halves
       ||               MPYH   .M2X  A3,B1,B3 ; Product of the upper 16-bit halves
                        NOP
                        ADD    .L1X  A4,B3,A5 ; Sum of the two products

                      Note: By loading words and using the MPY and
                       MPYH instructions, the execution time has
                       been halved, since two 16x16-bit
                       multiplications are performed in each iteration.
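
A rough C model of the word-access idea follows. The helpers mpy_low and mpy_high are hypothetical stand-ins for the MPY and MPYH instructions, and the sketch assumes an even number of 16-bit elements; it is not compiler-generated code. Because both arrays are read the same way, the pairing of halves is consistent whatever the byte order of the host.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Interpret the low 16 bits of a value as a signed number. */
static int32_t as_signed16(uint32_t bits)
{
    bits &= 0xFFFFu;
    return (bits < 0x8000u) ? (int32_t)bits : (int32_t)bits - 0x10000;
}

/* Hypothetical model of MPY: signed product of the lower halves. */
static int32_t mpy_low(uint32_t a, uint32_t b)
{
    return as_signed16(a) * as_signed16(b);
}

/* Hypothetical model of MPYH: signed product of the upper halves. */
static int32_t mpy_high(uint32_t a, uint32_t b)
{
    return as_signed16(a >> 16) * as_signed16(b >> 16);
}

/* Dot product that reads two 16-bit elements per 32-bit load.
   Assumes n is even. */
int32_t dot_words(const int16_t *x, const int16_t *h, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i + 1 < n; i += 2) {
        uint32_t xw, hw;
        memcpy(&xw, &x[i], sizeof xw);      /* one word = two samples      */
        memcpy(&hw, &h[i], sizeof hw);      /* one word = two coefficients */
        acc += mpy_low(xw, hw);
        acc += mpy_high(xw, hw);
    }
    return acc;
}
```

Each pass through the loop performs two multiplications, so the loop runs n/2 times instead of n.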
                             Optimisation Summary
                      - It has been shown that there are four
                        complementary methods for code
                        optimisation:
                          - Using instructions in parallel.
                          - Filling the delay slots with useful code.
                          - Using word or double word load.
                          - Loop unrolling.

              The first three increase performance and reduce code size.
              Loop unrolling increases performance but increases code size.

         Chapter 12
  Software Optimisation
Part 2 - Software Pipelining
                                  Objectives
                      - Why use software pipelining (SP)?
                      - Understand software pipelining
                        concepts.
                      - Use the software pipelining procedure.
                      - Code the word-wide software pipelined
                        dot-product routine.
                      - Determine if your pipelined code is
                        more efficient with or without prolog
                        and epilog.

              Why use Software Pipelining (SP)?
               - SP creates highly optimized loop code by:
                      - Putting several instructions in parallel.
                      - Filling delay slots with useful code.
                      - Maximizing the use of the functional units.
               - SP is implemented by simply using the tools:
                      - Compiler options -o2 or -o3.
                      - The Assembly Optimizer, for linear assembly (.sa) files.

                        Software Pipeline Concept
                 To explain the concept of software pipelining,
                 we will assume that all instructions execute in
                 one cycle.

                            LDH
                       ||   LDH
                            MPY
                            ADD

                 How many cycles would it take to perform this
                 loop 5 times? (Disregard delay slots.)

                 ______________ cycles

                        Software Pipeline Example

                            LDH
                       ||   LDH
                            MPY
                            ADD

                 How many cycles would it take to perform this
                 loop 5 times? (Disregard delay slots.)

                 5 x 3 = 15 cycles

                 Let's examine hardware (functional units) usage ...

                             Non-Pipelined Code
 Cycle   .D1   .D2   .M1   .M2   .L1   .L2   .S1   .S2
    1    ldh   ldh
    2                mpy
    3                            add
    4    ldh   ldh
    5                mpy
    6                            add
    7    ldh   ldh
    8                mpy
    9                            add

                              Pipelining Code
 Cycle   .D1   .D2   .M1   .M2   .L1   .L2   .S1   .S2
    1    ldh   ldh
    2    ldh   ldh   mpy
    3    ldh   ldh   mpy         add
    4    ldh   ldh   mpy         add
    5    ldh   ldh   mpy         add
    6                mpy         add
    7                            add

             Pipelining these instructions took 1/2 the cycles!

             With pipelining, the five iterations take only 7 cycles
             instead of 15.
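
The two cycle counts can be checked with a small C sketch. It relies on the same simplification as the slides (every instruction takes one cycle and a new iteration can start every cycle); the function names are assumptions made for illustration.

```c
/* Non-pipelined: every iteration runs all of its stages in sequence. */
unsigned cycles_sequential(unsigned iterations, unsigned stages)
{
    return iterations * stages;
}

/* Pipelined: the first iteration needs `stages` cycles and every
   further iteration finishes one cycle after the previous one. */
unsigned cycles_pipelined(unsigned iterations, unsigned stages)
{
    return (iterations == 0) ? 0 : stages + iterations - 1;
}
```

With 5 iterations and 3 stages (load, multiply, add) these give 15 and 7 cycles respectively, matching the two tables.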

                                Pipelining Code
 Stage                           Cycle   .D1   .D2   .M1   .L1
 Prolog                             1    ldh   ldh
 (staging for loop)                 2    ldh   ldh   mpy
 Loop kernel                        3    ldh   ldh   mpy   add
 (single-cycle "loop"               4    ldh   ldh   mpy   add
  iterated three times)             5    ldh   ldh   mpy   add
 Epilog                             6                mpy   add
 (completing final operations)      7                      add

                                 Pipelined Code
                       prolog:           LDH        ; load 1
                                    ||   LDH

                                         MPY        ; mpy 1
                                    ||   LDH        ; load 2
                                    ||   LDH

                       loop:             ADD       ; add 1
                                    ||   MPY       ; mpy 2
                                    ||   LDH       ; load 3
                                    ||   LDH

                                         ADD       ; add 2
                                    ||   MPY       ; mpy 3
                                    ||   LDH       ; load 4
                                    ||   LDH
                                         .
                                         .
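
The prolog, kernel and epilog structure can also be written out in C as a rough model. In the kernel the three statements stand for instructions issued in the same cycle, so each one uses values produced in the previous pass. The function name dot_pipelined and the 32-bit accumulator are assumptions made for illustration; this is not the output of the code generation tools.

```c
#include <stddef.h>
#include <stdint.h>

/* Software-pipelined dot product with explicit prolog, kernel, epilog. */
int32_t dot_pipelined(const int16_t *m, const int16_t *n, size_t count)
{
    if (count == 0)
        return 0;
    if (count == 1)
        return (int32_t)m[0] * n[0];

    int32_t sum = 0;

    /* Prolog: stage the pipeline. */
    int16_t a = m[0], b = n[0];             /* load 1 */
    int32_t prod = (int32_t)a * b;          /* mpy 1  */
    a = m[1];                               /* load 2 */
    b = n[1];

    /* Kernel: add, multiply and load belong to three different
       iterations and would execute in parallel on the hardware. */
    for (size_t i = 2; i < count; i++) {
        sum += prod;                        /* add for iteration i-1 */
        prod = (int32_t)a * b;              /* mpy for iteration i   */
        a = m[i];                           /* load for iteration i+1 */
        b = n[i];
    }

    /* Epilog: drain the pipeline. */
    sum += prod;
    prod = (int32_t)a * b;
    sum += prod;
    return sum;
}
```

For count = 5 the kernel runs three times, as in the scheduling table.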
                       Software Pipelining Procedure


            1.         Write algorithm in C code & verify.
            2.         Write 'C6x Linear Assembly code.
            3.         Create dependency graph.
            4.         Allocate registers.
            5.         Create scheduling table.
            6.         Translate scheduling table to 'C6x code.

            Software Pipelining Example (Step 1)
                  /* Dot product of two arrays of 16-bit values. */
                  short DotP(short *m, short *n, short count)
                  {
                       int i;
                       short product;
                       short sum = 0;        /* accumulator */
                       for (i = 0; i < count; i++)
                       {
                           product = m[i] * n[i];
                           sum += product;
                       }
                       return(sum);
                  }

         Write code in Linear Assembly (Step 2)
                       ; for (i=0; i < count; i++)
                       ; prod = m[i] * n[i];
                       ; sum += prod;

                       loop:            ldh               *p_m++, m
                                        ldh               *p_n++, n
                                        mpy               m, n, prod
                                        add               prod, sum, sum

                         [count]        sub               count, 1, count
                         [count]        b                 loop

                        1. No NOPs required.
                        2. No parallel instructions required.
                        3. You don't have to specify:
                                  - Functional units, or
                                  - Registers.

                   Dependency Graph Terminology

   (Figure: two parent nodes, an LDH writing a and an LDH writing b, each
    on a .D unit, feed a child node, a NOT writing na on a .L unit. Each
    path is labelled 5. One is marked as a path and the other as a
    conditional path.)

                       Dependency Graph Steps
           (a) Draw the algorithm nodes and paths.
           (b) Write the number of cycles it takes for
               each instruction to complete execution.
           (c) Assign "required" functional units to each
               node.
           (d) Partition the nodes to A and B sides and
               assign sides to all functional units.

                           Dependency Graph (Step a)
                      - In this step each instruction is represented
                        by a node.
                      - The node is represented by a circle,
                        where:
                          - Outside: write the instruction.
                          - Inside: write the register where the result is written.
                      - Nodes are then connected by paths
                        showing the data flow.
           Note: Conditional paths are represented by
                 dashed lines.

                       Dependency Graph (Step a)
                       LDH                      LDH

                        m                        n

                              MPY                          SUB

                              prod                         count

                              ADD                           B

                              sum                          loop

           Paths: m -> prod, n -> prod, prod -> sum, sum -> sum,
                  count -> count, count -> loop.


                         Dependency Graph (Step b)
                      In this step the number of cycles it takes
                       for each instruction to complete execution
                       is added to the dependency graph.
                      It is written along the associated data
                       path.
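
           Step (b) can be sketched as a lookup that labels every data
           path with the cycle count of the instruction producing it. The
           cycle counts are the ones shown on the graph for this example;
           the names LATENCY, PATHS and label_paths are illustrative only.

```python
# Cycles each instruction needs before its result can be used,
# as written on the dependency graph for the dot-product example.
LATENCY = {"LDH": 5, "MPY": 2, "ADD": 1, "SUB": 1, "B": 6}

# (producing instruction, result produced, result of the consuming node)
PATHS = [
    ("LDH", "m", "prod"),
    ("LDH", "n", "prod"),
    ("MPY", "prod", "sum"),
    ("ADD", "sum", "sum"),
    ("SUB", "count", "count"),
    ("SUB", "count", "loop"),
]


def label_paths(paths, latency):
    """Attach the producer's cycle count to every data path."""
    return {(src, dst): latency[instr] for instr, src, dst in paths}
```

           The branch has no outgoing data path in this sketch; its 6
           cycles stay in LATENCY for use when the schedule is built.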




                       Dependency Graph (Step b)
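           Node (result)     Cycles   Path(s) labelled
           LDH  (m)          5        m -> prod
           LDH  (n)          5        n -> prod
           MPY  (prod)       2        prod -> sum
           ADD  (sum)        1        sum -> sum
           SUB  (count)      1        count -> count, count -> loop
           B    (loop)       6        loop -> branch target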
                           Dependency Graph (Step c)
                      In this step functional units are assigned to
                       each node.
                      It is advantageous to start allocating units
                       to instructions which require a specific
                       unit:
                          Load/Store.
                          Branch.
            We do not need to be concerned with
             multiply as this is the only operation that
             the .M unit performs.
           Note: The side is not allocated at this stage.
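
           Step (c) can be sketched as a "most constrained first"
           assignment. The sketch assumes two units of each type (one per
           side) and that ADD and SUB may use .L, .S or .D; trying .L first
           is an arbitrary preference chosen here, and all names below are
           illustrative.

```python
# Unit types each instruction may use. LDH, B and MPY have a single choice;
# ADD and SUB are listed in an assumed order of preference.
CANDIDATES = {
    "LDH": [".D"],
    "B": [".S"],
    "MPY": [".M"],
    "ADD": [".L", ".S", ".D"],
    "SUB": [".L", ".S", ".D"],
}
UNITS_PER_TYPE = 2  # one on side A, one on side B


def assign_units(nodes):
    """nodes: list of (result, instruction). Returns {result: unit type}."""
    free = {unit: UNITS_PER_TYPE for unit in (".L", ".S", ".M", ".D")}
    assignment = {}
    # Instructions with the fewest candidate units are placed first.
    for result, instr in sorted(nodes, key=lambda n: len(CANDIDATES[n[1]])):
        for unit in CANDIDATES[instr]:
            if free[unit] > 0:
                free[unit] -= 1
                assignment[result] = unit
                break
        else:
            raise ValueError(f"no free unit for {instr} ({result})")
    return assignment
```

           The side (1 or 2) is deliberately left out, as in step (c).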
                       Dependency Graph (Step c)
           Node (result)     Cycles   Unit
           LDH  (m)          5        .D
           LDH  (n)          5        .D
           MPY  (prod)       2        .M
           ADD  (sum)        1        (not yet assigned)
           SUB  (count)      1        (not yet assigned)
           B    (loop)       6        .S
                         Dependency Graph (Step d)
                      The data path is partitioned into side A
                       and B at this stage.
                      To optimise code we need to ensure that a
                       maximum number of units are used with a
                       minimum number of cross paths.
                      To make the partition visible on the
                       dependency graph a line is used.
                      The side can then be added to the
                       functional units associated with each
                       instruction or node.
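
           Step (d) can be checked with a short sketch that counts the
           paths crossing the A/B line and appends the side to each unit.
           The partition and paths below are one reading of the graph for
           this example; the function names are hypothetical.

```python
# Data paths, partition and unit types for the dot-product example.
PATHS = [("m", "prod"), ("n", "prod"), ("prod", "sum"),
         ("sum", "sum"), ("count", "count"), ("count", "loop")]
SIDE = {"m": "A", "prod": "A", "sum": "A", "n": "B", "count": "B", "loop": "B"}
UNIT = {"m": ".D", "n": ".D", "prod": ".M",
        "sum": ".L", "count": ".L", "loop": ".S"}


def cross_paths(paths, side):
    """Return the data paths whose two ends sit on different sides."""
    return [(src, dst) for src, dst in paths if side[src] != side[dst]]


def add_sides(unit, side, paths):
    """Turn '.M' into '.M1' or '.M2'; add 'x' if a cross path feeds the node."""
    crossing_targets = {dst for _, dst in cross_paths(paths, side)}
    named, taken = {}, set()
    for node, base in unit.items():
        full = base + ("1" if side[node] == "A" else "2")
        if full in taken:
            raise ValueError(f"{full} is needed twice in one cycle")
        taken.add(full)
        named[node] = full + ("x" if node in crossing_targets else "")
    return named
```

           A partition with fewer entries in cross_paths() is preferred,
           since each side has only one cross path available per cycle.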


                       Dependency Graph (Step d)
           Side A                              Side B
           Node (result)   Cycles   Unit       Node (result)   Cycles   Unit
           LDH  (m)        5        .D1        LDH  (n)        5        .D2
           MPY  (prod)     2        .M1x       SUB  (count)    1        .L2
           ADD  (sum)      1        .L1        B    (loop)     6        .S2

           Only one path crosses the partition: n (side B) feeds MPY on
           side A, hence the cross-path unit .M1x.
                       Software Pipelining Procedure


            1.         Write algorithm in C code & verify.
            2.         Write 'C6x Linear Assembly code.
            3.         Create a dependency graph (4 steps).
            4.         Allocate registers.
            5.         Create scheduling table.
            6.         Translate scheduling table to 'C6x code.



                  Step 4 - Allocate Functional Units
           Do we have enough functional units to code this algorithm in
           a single-cycle loop?

           Side A                        Side B
           Unit    Node                  Unit    Node
           .L1     sum                   .L2     count
           .M1     prod                  .M2
           .D1     m                     .D2     n
           .S1                           .S2     loop
           x1      .M1x                  x2





                         Step 4 - Allocate Registers

         Content of Register File A   Reg. A Reg. B         Content of Register File B

                                       A0     B0                         count
                       &a              A1     B1                           &b
                        a              A2     B2                              b
                       prod            A3     B3
                       sum             A4     B4
                                        ...    ...
                                       A15    B15

                   Step 5 - Create Scheduling Table
                                   PROLOG                                                LOOP
                       1   2   3     4          5              6              7              8
      .L1
      .L2
      .S1
      .S2
     .M1
     .M2
     .D1
     .D2
                How do we know the loop ends up in cycle 8?

                             Length of Prolog

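           Answer: Count up the length of the longest path through the
           dependency graph. In this case the longest path is

               LDH (m) --5--> MPY (prod) --2--> ADD (sum) --1-->

           so we have 5 + 2 + 1 = 8 cycles. The loop therefore falls in
           cycle 8 and the prolog occupies cycles 1 to 7.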
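
           The longest-path count can be sketched in a few lines of
           Python. The cycle counts are those of the dot-product graph;
           the feedback paths (sum -> sum, count -> count) are left out so
           that the graph has no cycles, and all names are illustrative.

```python
from functools import lru_cache

# Cycles needed by the instruction that produces each result.
LATENCY = {"m": 5, "n": 5, "prod": 2, "sum": 1, "count": 1, "loop": 6}

# Forward data paths only (feedback paths omitted).
FEEDS = {"m": ["prod"], "n": ["prod"], "prod": ["sum"],
         "sum": [], "count": ["loop"], "loop": []}


@lru_cache(maxsize=None)
def path_length(node):
    """Cycles from issuing `node` to the end of the longest path below it."""
    return LATENCY[node] + max((path_length(n) for n in FEEDS[node]), default=0)


def loop_cycle():
    """Cycle in which the single-cycle loop sits (length of the longest path)."""
    return max(path_length(node) for node in FEEDS)


def prolog_length():
    """Number of cycles before the loop cycle."""
    return loop_cycle() - 1
```

           The branch path (1 + 6 = 7 cycles) is shorter than the load,
           multiply, add path, so it does not set the length.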




                               Scheduling Table
                                 PROLOG                      LOOP
      Cycle 1      2      3      4      5      6      7      8
      .L1                                                    add
      .L2
      .S1
      .S2                 B      *      *      *      *      *
      .M1                                      mpy    *      *
      .M2
      .D1   ldh m  *      *      *      *      *      *      *
      .D2   ldh n  *      *      *      *      *      *      *

     Where do we want to branch?                            Branch here

                               Scheduling Table
                                 PROLOG                      LOOP
      Cycle 1      2      3      4      5      6      7      8
      .L1                                                    add
      .L2          sub    *      *      *      *      *      *
      .S1
      .S2                 B      *      *      *      *      *
      .M1                                      mpy    *      *
      .M2
      .D1   ldh m  *      *      *      *      *      *      *
      .D2   ldh n  *      *      *      *      *      *      *
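
           The first-issue cycles in the table can be reproduced from the
           cycle counts alone. This sketch assumes the data chain is
           scheduled as early as possible and the branch and counter are
           placed backwards from the loop cycle; that placement rule is an
           interpretation of the table, and all names are hypothetical.

```python
LATENCY = {"ldh": 5, "mpy": 2, "add": 1, "sub": 1, "b": 6}


def first_issue_cycles(latency):
    """Return ({instruction: first issue cycle}, loop cycle); cycle 1 = first load."""
    start = {"ldh": 1}
    start["mpy"] = start["ldh"] + latency["ldh"]   # needs both loaded values
    start["add"] = start["mpy"] + latency["mpy"]   # needs the product
    loop = start["add"]                            # last instruction to start
    # The branch must complete as the loop cycle ends, so it is issued
    # latency["b"] - 1 cycles before the loop cycle.
    start["b"] = loop - latency["b"] + 1
    start["sub"] = start["b"] - latency["sub"]     # counter ready for the branch
    return start, loop


def table_row(first, loop):
    """One table row: blank before the first issue, then marked every cycle."""
    return ["x" if cycle >= first else "" for cycle in range(1, loop + 1)]
```

           Each row is marked in every cycle from its first issue onwards
           because a new loop iteration starts every cycle.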



      Translate Scheduling Table to 'C6x Code
           * Prolog: cycles C1 to C7 of the scheduling table
           C1           ldh   .D1    *A1++,A2
            ||          ldh   .D2    *B1++,B2

           C2           ldh   .D1    *A1++,A2
            ||          ldh   .D2    *B1++,B2
            || [B0]     sub   .L2    B0,1,B0

           C3           ldh   .D1    *A1++,A2
            ||          ldh   .D2    *B1++,B2
            || [B0]     sub   .L2    B0,1,B0
            || [B0]     B     .S2    loop

           C4           ldh   .D1    *A1++,A2
            ||          ldh   .D2    *B1++,B2
            || [B0]     sub   .L2    B0,1,B0
            || [B0]     B     .S2    loop

           C5           ldh   .D1    *A1++,A2
            ||          ldh   .D2    *B1++,B2
            || [B0]     sub   .L2    B0,1,B0
            || [B0]     B     .S2    loop

           C6           ldh   .D1    *A1++,A2
            ||          ldh   .D2    *B1++,B2
            || [B0]     sub   .L2    B0,1,B0
            || [B0]     B     .S2    loop
            ||          mpy   .M1x   A2,B2,A3

           C7           ldh   .D1    *A1++,A2
            ||          ldh   .D2    *B1++,B2
            || [B0]     sub   .L2    B0,1,B0
            || [B0]     B     .S2    loop
            ||          mpy   .M1x   A2,B2,A3

           * Single-Cycle Loop
           loop:        ldh   .D1    *A1++,A2
            ||          ldh   .D2    *B1++,B2
            || [B0]     sub   .L2    B0,1,B0
            || [B0]     B     .S2    loop
            ||          mpy   .M1x   A2,B2,A3
            ||          add   .L1    A4,A3,A4

                                     See Chapter 14 for practical examples
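
           The translation from table to code is mechanical: every cycle
           becomes one execute packet holding each instruction whose first
           issue cycle has been reached. A rough Python sketch, with the
           first-issue cycles taken from the scheduling table and all
           function names hypothetical:

```python
# (instruction and unit, first issue cycle) as read from the scheduling table.
FIRST_ISSUE = [
    ("ldh .D1", 1),
    ("ldh .D2", 1),
    ("[B0] sub .L2", 2),
    ("[B0] b .S2", 3),
    ("mpy .M1x", 6),
    ("add .L1", 8),
]
LOOP_CYCLE = 8


def execute_packets(first_issue, loop_cycle):
    """Return one list of instructions per cycle, cycle 1 first."""
    return [[name for name, first in first_issue if first <= cycle]
            for cycle in range(1, loop_cycle + 1)]


def format_packet(label, packet):
    """Render a packet with '||' marking the instructions run in parallel."""
    lines = [f"{label:<6}{packet[0]}"]
    lines += [f"{'||':<6}{instr}" for instr in packet[1:]]
    return "\n".join(lines)
```

           The last packet (cycle 8) is the single-cycle loop; the seven
           packets before it form the prolog.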

      Translate Scheduling Table to 'C6x Code
          With this method we have only created the prolog
           and the loop.
          Therefore if the filter has 100 taps, we need to
           repeat the loop 100 times, as we need 100 adds.
          This means that we are performing 107 loads. The
           7 extra loads may lead to illegal memory
           accesses.
                                 PROLOG                      LOOP
      Cycle 1      2      3      4      5      6      7      8
      .L1                                                    add
      .L2          sub    sub    sub    sub    sub    sub    sub
      .S1
      .S2                 B      B      B      B      B      B
      .M1                                      mpy    mpy    mpy
      .M2
      .D1   ldh m  ldh m  ldh m  ldh m  ldh m  ldh m  ldh m  ldh m
      .D2   ldh n  ldh n  ldh n  ldh n  ldh n  ldh n  ldh n  ldh n
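
           The 107 figure follows directly from the table: each
           instruction runs once per prolog cycle in which it appears, plus
           once per pass through the loop. A small sketch of that count,
           with illustrative names and the first-issue cycles read from the
           table above:

```python
# First issue cycle of each instruction in the scheduling table.
FIRST_ISSUE = {"ldh": 1, "sub": 2, "b": 3, "mpy": 6, "add": 8}
LOOP_CYCLE = 8


def executions(taps, first_issue=FIRST_ISSUE, loop_cycle=LOOP_CYCLE):
    """Count how often each instruction runs with a prolog and no epilog."""
    prolog_cycles = loop_cycle - 1
    counts = {}
    for name, first in first_issue.items():
        in_prolog = max(0, prolog_cycles - first + 1)  # prolog cycles it occupies
        counts[name] = in_prolog + taps                # plus one per loop pass
    return counts
```

           For 100 taps the count for ldh is 107 per data pointer, against
           100 for add, which is where the 7 extra loads come from.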
                         Solution: The Epilog

                  We only created the
                  Prolog and Loop …
                  What about the Epilog?


                          The Epilog can be extracted from
                           your results as described below.


                                  See example in the next slide.


                           Dot-Product with Epilog
           Prolog                       Loop                      Epilog
           p1:      ldh||ldh            loop:     ldh             e1:      mpy
                                         ||       ldh              ||      add
           p2:      ldh||ldh             ||       mpy
            ||      []sub                ||       add             e2:      mpy
                                         ||       [] sub           ||      add
           p3:      ldh||ldh             ||       [] b
            ||      []sub                                         e3:      mpy
            ||      []b                                            ||      add
           p4:      ldh||ldh
            ||      []sub                                         e4:      mpy
            ||      []b                                            ||      add
           p5:      ldh||ldh
            ||      []sub
            ||      []b
           p6:      ldh||ldh
            ||      mpy
            ||      []sub
            ||      []b
           p7:      ldh||ldh
            ||      mpy
            ||      []sub
            ||      []b

           Epilog = Loop - Prolog, and there is no sub or b in the epilog.

Chapter 12, Slide 76                             Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2004
                           Dot-Product with Epilog
                  Prolog                 Loop                                     Epilog
           p1:         ldh||ldh     loop:        ldh                              e1:      mpy
                                            ||   ldh                               ||      add
           p2:         ldh||ldh             ||   mpy
            ||         []sub                ||   add                              e2:      mpy
                                            ||   [] sub                            ||      add
           p3:         ldh||ldh             ||   [] b
            ||         []sub                                                      e3:      mpy
            ||         []b                                                         ||      add
           p4:         ldh||ldh                                                   e4:      mpy
            ||         []sub                                                       ||      add
            ||         []b
                                                                                  e5:      mpy
           p5:         ldh||ldh                                                    ||      add
            ||         []sub      Epilog = Loop - Prolog
            ||         []b
           p6:         ldh||ldh   And there is no sub or
            ||         mpy
            ||         []sub         b in the epilog
            ||         []b
           p7:         ldh||ldh
            ||         mpy
            ||         []sub
            ||         []b

Chapter 12, Slide 77                             Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2004
                           Dot-Product with Epilog
                  Prolog                 Loop                                     Epilog
           p1:         ldh||ldh     loop:        ldh                              e1:      mpy
                                            ||   ldh                               ||      add
           p2:         ldh||ldh             ||   mpy
            ||         []sub                ||   add                              e2:      mpy
                                            ||   [] sub                            ||      add
           p3:         ldh||ldh             ||   [] b
            ||         []sub                                                      e3:      mpy
            ||         []b                                                         ||      add
           p4:         ldh||ldh                                                   e4:      mpy
            ||         []sub                                                       ||      add
            ||         []b
                                                                                  e5:      mpy
           p5:         ldh||ldh                                                    ||      add
            ||         []sub      Epilog = Loop - Prolog
            ||         []b                                                        e6: add
           p6:         ldh||ldh   And there is no sub or
            ||         mpy
            ||         []sub         b in the epilog
            ||         []b
           p7:         ldh||ldh
            ||         mpy
            ||         []sub
            ||         []b

Chapter 12, Slide 78                             Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2004
                           Dot-Product with Epilog
                  Prolog                 Loop                      Epilog
           p1:         ldh||ldh     loop:        ldh               e1:      mpy
                                            ||   ldh                ||      add
           p2:         ldh||ldh             ||   mpy
            ||         []sub                ||   add               e2:      mpy
                                            ||   [] sub             ||      add
           p3:         ldh||ldh             ||   [] b
            ||         []sub                                       e3:      mpy
            ||         []b                                          ||      add
           p4:         ldh||ldh                                    e4:      mpy
            ||         []sub                                        ||      add
            ||         []b
           p5:         ldh||ldh                                    e5:      mpy
            ||         []sub                                        ||      add
            ||         []b                                         e6:      add
           p6:         ldh||ldh                                    e7:      add
            ||         mpy
            ||         []sub
            ||         []b
           p7:         ldh||ldh
            ||         mpy
            ||         []sub
            ||         []b

           Epilog = Loop - Prolog, and there is no sub or b in the epilog.

Chapter 12, Slide 79
   Scheduling Table: Prolog, Loop and Epilog


            |----------- Prologue ------------|Loop |----------- Epilogue ------------|
    Cycle     1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
    Unit
    .D1      LDH  LDH  LDH  LDH  LDH  LDH  LDH  LDH
    .D2      LDH  LDH  LDH  LDH  LDH  LDH  LDH  LDH
    .L1                                         ADD  ADD  ADD  ADD  ADD  ADD  ADD  ADD
    .L2           SUB  SUB  SUB  SUB  SUB  SUB  SUB
    .S1
    .S2                B    B    B    B    B    B
    .M1                               MPY  MPY  MPY  MPY  MPY  MPY  MPY  MPY
    .M2




Chapter 12, Slide 80
                                   Loop only!
                      Can the code be written as a loop only (i.e.
                       no prolog or epilog)?

                                       Yes!




Chapter 12, Slide 81
                         Loop only!
(i) Remove all instructions except the branch.

                    PROLOG (cycles 1-7), LOOP (cycle 8)
        Cycle 1      2      3      4      5      6      7      8
        .L1                                                    add
        .L2          sub    *      *      *      *      *      *
        .S1
        .S2                 B      *      *      *      *      *
        .M1                                      mpy    *      *
        .M2
        .D1   ldh m  *      *      *      *      *      *      *
        .D2   ldh n  *      *      *      *      *      *      *

        (* = the instruction is issued again in this cycle)




Chapter 12, Slide 82
                         Loop only!
(i) Remove all instructions except the branch.

                    PROLOG (cycles 1-5), LOOP (cycle 6)
        Cycle 1      2      3      4      5      6
        .L1                                      add
        .L2                                      sub
        .S1
        .S2   B      B      B      B      B      B
        .M1                                      mpy
        .M2
        .D1                                      ldh m
        .D2                                      ldh n




Chapter 12, Slide 83
                             Loop only!
(i)  Remove all instructions except the branch.
(ii) Zero input registers, accumulator and product registers.

                    PROLOG (cycles 1-5), LOOP (cycle 6)
        Cycle 1          2          3          4          5          6
        .L1                         zero a     zero sum              add
        .L2                                                          sub
        .S1                         zero b     zero prod
        .S2   B          B          B          B          B          B
        .M1                                                          mpy
        .M2
        .D1                                                          ldh m
        .D2                                                          ldh n




Chapter 12, Slide 84
                             Loop only!
(i)   Remove all instructions except the branch.
(ii)  Zero input registers, accumulator and product registers.
(iii) Adjust the number of subtractions.

                    PROLOG (cycles 1-5), LOOP (cycle 6)
        Cycle 1          2          3          4          5          6
        .L1                         zero a     zero sum              add
        .L2                                               sub        sub
        .S1                         zero b     zero prod
        .S2   B          B          B          B          B          B
        .M1                                                          mpy
        .M2
        .D1                                                          ldh m
        .D2                                                          ldh n




Chapter 12, Slide 85
                       Loop Only - Final Code
                        ; Overhead
                                b        loop

                                b        loop

                                b        loop
                        ||      zero m              ;input register
                        ||      zero n              ;input register

                                b        loop
                        ||      zero prod           ;product register
                        ||      zero sum            ;accumulator

                                b        loop
                        ||      sub                 ;modify count register

                        ; Loop
                        loop    ldh
                        ||      ldh
                        ||      mpy
                        ||      add
                        || []   sub
                        || []   b        loop
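
A minimal C sketch of why the loop-only version still produces the correct sum. It assumes the latencies used in the scheduling table (a loaded value is usable 5 cycles after the ldh is issued, a product 2 cycles after the mpy). The name dotp_loop_only and the small ring buffers that stand in for values in flight are illustrative only. One deliberate difference: the real loop keeps loading past the end of the arrays, whereas the sketch loads zeros there so that the C stays well defined.

```c
#include <assert.h>
#include <stddef.h>

#define LOAD_LATENCY 5   /* assumed: ldh result usable 5 cycles after issue */
#define MPY_LATENCY  2   /* assumed: mpy result usable 2 cycles after issue */

/* Cycle-by-cycle model of the loop-only dot product (hypothetical helper). */
int dotp_loop_only(const short *m, const short *n, size_t count)
{
    /* Values still in flight; zeroed, as the overhead code zeroes m, n and prod. */
    short m_pipe[LOAD_LATENCY] = {0};
    short n_pipe[LOAD_LATENCY] = {0};
    int   prod_pipe[MPY_LATENCY] = {0};
    int   sum = 0;                        /* accumulator, zeroed in the overhead */

    assert(m != NULL && n != NULL);

    /* The loop packet runs count + 7 times; the 7 extra passes fill and drain. */
    for (size_t cycle = 0; cycle < count + LOAD_LATENCY + MPY_LATENCY; cycle++) {
        size_t ld = cycle % LOAD_LATENCY;
        size_t mp = cycle % MPY_LATENCY;

        sum += prod_pipe[mp];                         /* add: product issued 2 cycles ago */
        prod_pipe[mp] = m_pipe[ld] * n_pipe[ld];      /* mpy: values loaded 5 cycles ago  */
        m_pipe[ld] = (cycle < count) ? m[cycle] : 0;  /* ldh m (zero past the end)        */
        n_pipe[ld] = (cycle < count) ? n[cycle] : 0;  /* ldh n (zero past the end)        */
    }
    return sum;
}
```

The early passes multiply and accumulate zeros, which is why the input, product and accumulator registers must be cleared before the loop starts.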



Chapter 12, Slide 86
                                Laboratory exercise
                      Software pipeline using the LDW
                       version of the Dot-Product routine:
                        (1)    Write linear assembly.
                        (2)    Create dependency graph.
                        (3)    Complete scheduling table.
                        (4)    Transfer table to 'C6000 code.
                      To Epilog or Not to Epilog?
                             Determine if your pipelined code is more
                              efficient with or without prolog and
                              epilog.

Chapter 12, Slide 87
         Lab Solution: Step 1 - Linear Assembly

                       ; for (i=0; i < count; i++)
                       ; prod = m[i] * n[i];
                       ; sum += prod;   *** count becomes 20 ***

                       loop:       ldw         *p_m++, m
                                   ldw         *p_n++, n
                                   mpy         m, n, prod
                                   mpyh        m, n, prodh
                                   add         prod, sum, sum
                                   add         prodh, sumh, sumh
                         [count]   sub         count, 1, count
                         [count]   b           loop

                       ; Outside of Loop
                                   add         sum, sumh, sum
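
A rough C view of what the LDW version computes, offered as a sketch rather than as the lab solution. Each pass of the loop handles the two 16-bit values that one 32-bit load would fetch, keeps two running sums, and combines them after the loop. The function name is hypothetical, and which element maps to mpy and which to mpyh depends on how the halves sit in the loaded word, so that pairing is an assumption here.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product taking two 16-bit elements per pass (hypothetical helper). */
int dotp_two_per_pass(const short *m, const short *n, size_t count)
{
    int sum  = 0;   /* accumulates the "mpy" products  */
    int sumh = 0;   /* accumulates the "mpyh" products */

    assert(count % 2 == 0);   /* assumed: an even number of elements */

    for (size_t i = 0; i < count; i += 2) {
        sum  += m[i]     * n[i];       /* mpy  followed by add */
        sumh += m[i + 1] * n[i + 1];   /* mpyh followed by add */
    }
    return sum + sumh;   /* the single add outside of the loop */
}
```

Because two elements are consumed per pass, the loop runs half as many times as the LDH version, which is why the loop count changes.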

Chapter 12, Slide 88
                         Step 2 - Dependency Graph
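      [Figure: dependency graph; edge labels are latencies in cycles]
        A side:  LDW m (.D1)  --5-->  MPY prod (.M1x)    --2-->  ADD sum (.L1),   loop-carried path of 1
        B side:  LDW n (.D2)  --5-->  MPYH prodh (.M2x)  --2-->  ADD sumh (.L2),  loop-carried path of 1
        MPY and MPYH both read m and n, so each uses a cross path.
        Loop control:  SUB count (.S2), loop-carried path of 1  --1-->  B loop (.S1), latency 6
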

Chapter 12, Slide 89
                           Step 2 - Functional Units
                          .L1   sum            .L2   sumh
                          .M1   prod           .M2   prodh
                          .D1   m              .D2   n
                          .S1   loop           .S2   count
                          x1    .M1x           x2    .M2x

          Do we still have enough functional units to code this
          algorithm in a single-cycle loop?   Yes!
Chapter 12, Slide 90
                             Step 2 - Registers
          Register File A        #      #              Register File B
                                A0     B0                         count
                                A1     B1
                                A2     B2
                                A3     B3              return address
             &a/ret value       A4     B4                            &x
                        a       A5     B5                              x
              count/prod        A6     B6                        prodh
                       sum      A7     B7                         sumh

Chapter 12, Slide 91
                            Step 3 - Schedule Algorithm
                                             PROLOG                                                LOOP
                        1       2      3        4           5              6              7            8
     .L1                                                                                              add
     .L2                                                                                                add

     .S1                              B1       B2           B3             B4             B5             B6

     .S2                       sub1   sub2     sub3       sub4           sub5           sub6           sub7

    .M1                                                                  mpy            mpy2           mpy3

    .M2                                                                 mpyh           mpyh2          mpyh3

    .D1                ldw m   ldw2   ldw3     ldw4       ldw5           ldw6           ldw7           ldw8

    .D2                ldw n   ldw2   ldw3     ldw4       ldw5           ldw6           ldw7           ldw8




Chapter 12, Slide 92
                             Step 4 - 'C6000 Code
                      The complete code is available in the
                       following location:
                         \Links\DotP LDW.pdf




Chapter 12, Slide 93
                        Why Conditional Subtract?
                       loop:       ldh      *p_m++, m
                                   ldh      *p_n++, n
                                   mpy      m, n, prod
                                   add      prod, sum, sum

                         [count]   sub      count, 1, count
                         [count]   b        loop

          (B) = branch taken, X = branch not taken

          Without Cond. Subtract:            With Cond. Subtract:
          loop (count = 1)      (B)          loop (count = 1)      (B)
          loop (count = 0)       X           loop (count = 0)       X
          loop (count = -1)     (B)          loop (count = 0)       X
          loop (count = -2)     (B)          loop (count = 0)       X
          loop (count = -3)     (B)          loop (count = 0)       X
          loop (count = -4)     (B)          loop (count = 0)       X
          Loop never ends                    Loop ends
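
The comparison can be reproduced with a small C model, given here as a sketch under stated assumptions: the branch has 5 delay slots, the loop packet executes once in each of them, and both the branch and the subtract read count at the start of the packet. The function name is hypothetical. It starts at the packet that sees count = 1 (the last intended branch) and counts how many further branches are issued while that branch is still in flight.

```c
#include <assert.h>
#include <stdbool.h>

#define BRANCH_DELAY_SLOTS 5   /* assumed number of delay slots after a branch */

/* Counts branches issued after the last intended one (hypothetical helper). */
int unintended_branches(bool conditional_sub)
{
    int count = 1;        /* the packet that issues the last intended branch */
    int unintended = 0;

    for (int packet = 0; packet <= BRANCH_DELAY_SLOTS; packet++) {
        bool branch_taken = (count != 0);       /* [count] b loop */

        if (packet > 0 && branch_taken) {
            unintended++;                       /* a branch that should not exist */
        }
        if (!conditional_sub || count != 0) {
            count--;                            /* sub, or [count] sub */
        }
    }
    return unintended;
}
```

With the unconditional subtract the model reports four extra branches (count = -1 to -4), each of which sends execution back to the loop; with the conditional subtract it reports none.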
Chapter 12, Slide 94
             Chapter 12
       Software Optimisation
Part 3 - Pipelining Multi-cycle Loops
                                  Objectives
                      Software pipeline the weighted vector
                       sum algorithm.
                      Describe four iteration interval
                       constraints.
                      Calculate minimum iteration interval.
                      Convert and optimize the dot-product
                       code to floating point code.




Chapter 12, Slide 96
               What Requires Multi-Cycle Loops?
                      Resource Limitations
                             Running out of resources
                              (functional units, registers, bus accesses).
                             The Weighted Vector Sum example needs three .D
                              operations per iteration, but there are only
                              two .D units.
                      Live Too Long
                             The minimum iteration interval is set by the length of
                              time a variable is required to exist.
                      Loop Carry Path
                             Latency is required between loop iterations.
                             Demonstrated by the FIR and SP floating-point
                              dot-product examples.
                      Functional Unit Latency > 1
                             A few 'C67x instructions require functional units for 2 or 4
                              cycles rather than one. This defines a minimum iteration
                              interval.


Chapter 12, Slide 97
               What Requires Multi-Cycle Loops?

                       Four reasons:
                       1.   Resource Limitations.
                       2.   Live Too Long.
                       3.   Loop Carry Path.
                       4.   Double Precision (Functional Unit Latency, FUL > 1).


        Use these four constraints to determine the smallest
        Iteration Interval (Minimum Iteration Interval or
        MII).
Chapter 12, Slide 98
  Resource Limitation: Weighted Vector Sum
              Step 1 - C Code
   void WVS(short *c, short *b, short *a, short r, short n)
   { int i;
        for (i=0; i < n; i++)
        {
          c[i] = a[i] + ((r * b[i]) >> 15);
        }
   }

      a, b:  input arrays
         c:  output array
         n:  length of arrays
         r:  weighting factor

      c[i]: Store (.D)      a[i]: Load (.D)      b[i]: Load (.D)
                       Requires 3 .D units
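
In C, + binds more tightly than >>, so the scaled product needs its own parentheses if the C code is to match the MPY, SHR, ADD order of the linear assembly. The sketch below makes that order explicit and includes a second function showing the value obtained when the shift is applied after the addition. Both function names are hypothetical, and reading r as a Q15 weight is an assumption based on the shift by 15.

```c
#include <assert.h>
#include <stddef.h>

/* Weighted vector sum with the scaling step made explicit (sketch). */
void wvs_explicit(short *c, const short *b, const short *a, short r, size_t n)
{
    assert(c != NULL && b != NULL && a != NULL);

    for (size_t i = 0; i < n; i++) {
        /* MPY then SHR. Right-shifting a negative int is implementation-defined
           in C; most compilers perform an arithmetic shift. */
        int scaled = (r * b[i]) >> 15;
        c[i] = (short)(a[i] + scaled);   /* ADD then STH */
    }
}

/* Value obtained when the shift is applied after the addition instead. */
int shift_after_add(short a, short r, short b)
{
    return (a + r * b) >> 15;
}
```

For a = 1000, r = 16384 (0.5 in Q15) and b = 2000, wvs_explicit stores 2000, whereas shift_after_add returns 1000.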


Chapter 12, Slide 99
                        Software Pipelining Procedure
           1.       Write algorithm in C code & verify.
           2.       Write 'C6x Linear Assembly code.
           3.       Create dependency graph.
           4.       Allocate registers.
           5.       Create scheduling table.
           6.       Translate scheduling table to 'C6x code.




Chapter 12, Slide 100
                         Step 2 - 'C6x Linear Code
                        c[i] = a[i] + ((r * b[i]) >> 15);

                          loop:     LDH   *a++, ai
                                    LDH   *b++, bi
                                    MPY   r, bi, prod
                                    SHR   prod, 15, sum
                                    ADD   ai, sum, ci
                                    STH   ci, *c++


                             [i]    SUB   i, 1, i
                             [i]    B     loop



                The full code is available here:
                 \Links\Wvs.sa
Chapter 12, Slide 101
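                         Step 3 - Dependency Graph
      [Figure: dependency graph; edge labels are latencies in cycles]
        A side:  LDH ai (.D1)  --5-->  ADD ci (.L1)  --1-->  STH *c++ (.D1)
        B side:  LDH bi (.D2)  --5-->  MPY prod (.M2), which also reads r
                 MPY prod (.M2)  --2-->  SHR sum (.S2), shift by 15  --1-->  ADD ci (.L1)
        Loop control:  SUB i (.L2), loop-carried path of 1  --1-->  B loop (.S1), latency 6
Chapter 12, Slide 102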
                   Step 4 - Allocate Functional Units
                       .L1   ci               .L2   i
                       .M1                    .M2   prod
                       .D1   ai, *c           .D2   bi
                       .S1   loop             .S2   sum

          This requires 3 .D units, therefore it cannot fit into a
          single-cycle loop.
          This may fit into a 2-cycle loop if there are no other constraints.
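
The resource argument above amounts to a ceiling division: the number of operations that need a given unit type, divided by the number of units of that type, rounded up. A minimal sketch follows; the function name is hypothetical, and it covers the resource constraint only, not the other three.

```c
#include <assert.h>

/* Lower bound on the iteration interval imposed by one unit type (sketch). */
int resource_bound(int ops_per_iteration, int units_available)
{
    assert(ops_per_iteration >= 0 && units_available > 0);

    /* Integer ceiling of ops_per_iteration / units_available. */
    return (ops_per_iteration + units_available - 1) / units_available;
}
```

For the weighted vector sum, resource_bound(3, 2) gives 2: two loads and one store share the two .D units, so the loop needs at least two cycles per iteration.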
Chapter 12, Slide 103
                                2 Cycle Loop
              loop:    Cycle 1:   .L1  .L2  .S1  .S2  .M1  .M2  .D1  .D2
                       Cycle 2:   .L1  .L2  .S1  .S2  .M1  .M2  .D1  .D2

              2 cycles per loop iteration.

              Iteration Interval (II): # cycles per loop iteration.
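
The iteration interval also gives a quick estimate of the total cycle count of a pipelined loop. The sketch below assumes that a new iteration starts every II cycles and that one iteration takes schedule_length cycles from its first to its last instruction; setup code is excluded, and the function name is hypothetical.

```c
#include <assert.h>

/* Cycles needed for `iterations` overlapped loop iterations (sketch). */
long pipelined_cycles(long iterations, long ii, long schedule_length)
{
    assert(iterations >= 1 && ii >= 1 && schedule_length >= 1);

    /* The last iteration starts (iterations - 1) * ii cycles after the first
       and then needs schedule_length cycles to complete. */
    return (iterations - 1) * ii + schedule_length;
}
```

As a check against the dot-product scheduling table earlier in this chapter (8 iterations, II = 1, LDH in cycle 1 and ADD in cycle 8 of each iteration), pipelined_cycles(8, 1, 8) gives the 15 cycles shown there.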
Chapter 12, Slide 104
                        Multi-Cycle Loop Iterations
                                  loop 1    loop 2    loop 3
                                  cycle 1   cycle 3   cycle 5
                           .D1    ldh       ldh       ldh
                           .D2    ldh       ldh       ldh
                           .S2    shr       shr       shr
                           .M1    mpy       mpy       mpy
                           .M2
                           .L1    add       add       add
                           .L2    sub       sub       sub
                           .S1    b         b         b
                                  cycle 2   cycle 4   cycle 6
                           .D1    sth       sth       sth
                           .D2
                           .S1
                           .S2
                           .M1
                           .M2
                           .L1
                           .L2
Chapter 12, Slide 105
                        Multi-Cycle Loop Iterations
                                  loop 1    loop 2              loop 3
                                  cycle 1   cycle 3             cycle 5
                           .D1    ldh         ldh                 ldh
                           .D2    ldh         ldh                 ldh
                            .S2   shr         shr                 shr
                           .M1    mpy         mpy                 mpy
                           .M2
                            .L1    add        add                  add
                            .L2    sub        sub                  sub
                            .S1     b          b                    b
                                  cycle 2   cycle 4             cycle 6
                           .D1     sth         sth                  sth
                           .D2
                            .S1
                            .S2
                           .M1
                           .M2
                            .L1
Chapter 12, Slide 109
                            .L2             Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2004
                        Multi-Cycle Loop Iterations
                                  loop 1    loop 2              loop 3
                                  cycle 1   cycle 3             cycle 5
                           .D1    ldh         ldh                 ldh
                           .D2    ldh         ldh                 ldh
                            .S2   shr         shr                 shr
                           .M1    mpy         mpy                 mpy
                           .M2
                            .L1    add        add                  add
                            .L2    sub        sub                  sub
                            .S1     b          b                    b
                                  cycle 2   cycle 4             cycle 6
                           .D1     sth         sth                  sth
                           .D2
                            .S1
                            .S2
                           .M1
                           .M2
                            .L1
Chapter 12, Slide 110
                            .L2             Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2004
                        How long is the Prolog?
  What is the length of the longest path?   10
  How many cycles per loop?                  2

  [Figure: dependency graph with nodes ai, bi, prod, sum, ci and *c++,
   each edge labelled with its latency in cycles.]
                Step 5 - Create Scheduling Chart (0)
       Unit\cycle       0     2       4                   6                    8
          .L1
          .L2
          .S1
          .S2
         .M1
         .M2
         .D1
         .D2
       Unit\cycle       1     3       5                   7                    9
          .L1
          .L2
          .S1
          .S2
         .M1
         .M2
         .D1
         .D2
                   Step 5 - Create Scheduling Chart
       Unit\cycle   0          2       4                   6                    8
          .L1
          .L2
          .S1
          .S2
         .M1
         .M2
         .D1
         .D2      LDH bi       *       *                   *                    *
       Unit\cycle   1          3       5                   7                    9
          .L1
          .L2
          .S1
          .S2
         .M1
         .M2
         .D1
         .D2
                   Step 5 - Create Scheduling Chart
       Unit\cycle   0          2       4                   6                    8
          .L1
          .L2
          .S1
          .S2
         .M1
         .M2
         .D1
         .D2      LDH bi       *       *                   *                    *
       Unit\cycle   1          3       5                   7                    9
          .L1
          .L2
          .S1
          .S2
         .M1
         .M2                       MPY mi                  *                    *
         .D1
         .D2
                   Step 5 - Create Scheduling Chart
       Unit\cycle   0          2       4                   6                    8
          .L1
          .L2
          .S1
          .S2
         .M1
         .M2
         .D1
         .D2      LDH bi       *       *                   *                    *
       Unit\cycle   1          3       5                   7                    9
          .L1
          .L2
          .S1
          .S2                                      SHR sum                      *
         .M1
         .M2                       MPY mi                  *                    *
         .D1
         .D2
                   Step 5 - Create Scheduling Chart
       Unit\cycle   0          2       4                   6               8
          .L1                                                            ADD ci
          .L2
          .S1
          .S2
         .M1
         .M2
         .D1
         .D2      LDH bi       *       *                   *                    *
       Unit\cycle   1          3       5                   7                    9
          .L1
          .L2
          .S1
          .S2                                      SHR sum                      *
         .M1
         .M2                       MPY mi                  *                    *
         .D1
         .D2
                   Step 5 - Create Scheduling Chart
       Unit\cycle   0          2       4                   6               8
          .L1                                                            ADD ci
          .L2
          .S1
          .S2
         .M1
         .M2
         .D1
         .D2      LDH bi       *       *                   *                    *
       Unit\cycle   1          3       5                   7                    9
          .L1
          .L2
          .S1
          .S2                                      SHR sum                      *
         .M1
         .M2                       MPY mi                  *              *
         .D1                                                           STH c[i]
         .D2
                   Step 5 - Create Scheduling Chart
       Unit\cycle   0          2       4                   6               8
          .L1                                                            ADD ci
          .L2
          .S1
          .S2
         .M1
         .M2
         .D1
         .D2      LDH bi       *       *                   *                    *
       Unit\cycle   1          3       5                   7                    9
          .L1
          .L2
          .S1
          .S2                                      SHR sum                      *
         .M1
         .M2                        MPY mi                 *              *
         .D1                 LDH ai   *                    *         * / STH c[i]
         .D2
                            Conflict Solution
         Here are two possibilities ...
         Which is better?

       Unit\cycle   0         2         4         6         8
         .D1                LDH ai
         .D2      LDH bi      *         *         *         *
       Unit\cycle   1         3         5         7         9
          .L1
          .L2
          .S1
          .S2                                   SHR sum     *
         .M1
         .M2                          MPY mi      *         *
         .D1                LDH ai      *         *       * / STH c[i]
         .D2                LDH ai
                             Conflict Solution
         Here are two possibilities ...
         Which is better?
         Move the LDH to cycle 2.
         (so you don't have to go back and recheck crosspaths)

       Unit\cycle   0         2         4         6         8
         .D1                LDH ai
         .D2      LDH bi      *         *         *         *
       Unit\cycle   1         3         5         7         9
          .L1
          .L2
          .S1
          .S2                                   SHR sum     *
         .M1
         .M2                          MPY mi      *         *
         .D1                LDH ai      *         *       * / STH c[i]
         .D2                LDH ai
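The .D1 conflict can also be checked mechanically. The following is only an illustrative Python sketch, not part of the TI tools: the helper name find_conflicts is hypothetical, and the issue cycles are assumed to be the ones read off the chart above. With a 2 cycle loop every instruction repeats every two cycles, so two instructions clash when they use the same unit and their issue cycles leave the same remainder modulo 2.

```python
from collections import defaultdict

def find_conflicts(schedule, ii):
    """Return {(unit, kernel slot): [instructions]} for slots claimed more than once.

    schedule maps an instruction name to (functional unit, issue cycle).
    ii is the iteration interval (cycles per loop).
    """
    slots = defaultdict(list)
    for name, (unit, cycle) in schedule.items():
        slots[(unit, cycle % ii)].append(name)
    return {key: names for key, names in slots.items() if len(names) > 1}

II = 2  # two cycles per loop, as on the chart

# Assumed issue cycles, read off the chart; LDH ai is first tried at cycle 3.
first_try = {
    "LDH bi":   (".D2", 0),
    "LDH ai":   (".D1", 3),
    "MPY mi":   (".M2", 5),
    "SHR sum":  (".S2", 7),
    "ADD ci":   (".L1", 8),
    "STH c[i]": (".D1", 9),
}
conflicts = find_conflicts(first_try, II)
print(conflicts)  # LDH ai and STH c[i] both land on .D1 in the odd slot

# Option chosen on the slide: issue LDH ai one cycle earlier, still on .D1.
moved = {**first_try, "LDH ai": (".D1", 2)}
print(find_conflicts(moved, II))
```

Moving the load to an even cycle keeps it on .D1, so the A-side and B-side register assignments (and therefore the crosspaths) do not need to be rechecked.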
                   Step 5 - Create Scheduling Chart
       Unit\cycle   0          2         4                   6               8
          .L1                                                              ADD ci
          .L2
          .S1
          .S2
         .M1
         .M2
         .D1                 LDH ai      *                   *                    *
         .D2      LDH bi       *         *                   *                    *
       Unit\cycle   1          3         5                   7                    9
          .L1
          .L2
          .S1
          .S2                                        SHR sum                      *
         .M1
         .M2                          MPY mi                 *              *
         .D1                                                              STH c[i]
         .D2
                   Step 5 - Create Scheduling Chart
       Unit\cycle   0          2         4                   6               8
          .L1                                                              ADD ci
          .L2
          .S1                          [i] B                 *                    *
          .S2
         .M1
         .M2
         .D1                 LDH ai      *                   *                    *
         .D2      LDH bi       *         *                   *                    *
       Unit\cycle   1          3         5                   7                    9
          .L1
          .L2
          .S1
          .S2                                        SHR sum                      *
         .M1
         .M2                          MPY mi                 *              *
         .D1                                                              STH c[i]
         .D2
                   Step 5 - Create Scheduling Chart
       Unit\cycle   0         2       4      6      8
          .L1                                     ADD ci
          .L2
          .S1                       [i] B    *      *
          .S2
         .M1
         .M2
         .D1              LDH ai      *      *      *
         .D2      LDH bi      *       *      *      *
       Unit\cycle   1         3       5      7      9
          .L1
          .L2            [i] SUB i    *      *      *
          .S1
          .S2                             SHR sum   *
         .M1
         .M2                       MPY mi    *      *
         .D1                                      STH c[i]
         .D2
                        2 Cycle Loop Kernel
       Unit\cycle   0         2       4      6      8
          .L1                                     ADD ci
          .L2
          .S1                       [i] B    *      *
          .S2
         .M1
         .M2
         .D1              LDH ai      *      *      *
         .D2      LDH bi      *       *      *      *
       Unit\cycle   1         3       5      7      9
          .L1
          .L2            [i] SUB i    *      *      *
          .S1
          .S2                             SHR sum   *
         .M1
         .M2                       MPY mi    *      *
         .D1                                      STH c[i]
         .D2
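As a rough check of the finished chart, the schedule can be folded into its 2 cycle kernel. This is a minimal Python sketch under stated assumptions: the function name fold_into_kernel is hypothetical, the issue cycles are those read off the chart, and the prolog is assumed to be the longest path minus one pass through the kernel.

```python
def fold_into_kernel(schedule, ii):
    """Group instructions by kernel slot (cycle % ii), earliest issue cycle first.

    Each entry is (unit, instruction, stage) where stage = cycle // ii.
    """
    kernel = {slot: [] for slot in range(ii)}
    by_cycle = sorted(schedule.items(), key=lambda item: item[1][1])
    for name, (unit, cycle) in by_cycle:
        kernel[cycle % ii].append((unit, name, cycle // ii))
    return kernel

II = 2             # cycles per loop
LONGEST_PATH = 10  # length of the longest dependency path, in cycles

# Assumed issue cycles, read off the final scheduling chart.
schedule = {
    "LDH bi":    (".D2", 0),
    "LDH ai":    (".D1", 2),
    "[i] SUB i": (".L2", 3),
    "[i] B":     (".S1", 4),
    "MPY mi":    (".M2", 5),
    "SHR sum":   (".S2", 7),
    "ADD ci":    (".L1", 8),
    "STH c[i]":  (".D1", 9),
}

kernel = fold_into_kernel(schedule, II)
prolog_cycles = LONGEST_PATH - II  # assumption: everything before the kernel

for slot, ops in kernel.items():
    listing = ", ".join(f"{name} {unit}" for unit, name, _ in ops)
    print(f"kernel cycle {slot}: {listing}")
print("prolog cycles:", prolog_cycles)
```

No unit appears twice in the same kernel cycle, which is what makes the two-cycle kernel legal.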
               What Requires Multi-Cycle Loops?

                    Four reasons:
                        1.   Resource Limitations.
                        2.   Live Too Long.
                        3.   Loop Carry Path.
                        4.   Double Precision (FUL > 1).
                              Live Too Long - Example
                            c = (a >> 5) + a

     0          1       2       3       4        5          6
   LDH ai      LDH     LDH     LDH     LDH    a0 valid

   [Figure: dependency graph. LDH ai (.D1) feeds SHR x (.S1) and ADD ci (.L1)
    after 5 cycles each; SHR x feeds ADD ci after 1 cycle.]
                            Live Too Long - Example
     0          1       2       3       4        5          6
   LDH ai      LDH     LDH     LDH     LDH    a0 valid     a1
                            Live Too Long - Example
     0          1       2       3       4        5          6
   LDH ai      LDH     LDH     LDH     LDH    a0 valid     a1
                                               SHR       x0 valid
                            Live Too Long - Example
     0          1       2       3       4        5          6
   LDH ai      LDH     LDH     LDH     LDH    a0 valid     a1
                                               SHR       x0 valid
                                                           ADD

              Oops, rather than adding a0 + x0 we got a1 + x0.

            Let's look at one solution ...
                   Live Too Long - 2 Cycle Solution
         0              2        4          6
       LDH ai           LDH     LDH      a0 valid

            1           3        5          7
                              a0 valid     a1

         With a 2 cycle loop, a0 is valid for 2 cycles.
                   Live Too Long - 2 Cycle Solution
         0              2        4          6
       LDH ai           LDH     LDH      a0 valid
                                         x0 valid

            1           3        5          7
                              a0 valid     a1
                               SHR       x0 valid

         Notice, a0 and x0 are both valid for 2 cycles,
         which is the length of the Iteration Interval.
         Adding them ...
                   Live Too Long - 2 Cycle Solution
         0              2        4          6
       LDH ai           LDH     LDH      a0 valid
                                         x0 valid
                                          ADD

            1           3        5          7
                              a0 valid     a1
                               SHR       x0 valid

         Works!
         But what's the drawback?
         2 cycle loop is slower.
         Here's a better solution ...
                   Live Too Long - 1 Cycle Solution
     0          1       2       3       4        5          6
   LDH ai      LDH     LDH     LDH     LDH    a0 valid     a1
                                               MV b      b valid
                                               SHR       x0 valid
                                                           ADD

           Using a temporary register solves this problem
           without increasing the Minimum Iteration Interval.

   [Figure: dependency graph. LDH ai (.D1) feeds MV b (.S2) and SHR x (.S1)
    after 5 cycles each; MV b and SHR x each feed ADD ci (.L1) after 1 cycle.]
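The effect of the temporary register can be reproduced with a small model. The sketch below is an assumption-laden Python illustration, not cycle-accurate C6000 behaviour: it assumes one loop iteration is issued per cycle, that a loaded value is readable 5 cycles after its LDH and is overwritten one cycle later by the next load, and that SHR and MV results are readable one cycle after issue. The names run_loop and reg_a are hypothetical.

```python
LOAD_LATENCY = 5  # assumed: LDH issued at cycle t is readable at cycle t + 5

def run_loop(a, use_copy):
    """Model the single-cycle loop c = (a >> 5) + a.

    Returns one ADD result per iteration. With use_copy=False the ADD reads
    the load register directly; with use_copy=True it reads the MV copy.
    """
    def reg_a(cycle):
        # Value held by the load destination register during `cycle`.
        return a[cycle - LOAD_LATENCY]

    results = []
    for k in range(len(a) - 1):      # stop early so reg_a(k + 6) stays in range
        x = reg_a(k + 5) >> 5        # SHR issues at k + 5, x valid at k + 6
        b = reg_a(k + 5)             # MV issues at k + 5, b valid at k + 6
        other = b if use_copy else reg_a(k + 6)  # ADD issues at k + 6
        results.append(x + other)
    return results

a = [320, 640, 960, 1280]
expected = [(value >> 5) + value for value in a[:-1]]

print(run_loop(a, use_copy=False))   # mixes a1 with x0, a2 with x1, ...
print(run_loop(a, use_copy=True))    # matches expected
print(expected)
```

The copy costs one extra .S unit slot per iteration but leaves the iteration interval at one cycle.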
               What Requires Multi-Cycle Loops?

                    Four reasons:
                        1.   Resource Limitations.
                        2.   Live Too Long.
                        3.   Loop Carry Path.
                        4.   Double Precision (FUL > 1).
                             Loop Carry Path
                  The loop carry path is a path which
                  feeds one variable from part of the
                  algorithm back to another.

                  e.g. Loop carry path = 3.

                  [Figure: two-node loop, MPY.M2 (p2) and STH.D1 (st_y0),
                   with edge latencies 2 and 1.]

                   Note: The loop carry path is not the code loop.
                        Loop Carry Path, e.g. IIR Filter

                                IIR Filter Example
                                y0 = a0*x0 + b1*y1




                                   IIR.SA

                               IIR Filter Example
                               y0 = a0*x0 + b1*y1


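                  IIR:   ldh   *a_1,   A1           ; A1 = feed-forward coefficient
                         ldh   *x1,    A3           ; A3 = input sample
                         ldh   *b_1,   B1           ; B1 = feedback coefficient
                         ldh   *y0,    B0           ; y1 is previous y0

                         mpy   A1, A3, prod1        ; prod1 = A1 * A3 (feed-forward term)
                         mpy   B1, B0, prod2        ; prod2 = B1 * B0 (feedback term)

                         add   prod1, prod2, prod2  ; prod2 = new output y0
                         sth   prod2, *y0           ; store y0 for the next iteration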



                    Loop Carry Path - IIR Example
     IIR Filter Loop
     y0 = a1*x1 + b1*y1

     Min Iteration Interval
     Resource = 2
     (need 3 .D units)
     Loop Carry Path = 9
      (9 = 5 + 2 + 1 + 1)
     therefore, MII = 9

     Can it be minimized?

     [Figure: dependency graph. x1 (LDH.D1) and A1 feed MPY.M1 (p1);
      B1 and y1 (LDH.D2, 5 cycles) feed MPY.M2 (p2, 2 cycles);
      p1 and p2 feed ADD.L1 (y0, 1 cycle); y0 feeds STH.D1 (st_y0, 1 cycle).]

     Result carries over from one iteration of the loop to the next.
    Loop Carry Path - IIR Example (Solution)
     IIR Filter Loop
     y0 = a1*x1 + b1*y1

     Min Iteration Interval
     Resource = 2
     (need 3 .D units)
     New Loop Carry Path = 3
      (3 = 2 + 1)
     therefore, MII = 3

     [Figure: the same dependency graph, with y0 from ADD.L1 fed straight
      back to MPY.M2 rather than through STH.D1 and LDH.D2.]

     Since y0 is stored in a CPU register, it can be used directly
     by MPY (after the first loop iteration).
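The arithmetic on these slides can be summarised as taking the larger of the resource bound and the loop carry path. The Python sketch below is an illustration under assumptions, not a TI tool: the function names are hypothetical, the per-iteration operation counts are read from the dependency graphs (for the IIR loop, two loads and one store on .D, two multiplies on .M, one add on .L), and two units of each type are assumed to be available.

```python
from math import ceil

UNITS_PER_TYPE = 2  # assumed: two each of .D, .M, .L and .S

def resource_bound(op_counts):
    """Cycles needed by the busiest unit type.

    op_counts maps a unit type to the number of operations per iteration.
    """
    return max(ceil(count / UNITS_PER_TYPE) for count in op_counts.values())

def minimum_iteration_interval(op_counts, loop_carry_path):
    """MII taken as the larger of the resource bound and the loop carry path."""
    return max(resource_bound(op_counts), loop_carry_path)

# Operation counts per iteration (assumed from the dependency graphs).
iir_ops = {".D": 3, ".M": 2, ".L": 1}
dot_ops = {".D": 2, ".M": 1, ".L": 1}

mii_iir_memory = minimum_iteration_interval(iir_ops, 5 + 2 + 1 + 1)  # via LDH/STH
mii_iir_register = minimum_iteration_interval(iir_ops, 2 + 1)        # y0 kept in a register
mii_dot_fixed = minimum_iteration_interval(dot_ops, 1)               # ADD, 1 cycle
mii_dot_float = minimum_iteration_interval(dot_ops, 4)               # ADDSP, 4 cycles

print(mii_iir_memory, mii_iir_register, mii_dot_fixed, mii_dot_float)
```

In the register version the loop carry path still exceeds the resource bound of 2, so the carry path remains the limiting factor.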
Reminder: Fixed-Point Dot-Product Example
     Is there a loop carry path in this example?
     Yes, but it's only “1”.

     Min Iteration Interval
     Resource = 1
     Loop Carry Path = 1
     therefore, MII = 1

     [Figure: dependency graph. LDH m (.D1) and LDH n (.D2), 5 cycles, feed
      MPY prod (.M1x); MPY, 2 cycles, feeds ADD sum (.L1); ADD feeds itself
      after 1 cycle.]

   For the fixed-point implementation, the Loop Carry
   Path was not taken into account because it is equal to 1.
                              Loop Carry Path
                       - IIR Example.
                       - Enhancing the IIR.
                       - Fixed-Point Dot-Product Example.
                       - Floating-Point Dot Product Example.
               Loop Carry Path due to FUL > 1
             Floating-Point Dot-Product Example
     Min Iteration Interval
     Resource = 1
     Loop Carry Path = 4
     therefore, MII = 4

     [Figure: dependency graph. LDW m (.D1) and LDW n (.D2), 5 cycles, feed
      MPYSP prod (.M1x); MPYSP, 4 cycles, feeds ADDSP sum (.L1); ADDSP feeds
      itself after 4 cycles.]
                                                    Unrolling the Loop

                 If the MII must be four cycles long, then
                 use all of them to calculate four results.

                 [Four copies of the dependency graph, one per unrolled iteration (k = 1 to 4):
                  LDW mk (.D1) and LDW nk (.D2) feed MPYSP prodk (.M1x), 4 cycles,
                  which feeds ADDSP sumk (.L1).]
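
The unrolling idea can be sketched at source level. This is a hedged Python illustration rather than the C6000 code the slides target: the function name `dot_product_unrolled4` is hypothetical, the input length is assumed to be a multiple of four, and the final combining step is an assumption added so the function returns a single value.

```python
def dot_product_unrolled4(m, n):
    """Dot product with the loop body unrolled four times.

    Each partial sum has its own accumulator, so no addition depends on
    the result of the addition issued just before it.
    Assumes len(m) == len(n) and that the length is a multiple of 4.
    """
    sum1 = sum2 = sum3 = sum4 = 0.0
    for i in range(0, len(m), 4):
        sum1 += m[i] * n[i]
        sum2 += m[i + 1] * n[i + 1]
        sum3 += m[i + 2] * n[i + 2]
        sum4 += m[i + 3] * n[i + 3]
    # Combine the four partial sums once the loop has finished.
    return (sum1 + sum2) + (sum3 + sum4)


m = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
n = [1.0] * 8
result = dot_product_unrolled4(m, n)
print(result)
```

With four independent accumulators, each one is updated only once every four cycles, which matches the four-cycle loop carry path.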




              ADDSP Pipeline (Staggered Results)
                        Cycle   Instruction               Result
                          0     ADDSP x0, sum, sum        sum = 0
                          1     ADDSP x1, sum, sum        sum = 0
                          2     ADDSP x2, sum, sum        sum = 0
                          3     ADDSP x3, sum, sum        sum = 0
                          4     ADDSP x4, sum, sum        sum = x0
                          5     ADDSP x5, sum, sum        sum = x1
                          6     ADDSP x6, sum, sum        sum = x2
                          7     ADDSP x7, sum, sum        sum = x3
                          8     ADDSP x8, sum, sum        sum = x0 + x4
                          9                               sum = x1 + x5
                         10                               sum = x2 + x6
                         11                               sum = x3 + x7
                         12                               sum = x0 + x4 + x8


              ADDSP takes four cycles (three delay slots) to
               produce its result.
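
The staggering in the table can be reproduced with a small simulation. The sketch below is an illustrative Python model, not TI tooling: it assumes one `ADDSP x[c], sum, sum` is issued per cycle, that each result becomes readable four cycles after issue, and it tracks values symbolically as lists of input indices. The names `simulate_addsp`, `in_flight` and `reg` are hypothetical.

```python
ADDSP_LATENCY = 4  # result is readable 4 cycles after issue (3 delay slots)


def simulate_addsp(num_inputs, total_cycles):
    """Return the symbolic value of `sum` visible at each cycle.

    One `ADDSP x[c], sum, sum` is issued per cycle for c < num_inputs.
    A value is a list of input indices; the empty list stands for 0.
    """
    in_flight = {}  # cycle at which a result lands -> list of indices
    reg = []        # current contents of `sum`
    visible = []
    for cycle in range(total_cycles):
        if cycle in in_flight:      # a result is written back this cycle
            reg = in_flight.pop(cycle)
        visible.append(list(reg))
        if cycle < num_inputs:      # issue ADDSP x[cycle], sum, sum
            in_flight[cycle + ADDSP_LATENCY] = reg + [cycle]
    return visible


history = simulate_addsp(num_inputs=9, total_cycles=13)
for cycle, terms in enumerate(history):
    text = " + ".join(f"x{i}" for i in terms) or "0"
    print(f"{cycle:2d}  sum = {text}")
```

Because each ADDSP reads `sum` before the previous three results have landed, the register cycles through four interleaved partial sums instead of one running total.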
              ADDSP Pipeline (Staggered Results)
                        Cycle   Instruction               Result
                          0     ADDSP x0, sum, sum        sum = 0
                          1     ADDSP x1, sum, sum        sum = 0
                          2     ADDSP x2, sum, sum        sum = 0
                          3     ADDSP x3, sum, sum        sum = 0
                          4     ADDSP x4, sum, sum        sum = x0
                          5     ADDSP x5, sum, sum        sum = x1
                          6     ADDSP x6, sum, sum        sum = x2
                          7     ADDSP x7, sum, sum        sum = x3
                          8     ADDSP x8, sum, sum        sum = x0 + x4
                          9     NOP                       sum = x1 + x5
                         10     NOP                       sum = x2 + x6
                         11     NOP                       sum = x3 + x7
                         12     NOP                       sum = x0 + x4 + x8

              There are effectively four running sums:
                            sum (i)     = x(i) + x(i+4) + x(i+8) + …
                            sum (i+1)   = x(i+1) + x(i+5) + x(i+9) + …
                            sum (i+2)   = x(i+2) + x(i+6) + x(i+10) + …
                            sum (i+3)   = x(i+3) + x(i+7) + x(i+11) + …

              These need to be combined after the last
               addition is complete...
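
As a quick arithmetic check of the four running sums, the sketch below groups the inputs by index modulo four and then combines the groups. It is a minimal Python illustration with an assumed helper name `running_sums`, and the sample values 0 to 8 simply stand in for x0 to x8.

```python
def running_sums(x, lanes=4):
    """Split the accumulation of x into `lanes` interleaved running sums."""
    return [sum(x[lane::lanes]) for lane in range(lanes)]


x = [float(v) for v in range(9)]  # stands in for x0 .. x8
partial = running_sums(x)         # [x0+x4+x8, x1+x5, x2+x6, x3+x7]
total = sum(partial)              # combine after the last addition
print(partial, total)
```

The combined value equals the plain sum of all inputs, which is what the final combining additions have to achieve.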




            ADDSP Pipeline (Combining Results)
       Cycle            Instruction                Result
           0            ADDSP    x0, sum, sum      sum = 0
           1            ADDSP    x1, sum, sum      sum = 0
           2            ADDSP    x2, sum, sum      sum = 0
           3            ADDSP    x3, sum, sum      sum = 0
           4            ADDSP    x4, sum, sum      sum = x0
           5            ADDSP    x5, sum, sum      sum = x1
           6            ADDSP    x6, sum, sum      sum = x2
           7            ADDSP    x7, sum, sum      sum = x3
           8            ADDSP    x8, sum, sum      sum = x0 + x4
           9            MV       sum, temp         sum = x1 + x5
          10            ADDSP    sum, temp, sum1   sum = x2 + x6, temp = x1 + x5
          11            MV       sum, temp         sum = x3 + x7
          12            ADDSP    sum, temp, sum2   sum = x0 + x4 + x8, temp = x3 + x7
          13            NOP
          14            NOP                        sum1 = x1 + x2 + x5 + x6
          15            NOP
          16            ADDSP    sum1, sum2, sum   sum2 = x0 + x3 + x4 + x7 + x8
          17            NOP
          18            NOP
          19            NOP
          20            NOP                        sum = x0 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8
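
The complete schedule in the table can be checked with a small cycle-level model. The following Python sketch is an assumption-laden illustration, not a C6000 simulator: it assumes ADDSP results are readable four cycles after issue and MV results one cycle after issue, that operands are read in the issue cycle, and that writes landing in a cycle are visible to the instruction issued in that cycle. The names `run_schedule`, `LATENCY` and `pending` are hypothetical.

```python
LATENCY = {"ADDSP": 4, "MV": 1}  # cycles from issue until the result can be read


def run_schedule(schedule, x, last_cycle):
    """Run {cycle: (op, src1, src2, dst)} and return the final register values."""
    regs = {"sum": 0.0, "temp": 0.0, "sum1": 0.0, "sum2": 0.0}
    pending = {}  # landing cycle -> list of (dst, value) writes

    def read(name):
        # Names such as "x3" index the input list; anything else is a register.
        return x[int(name[1:])] if name[0] == "x" else regs[name]

    for cycle in range(last_cycle + 1):
        for dst, value in pending.pop(cycle, []):  # results landing this cycle
            regs[dst] = value
        if cycle in schedule:
            op, src1, src2, dst = schedule[cycle]
            value = read(src1) if op == "MV" else read(src1) + read(src2)
            pending.setdefault(cycle + LATENCY[op], []).append((dst, value))
    return regs


# Cycles 0 to 8 accumulate; cycles 9 to 16 combine the four running sums.
schedule = {c: ("ADDSP", f"x{c}", "sum", "sum") for c in range(9)}
schedule.update({
    9: ("MV", "sum", None, "temp"),
    10: ("ADDSP", "sum", "temp", "sum1"),
    11: ("MV", "sum", None, "temp"),
    12: ("ADDSP", "sum", "temp", "sum2"),
    16: ("ADDSP", "sum1", "sum2", "sum"),
})

x = [float(v) for v in range(9)]
final = run_schedule(schedule, x, last_cycle=20)
print(final["sum"])
```

Under these assumptions the value in `sum` at cycle 20 equals the sum of all nine inputs, in agreement with the last row of the table.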
               What Requires Multi-Cycle Loops?

                    Four reasons:
                        1.   Resource Limitations.
                        2.   Live Too Long.
                        3.   Loop Carry Path.
                        4.   Double Precision (FUL > 1).




                            Simple FUL Example

              [Dependency graph: MPYDP produces prod; the figure carries the labels 5 and 3 above MPYDP and 10 (4.9) at prod.]

              Cycle     1        2        3        4        5        6
              .M1       MPYDP                               MPYDP    ...




                        MPYDP ties up the functional unit
                                 for 4 cycles.
                    A Better Way to Diagram this ...
     Since the MPYDP instruction has a functional unit latency (FUL)
     of 4, .M1 cannot be used again until the fifth cycle.

     Hence, MII ≥ 4.

                    1        5        9        13
           .M1    MPYDP    MPYDP    MPYDP    MPYDP

                    2        6        10       14
           .M1

                    3        7        11       15
           .M1
                                     prod1    prod2

                    4        8        12       16
           .M1
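
The bound MII ≥ 4 can be illustrated by searching for the smallest iteration interval at which the functional unit is never found busy. This is a minimal Python sketch under the assumption that a single instruction per iteration occupies the unit for FUL consecutive cycles; `unit_is_free` and `min_ii` are hypothetical helper names.

```python
def unit_is_free(ii, ful, iterations=4):
    """True if issuing once every `ii` cycles never finds the unit still busy."""
    busy_until = 0  # first cycle at which the unit is free again
    for k in range(iterations):
        issue = k * ii
        if issue < busy_until:
            return False
        busy_until = issue + ful
    return True


def min_ii(ful):
    """Smallest iteration interval that avoids a conflict on the unit."""
    ii = 1
    while not unit_is_free(ii, ful):
        ii += 1
    return ii


MPYDP_FUL = 4
ii = min_ii(MPYDP_FUL)
issue_cycles = [1 + k * ii for k in range(4)]  # 1-based, as in the diagram
print(ii, issue_cycles)
```

With a FUL of 4 the search settles on an interval of 4, giving the issue cycles 1, 5, 9 and 13 shown for .M1.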



               What Requires Multi-Cycle Loops?
                         1.   Resource Limitations.
                         2.   Live Too Long.
                         3.   Loop Carry Path.
                         4.   Double Precision (FUL > 1).



             Lab: Converting your dot-product code to
                  Single-Precision Floating-Point.


     Chapter 12
Software Optimisation
       - End -

				