					Bitwidth Analysis with Application
      to Silicon Compilation
                Amit Chaudhari

         Paper by Mark Stephenson*,
    Jonathan Babb+, Saman Amarasinghe*
          *MIT Laboratory for Computer Science
                       +Princeton

  @ ACM SIGPLAN conference on Programming Language
  Design and Implementation, Vancouver, British Columbia,
                        June 2000
                     Goal
• For a program written in a high-level
  language, automatically find the minimum
  number of bits needed to represent:
  – Each static variable in the program
  – Each operation in the program.
    Usefulness of Bitwidth Analysis
•     Higher Language Abstraction

•     Enables other compiler optimizations
     1. Synthesizing application-specific processors
     2. Optimizing for power-aware processors
     3. Extracting more parallelism for SIMD
        processors
       Bitwidth Opportunities
• Runtime profiling reveals plenty of bitwidth
  opportunities.

• For the SPECint95 benchmark suite,
  – Over 50% of operands use fewer than half the
    number of bits specified by the programmer.
         Analysis Constraints
• Bitwidth results must maintain program
  correctness for all input data sets
  – Results are not runtime/data dependent


• A static analysis can do very well, even in
  light of this constraint
         Bitwidth Extraction
• Use abundant hints in the source language
  to discover bitwidths with near optimal
  precision.

• Caveats
  – Analysis limited to fixed-point variables.
  – The hints assume source program correctness.
                      The Hints
•    Bitwidth refining constructs
    1.   Arithmetic operations
    2.   Boolean operations
    3.   Bitmask operations
    4.   Loop induction variable bounding
    5.   Clamping operations
    6.   Type castings
    7.   Static array index bounding
      1. Arithmetic Operations
• Example
            int        a;
            unsigned b;
            a = random();
            b = random();
            a: 32 bits b: 32 bits
            a = a / 2;
            a: 31 bits b: 32 bits
            b = b >> 4;
            a: 31 bits b: 28 bits
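The annotations above follow from treating each operation as a transfer function on value ranges. A minimal sketch in C, assuming a closed-interval representation; `Range`, `range_div_const`, `range_shr`, and `bits_needed` are hypothetical names chosen for this illustration, not identifiers from the Bitwise compiler:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical closed interval <lo, hi> of values a variable may hold. */
typedef struct { int64_t lo, hi; } Range;

/* a / d for a constant d > 0. C division truncates toward zero and is
   monotonic for a positive divisor, so endpoints map to endpoints. */
Range range_div_const(Range a, int64_t d) {
    assert(d > 0);
    Range r = { a.lo / d, a.hi / d };
    return r;
}

/* a >> k for a non-negative range (an unsigned variable). */
Range range_shr(Range a, int k) {
    assert(a.lo >= 0 && k >= 0 && k < 63);
    Range r = { a.lo >> k, a.hi >> k };
    return r;
}

/* Smallest width that holds every value in r: unsigned encoding when
   lo >= 0, two's complement otherwise. */
int bits_needed(Range r) {
    assert(r.lo <= r.hi);
    for (int bits = 1; bits < 64; bits++) {
        if (r.lo >= 0) {
            if ((r.hi >> bits) == 0) return bits;          /* hi < 2^bits */
        } else {
            int64_t half = (int64_t)1 << (bits - 1);
            if (r.lo >= -half && r.hi < half) return bits; /* fits signed */
        }
    }
    return 64;
}
```

With a full 32-bit signed range for `a` and a full 32-bit unsigned range for `b`, this sketch yields 31 bits after `a / 2` and 28 bits after `b >> 4`, matching the slide.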
      2. Boolean Operations
• Example

            int a;
            a: 32 bits
            a = (b != 15);
             a: 1 bit
      3. Bitmask Operations
• Example

            int a;
            a: 32 bits
            a = random() & 0xff;
             a: 8 bits
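An AND with a constant mask can never set a bit above the mask's highest set bit, whatever the other operand is. A small sketch of that rule; `mask_width` is a hypothetical helper name used only for this illustration:

```c
#include <stdint.h>

/* Width of (x & mask) for any x: the position of the highest set bit
   in the constant mask, plus one. By convention a zero mask still
   needs one bit to hold the value 0. */
int mask_width(uint32_t mask) {
    int bits = 1;
    while ((mask >>= 1) != 0)
        bits++;
    return bits;
}
```

For the mask `0xff` this gives 8, the width shown on the slide.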
    4. Loop Induction Variable
              Bounding
• Applicable to for loop induction variables.
• Example

           int i;
            i: 32 bits
           for (i = 0; i < 6; i++) {
                   i: 3 bits
                    …
           }
               i: 3 bits
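A sketch of the bound for the simplest case only, assuming a loop of the form `for (i = init; i < bound; i++)` whose body does not modify `i`. The names `induction_range` and `unsigned_bits` are hypothetical:

```c
#include <assert.h>

/* Hypothetical closed interval <lo, hi>. */
typedef struct { long lo, hi; } Range;

/* Range of i over its whole lifetime. Inside the body i stays in
   <init, bound - 1>; the final increment that fails the test leaves
   i == bound, so the exit value is included in the result. */
Range induction_range(long init, long bound) {
    assert(init <= bound);
    Range r = { init, bound };
    return r;
}

/* Unsigned width of a non-negative upper bound. */
int unsigned_bits(long hi) {
    assert(hi >= 0);
    int bits = 1;
    while ((hi >>= 1) != 0)
        bits++;
    return bits;
}
```

For the loop above, `i` ranges over <0,6>, which fits in 3 bits both inside and after the loop.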
     5. Clamping Optimization
• Multimedia codes often simulate saturating
  instructions.
• Example
     int valpred;
      valpred: 32 bits
     if (valpred > 32767)
        valpred = 32767;
     else if (valpred < -32768)
        valpred = -32768;
      valpred: 16 bits
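Clamping is monotonic, so the range after the if/else chain is simply the incoming range with both endpoints clamped. A minimal sketch, with `Range`, `clamp64`, `range_clamp`, and `signed_bits` as hypothetical names:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical closed interval <lo, hi>. */
typedef struct { int64_t lo, hi; } Range;

int64_t clamp64(int64_t x, int64_t min, int64_t max) {
    return x < min ? min : (x > max ? max : x);
}

/* Range of v after: if (v > max) v = max; else if (v < min) v = min; */
Range range_clamp(Range v, int64_t min, int64_t max) {
    assert(min <= max);
    Range r = { clamp64(v.lo, min, max), clamp64(v.hi, min, max) };
    return r;
}

/* Smallest two's-complement width that holds every value in r. */
int signed_bits(Range r) {
    for (int bits = 1; bits < 64; bits++) {
        int64_t half = (int64_t)1 << (bits - 1);
        if (r.lo >= -half && r.hi < half)
            return bits;
    }
    return 64;
}
```

Clamping a full 32-bit range to <-32768,32767> leaves a 16-bit value, as in the `valpred` example.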
     6. Type Casting (Part I)
• Example
            int a;
            char b;
            a: 32 bits b: 8 bits
            a = b;
             a: 8 bits b: 8 bits
    6. Type Casting (Part II)
• Example
            int a;
            char b;
            a: 32 bits b: 8 bits
             a: 8 bits b: 8 bits
            b = a;
             a: 8 bits b: 8 bits
   7. Array Index Optimization
• The bitwidth of an array index can be bounded using
  the bounds of the array.
• Example
  int a, b;
  int X[1024];
  a: 32 bits b: 32 bits
   a: 10 bits b: 8 bits
  X[a] = X[4*b];
   a: 10 bits b: 8 bits
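The bound on `b` comes from pushing the array bound backward through the multiplication. A sketch under the slide's own caveat that the source program is correct (no out-of-bounds accesses); `index_bound`, `undo_mul_const`, and `unsigned_bits` are hypothetical names:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical closed interval <lo, hi>. */
typedef struct { int64_t lo, hi; } Range;

/* Any index into an array of n elements must lie in <0, n - 1>. */
Range index_bound(int64_t n) {
    assert(n > 0);
    Range r = { 0, n - 1 };
    return r;
}

/* Backward step through idx = k * v for a constant k > 0:
   v must lie in <ceil(lo / k), floor(hi / k)>. */
Range undo_mul_const(Range idx, int64_t k) {
    assert(k > 0 && idx.lo >= 0);
    Range r = { (idx.lo + k - 1) / k, idx.hi / k };
    return r;
}

/* Unsigned width of a non-negative upper bound. */
int unsigned_bits(int64_t hi) {
    assert(hi >= 0);
    int bits = 1;
    while ((hi >>= 1) != 0)
        bits++;
    return bits;
}
```

For `X[1024]`, the index `a` is bounded by <0,1023> (10 bits), and `4*b` in <0,1023> bounds `b` by <0,255> (8 bits).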
     Propagating Data-Ranges
• Data-flow analysis
• Three candidate lattices
  – Bitwidth
  – Vector of bits
  – Data-ranges

         a: 4 bits
        a = a + 1     Propagating bitwidths
         a: 5 bits

         a: 1X
        a = a + 1     Propagating bit vectors
         a: XXX

         a: <0,13>
        a = a + 1     Propagating data-ranges
         a: <1,14>    Four bits are required
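The difference in precision between the bitwidth lattice and the data-range lattice can be seen on the `a = a + 1` example. The sketch below compares the two; `width_add`, `range_add`, and `unsigned_bits` are hypothetical names, and the bit-vector lattice is left out for brevity:

```c
#include <assert.h>

/* Bitwidth lattice: only a width is tracked, so an add has to allow
   for a carry out of the wider operand. */
int width_add(int wa, int wb) {
    return (wa > wb ? wa : wb) + 1;
}

/* Data-range lattice: a hypothetical closed interval <lo, hi>. */
typedef struct { long lo, hi; } Range;

Range range_add(Range a, Range b) {
    Range r = { a.lo + b.lo, a.hi + b.hi };
    return r;
}

/* Unsigned width of a non-negative upper bound. */
int unsigned_bits(long hi) {
    assert(hi >= 0);
    int bits = 1;
    while ((hi >>= 1) != 0)
        bits++;
    return bits;
}
```

Starting from a 4-bit `a`, the bitwidth lattice reports 5 bits after the add, while the range <0,13> becomes <1,14> and still fits in 4 bits.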
      Propagating Data-Ranges
• Propagate data-ranges forward and backward over
  the control-flow graph using transfer functions
  described in the paper

• Use Static Single Assignment (SSA) form with
  extensions to:
  – Gracefully handle pointers and arrays.
  – Extract data-range information from conditional
    statements.
    Example of Data-Range
        Propagation
                                      forward        after backward pass
   a0 = input()                       <-128, 127>    <-2, 8>
   a1 = a0 + 1                        <-127, 127>    <-1, 9>
   if (a1 < 0)
     true:
       a2 = a1:(a1 < 0)               <-127, -1>     <-1, -1>
       a3 = a2 + 1                    <-126, 0>      <0, 9>
     false:
       a4 = a1:(a1 >= 0)              <0, 127>       <0, 9>
       c0 = a4                        <0, 127>       <0, 9>
   a5 = φ(a3, a4)                     <-126, 127>    <0, 9>
   b0 = array[a5]                     array's bounds are [0:9]

   Range-refinement functions: a1:(a1 < 0) and a1:(a1 >= 0)
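The forward and backward passes over this example can be sketched directly in C. This is an illustration under assumptions, not the paper's transfer functions: the variables are taken to be 8-bit signed (as the <-128, 127> input range suggests), additions are capped to that type's range, and all names (`Range`, `Ranges`, `meet`, `join`, `shift`, `forward`, `backward`) are hypothetical. Because the sketch intersects `a3` with its forward range, it reports <0, 0> for `a3`, which is tighter than the <0, 9> that the slide appears to list.

```c
#include <assert.h>

/* Hypothetical closed interval <lo, hi>. */
typedef struct { long lo, hi; } Range;
typedef struct { Range a0, a1, a2, a3, a4, a5; } Ranges;

static long lmin(long a, long b) { return a < b ? a : b; }
static long lmax(long a, long b) { return a > b ? a : b; }

/* Intersection of two overlapping ranges. */
Range meet(Range a, Range b) {
    Range r = { lmax(a.lo, b.lo), lmin(a.hi, b.hi) };
    assert(r.lo <= r.hi);
    return r;
}

/* Smallest range covering both inputs (used at the phi node). */
Range join(Range a, Range b) {
    Range r = { lmin(a.lo, b.lo), lmax(a.hi, b.hi) };
    return r;
}

/* Range of x + k for a constant k. */
Range shift(Range a, long k) {
    Range r = { a.lo + k, a.hi + k };
    return r;
}

void forward(Ranges *r, Range input, Range type) {
    r->a0 = input;
    r->a1 = meet(shift(r->a0, 1), type);              /* a1 = a0 + 1     */
    r->a2 = (Range){ r->a1.lo, lmin(r->a1.hi, -1) };  /* a1:(a1 < 0)     */
    r->a3 = shift(r->a2, 1);                          /* a3 = a2 + 1     */
    r->a4 = (Range){ lmax(r->a1.lo, 0), r->a1.hi };   /* a1:(a1 >= 0)    */
    r->a5 = join(r->a3, r->a4);                       /* phi(a3, a4)     */
}

void backward(Ranges *r, Range bounds) {
    r->a5 = meet(r->a5, bounds);                      /* array[a5]       */
    r->a3 = meet(r->a3, r->a5);
    r->a4 = meet(r->a4, r->a5);
    r->a2 = meet(r->a2, shift(r->a3, -1));            /* undo a3 = a2+1  */
    r->a1 = meet(r->a1, join(r->a2, r->a4));
    r->a0 = meet(r->a0, shift(r->a1, -1));            /* undo a1 = a0+1  */
}
```

Calling `forward` with input and type ranges of <-128, 127> and then `backward` with the array bounds <0, 9> narrows `a0` to <-2, 8> and `a1` to <-1, 9>, in line with the slide.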
      What to do with Loops?
• Finding the fixed-point around back edges
  will often saturate data-ranges.
• Instructions in loops comprise the bulk of
  dynamically executed instructions!
         Their Loop Solution
• Find the closed-form solutions to commonly
  occurring sequences.
  – A sequence is a mutually dependent group of
    instructions.


• Use the closed-form solutions to determine
  final ranges.
Finding the Closed-Form Solution
a = 0               <0,0>
for i = 1 to 10
  a = a + 1         <1,460>
  for j = 1 to 10
      a = a + 2     <3,480>
  for k = 1 to 10
      a = a + 3     <24,510>
...= a + 4          <510,510>

• Non-trivial to find the exact ranges
• Can easily find conservative range of <0,510>
   Solving the Linear Sequence
a = 0
for i = 1 to 10        <1,10>
  a = a + 1            <1,10>*<1,1>=<1,10>
  for j = 1 to 10      <1,100>
      a = a + 2        <1,100>*<2,2>=<2,200>
  for k = 1 to 10      <1,100>
      a = a + 3        <1,100>*<3,3>=<3,300>
...= a + 4 (<1,10>+<2,200>+<3,300>) ∪ <0,0> = <0,510>
• Figure out the iteration count of each loop.
• Find out how much each instruction contributes to the
  sequence using its iteration count.
• Sum all the contributions together, and take the data-range
  union with the initial value.
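The three steps can be sketched as follows, assuming the iteration-count range of each instruction has already been determined and that all counts and steps are non-negative. `Term` and `solve_linear_sequence` are hypothetical names for this illustration:

```c
#include <assert.h>

/* Hypothetical closed interval <lo, hi>. */
typedef struct { long lo, hi; } Range;

/* One instruction "a = a + step" and how many times it runs in total. */
typedef struct { Range count; Range step; } Term;

/* Product of two non-negative ranges. */
Range range_mul(Range a, Range b) {
    assert(a.lo >= 0 && b.lo >= 0);
    Range r = { a.lo * b.lo, a.hi * b.hi };
    return r;
}

Range range_add(Range a, Range b) {
    Range r = { a.lo + b.lo, a.hi + b.hi };
    return r;
}

/* Smallest range covering both inputs. */
Range range_union(Range a, Range b) {
    Range r = { a.lo < b.lo ? a.lo : b.lo, a.hi > b.hi ? a.hi : b.hi };
    return r;
}

/* Sum each instruction's contribution (count * step), then take the
   union with the initial value to get a conservative range for a. */
Range solve_linear_sequence(Range init, const Term *terms, int n) {
    Range total = { 0, 0 };
    for (int i = 0; i < n; i++)
        total = range_add(total, range_mul(terms[i].count, terms[i].step));
    return range_union(total, init);
}
```

With the counts <1,10>, <1,100>, <1,100> and steps 1, 2, 3 from the slide, the contributions sum to <6,510>, and the union with the initial <0,0> gives <0,510>.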
                    Results
• Standalone Bitwise compiler.
  – Bits cut from scalar variables
  – Bits cut from array variables


• With the DeepC silicon compiler.
        Percentage of Original Scalar Bits
[Figure: bar chart of percentage of bits remaining (0 to 100) per benchmark,
 with Bitwise vs. dynamic profile. Benchmarks: softfloat, adpcm, bubblesort,
 life, intmatmul, jacobi, median, mpegcorr, convolve, histogram, intfir,
 parity, pmatch, sor.]
        Percentage of Original Array Bits
[Figure: bar chart of percentage of bits remaining (0 to 100) per benchmark,
 with Bitwise vs. dynamic profile. Benchmarks: softfloat, adpcm, bubblesort,
 life, intmatmul, jacobi, median, mpegcorr, convolve, histogram, intfir,
 parity, pmatch, sor.]
DeepC Compiler Targeted to FPGAs
            C/Fortran program
          → Suif Frontend
          → Pointer alias and other high-level analyses
          → Bitwidth Analysis
          → Raw parallelization
          → MachSuif Codegen
          → DeepC specialization
          → Verilog
          → Traditional CAD optimizations
          → Physical Circuit
                  FPGA Area
[Figure: bar chart of area (CLB count, 0 to 2000) per benchmark, without
 bitwise vs. with bitwise. Benchmarks (main datapath width): adpcm (8),
 bubblesort (32), convolve (16), histogram (16), intfir (32), intmatmul (16),
 jacobi (8), life (1), median (32), mpegcorr (16), newlife (1), parity (32),
 pmatch (32), sor (32).]
• On average bitwidth optimized circuit used 57% less area
               FPGA Clock Speed
[Figure: bar chart of XC4000-09 clock speed (MHz, 0 to 150) per benchmark,
 without bitwise vs. with bitwise, against a 50 MHz target. Benchmarks:
 adpcm, bubblesort, convolve, histogram, intfir, intmatmul, jacobi, life,
 median, mpegcorr, newlife, parity, pmatch, sor.]
               Power Savings
[Figure: bar chart of average dynamic power (mW, 0 to 5) for bubblesort,
 histogram, jacobi, and pmatch, without bitwidth analysis vs. with
 bitwidth analysis.]
• On average, analysis reduced power by 50%.
               Power Savings

• C → ASIC
  – IBM SA27E process
     • 0.15 micron drawn
  – 200 MHz
• Methodology
  – C → RTL
  – RTL simulation → Register switching activity
  – Synthesis reports dynamic power
                   Summary
• Bitwise: a scalable bitwidth analyzer
  – Standard data-flow analysis
  – Loop analysis
  – Incorporate pointer analysis
• Demonstrated savings when targeting silicon
  from high-level languages
  – 57% less area
  – up to 86% improvement in clock speed
  – less than 50% of the power
Thank You

				