					Warp Processing:
Making FPGAs Ubiquitous via Invisible Synthesis



                   Greg Stitt
   Department of Electrical and Computer Engineering
                 University of Florida
        Introduction
   Improved performance enables new applications
       Past decade: MP3 players, portable game consoles, cell phones, etc.
       Future architectures: speech/image recognition, self-guiding cars,
        computational biology, etc.




                                                                               2/26
    Introduction
   FPGAs (Field Programmable Gate Arrays): implement custom circuits
       Speedups of 10x, 100x, even 1000x for scientific and embedded apps
            [Najjar 04][He, Lu, Sun 05][Levine, Schmit 03][Prasanna 06][Stitt, Vahid 05], …
       But FPGAs are not mainstream
   Warp Processing Goal: Bring FPGAs into the mainstream
       Make FPGAs “invisible”

   [Figure: bar chart of Performance, FPGA vs. uP. FPGAs are capable of large performance improvements.]
                                                                                               3/26
          Introduction –
          Hardware/Software Partitioning
    C Code for FIR Filter
        for (i=0; i < 128; i++)
           y[i] += c[i] * x[i]
        ..
        ..
    Hardware/software partitioning selects performance critical regions
    for hardware implementation
        [Ernst, Henkel 93][Gupta, DeMicheli 97][Vahid, Gajski 94]
        [Eles et al. 97][Sangiovanni-Vincentelli 94]
    Designer creates custom hardware for the loop using a hardware
    description language (HDL)

    [Figure: Compiler output runs on the Processor; the loop is implemented on the FPGA as a row of multipliers (*) feeding an adder tree (+).]
    [Figure: bar chart of Time and Energy (scale 0 to 100), Sw vs. Hw/Sw.]

    Processor: ~1000 cycles          FPGA: ~10 cycles
    Speedup = 1000 cycles / 10 cycles = 100x
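
The multiplier row and adder tree in the figure can be mimicked in software. The following is a minimal C sketch, not the circuit from the slide: the names `fir_tree_sum` and `N_TAPS` are hypothetical, and it assumes the products are summed into a single result, as the adder tree suggests. Each pass of the second loop corresponds to one level of adders that hardware could evaluate in parallel.

```c
#include <stddef.h>

#define N_TAPS 128  /* hypothetical size; must be a power of two in this sketch */

/* Multiply c[i] * x[i] for all taps, then sum the products pairwise,
 * level by level, the way a hardware adder tree would.
 * The number of adder levels is written to *levels (may be NULL). */
long fir_tree_sum(const int *c, const int *x, int *levels)
{
    long prod[N_TAPS];
    int depth = 0;

    /* Stage 1: all multiplications are independent (one multiplier each). */
    for (size_t i = 0; i < N_TAPS; i++)
        prod[i] = (long)c[i] * x[i];

    /* Stage 2: each pass halves the number of partial sums. */
    for (size_t width = N_TAPS; width > 1; width /= 2) {
        for (size_t i = 0; i < width / 2; i++)
            prod[i] = prod[2 * i] + prod[2 * i + 1];
        depth++;
    }

    if (levels)
        *levels = depth;
    return prod[0];
}
```

With 128 taps the tree has 7 levels. Assuming roughly one cycle for the multipliers and one per adder level (an assumption, not a figure from the slide), that is in line with the ~10 cycle estimate above.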
                                                                                                 4/26
              Introduction –
              High-level Synthesis
   Problem: Describing a circuit using HDL is time consuming/difficult
   Solution: High-level synthesis
       Create circuit from high-level code
           [Gupta, DeMicheli 92][Camposano, Wolf 91][Rabaey 96][Gajski, Dutt 92]
       Allows developers to use higher-level specification
       Potentially, enables synthesis for software developers

   [Figure: tool flow with High-level Code, Hw/Sw Partitioning, Compiler (Software), High-level Synthesis (Hardware), Libraries/Object Code, Linker, Bitstream, uP, FPGA.]
                                                                                          5/26
        Introduction –
        High-level Synthesis

            for (i=0; i < 16; i++)
              y[i] += c[i] * x[i]

   [Figure: High-level Synthesis turns the loop into a row of multipliers (*) feeding an adder tree (+).]
                                                                                                       6/26
             Problems with High-Level Synthesis
   Problem: High-level synthesis is unattractive to software developers
       Requires specialized language
           SystemC, NapaC, HandelC, …
       Requires specialized compiler
           Spark, ROCCC, CatapultC, …
       Limited commercial success
           Software developers reluctant to change tools

   [Figure: Non-Standard Software Tool Flow with Specialized Language, Specialized Compiler (Synthesis), Software, Hardware, Libraries/Object Code, Linker, Bitstream, uP, FPGA.]


                                                                                     7/26
              Warp Processing – “Invisible” Synthesis
   Solution: Make synthesis “invisible”
       2 Requirements
           Standard software tool flow
               Perform compilation before synthesis
           Hide synthesis tool
               Move synthesis on chip
               Similar to dynamic binary translation [Transmeta]
               But, translate to hw

   [Figure: Standard Software Tool Flow, with compilation moved before synthesis: High-level Code, Compiler, Libraries/Object Code, Linker, Software Binary, Decompilation, Synthesis, Updated Binary, Bitstream, uP, FPGA.]
                                                                                            8/26
              Warp Processing – “Invisible” Synthesis
   Warp processor looks like standard uP but invisibly synthesizes hardware

   [Figure: the tool flow from the previous slide (High-level Code, Compiler, Linker, Software Binary, Decompilation, Synthesis, Updated Binary, Bitstream, uP, FPGA), shown with the two requirements: standard software tool flow and hidden synthesis tool.]
                                                                                          9/26
              Warp Processing – “Invisible” Synthesis
   Advantages
       Supports all languages, compilers, IDEs
       Supports synthesis of assembly code
       Supports synthesis of library code
       Also, enables dynamic optimizations

   [Figure: the tool flow annotated with example languages (C, C++, Java, Matlab) and compilers (gcc, g++, javac, keil). Warp processor looks like standard uP but invisibly synthesizes hardware.]


                                                                              10/26
         Warp Processing Background: Basic Idea
1   Initially, the software binary is loaded into instruction memory

    Software Binary
        Mov reg3, 0
        Mov reg4, 0
        loop:
        Shl reg1, reg3, 1
        Add reg5, reg2, reg1
        Ld reg6, 0(reg5)
        Add reg4, reg4, reg6
        Add reg3, reg3, 1
        Beq reg3, 10, -5
        Ret reg4

    [Figure: warp processor block diagram with µP, I Mem, D$, Profiler, On-chip CAD, FPGA.]




                                                                 11/26
         Warp Processing Background: Basic Idea
2   Microprocessor executes instructions in the software binary

    [Figure: warp processor block diagram (µP, I Mem, D$, Profiler, On-chip CAD, FPGA) next to the unchanged Software Binary listing, with Time and Energy bars.]




                                                                       12/26
         Warp Processing Background: Basic Idea
3   Profiler monitors instructions and detects critical regions in the binary

    [Figure: the Profiler observes the stream of executed add and beq instructions from the Software Binary loop and reports "Critical Loop Detected". Time and Energy bars are unchanged.]
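
The slide does not say how the profiler finds the critical loop. One simple approach, sketched below purely as an assumption, is to count taken backward branches per target address and treat the most frequent target as the critical loop. The type `profiler_t`, the functions `profiler_observe` and `profiler_hottest`, and the table size are all hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

#define N_COUNTERS 16  /* hypothetical table size */

typedef struct {
    uint32_t target[N_COUNTERS];  /* loop start address (branch target) */
    uint32_t count[N_COUNTERS];   /* times a backward branch jumped there */
    size_t   used;                /* number of table entries in use */
} profiler_t;

/* Record one taken branch. Only backward branches (target <= pc) are
 * counted, since those are the branches that close loops. */
void profiler_observe(profiler_t *p, uint32_t pc, uint32_t target)
{
    if (target > pc)
        return;  /* forward branch: not a loop */
    for (size_t i = 0; i < p->used; i++) {
        if (p->target[i] == target) {
            p->count[i]++;
            return;
        }
    }
    if (p->used < N_COUNTERS) {
        p->target[p->used] = target;
        p->count[p->used] = 1;
        p->used++;
    }
    /* Table full: this sketch simply ignores additional loops. */
}

/* Return the start address of the most frequently executed loop,
 * or 0 if nothing has been recorded. */
uint32_t profiler_hottest(const profiler_t *p)
{
    uint32_t best = 0, best_count = 0;
    for (size_t i = 0; i < p->used; i++) {
        if (p->count[i] > best_count) {
            best_count = p->count[i];
            best = p->target[i];
        }
    }
    return best;
}
```

A hardware profiler would keep such counters in a small on-chip table rather than in software; the sketch only illustrates the counting idea.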




                                                                             13/26
         Warp Processing Background: Basic Idea
4   On-chip CAD reads in the critical region

    [Figure: warp processor block diagram (µP, I Mem, D$, Profiler, On-chip CAD, FPGA) next to the unchanged Software Binary listing, with Time and Energy bars.]




                                                                          14/26
         Warp Processing Background: Basic Idea
5   On-chip CAD converts the critical region into a
    control data flow graph (CDFG)

    Decompiled region
        reg3 := 0
        reg4 := 0
        loop:
        reg4 := reg4 + mem[reg2 + (reg3 << 1)]
        reg3 := reg3 + 1
        if (reg3 < 10) goto loop
        ret reg4

    [Figure: warp processor block diagram (µP, I Mem, D$, Profiler, FPGA) with the On-chip CAD block labeled Dynamic Part. Module (DPM), next to the unchanged Software Binary listing, with Time and Energy bars.]
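
For readers more comfortable with source code than with register transfers, the decompiled region corresponds to a short summation loop. The following C sketch is an illustration only: the function name `sum10` is hypothetical, and the 16-bit element type is an assumption inferred from the `reg3 << 1` address scaling (2 bytes per element).

```c
#include <stdint.h>

/* C equivalent of the decompiled region: sum ten 16-bit values.
 * `base` plays the role of reg2, `i` of reg3, and `sum` of reg4. */
int32_t sum10(const int16_t *base)
{
    int32_t sum = 0;                 /* reg4 := 0 */
    for (int i = 0; i < 10; i++)     /* reg3 := 0 ... if (reg3 < 10) goto loop */
        sum += base[i];              /* reg4 := reg4 + mem[reg2 + (reg3 << 1)] */
    return sum;                      /* ret reg4 */
}
```

Recovering this loop structure from the binary is what lets the later synthesis step see ten independent loads feeding a chain of additions.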
                                                                                           15/26
         Warp Processing Background: Basic Idea
6   On-chip CAD synthesizes the decompiled CDFG to a custom (parallel) circuit

    [Figure: the decompiled region (reg4 := reg4 + mem[reg2 + (reg3 << 1)], repeated while reg3 < 10) becomes a tree of adders (+). Block diagram as before: µP, I Mem, D$, Profiler, Dynamic Part. Module (DPM), FPGA, with Time and Energy bars.]
                                                                                         16/55
         Warp Processing Background: Basic Idea
7   On-chip CAD maps the circuit onto the FPGA

    [Figure: the adder tree is placed onto the FPGA's configurable logic blocks (CLB) and routed through switch matrices (SM). Block diagram as before: µP, I Mem, D$, Profiler, Dynamic Part. Module (DPM), FPGA, with Time and Energy bars.]
                                                                              17/55
         Warp Processing Background: Basic Idea
8   On-chip CAD replaces instructions in the binary to use hardware, causing
    performance and energy to “warp” by an order of magnitude or more

    Software Binary
        Mov reg3, 0
        Mov reg4, 0
        loop:
        // instructions that interact with FPGA
        Ret reg4

    [Figure: the circuit mapped onto the FPGA's CLBs and switch matrices (SM). Time and Energy bars compare Software-only with “Warped”.]
                                                                                18/55
        Warp Processing Background: Basic Technology
   Challenge: CAD tools normally require powerful workstations
   Develop extremely efficient on-chip CAD tools
       Requires efficient synthesis
       Requires specialized FPGA, physical design tools (JIT FPGA compilation)
            [Lysecky FCCM05/DAC04], University of Arizona

   [Figure: on-chip CAD flow: Binary, Synthesis, Logic Optimization, Technology Mapping, Placement & Routing, producing HW and an Updated Binary. Technology Mapping and Placement & Routing are marked as JIT FPGA compilation. Annotation: 46x improvement, 30% perf. penalty.]
   [Figure: warp processor block diagram with uP, I$, D$, Profiler, FPGA, On-chip CAD.]
                                                                                        19/55
           Warp Processing:
           Initial Results - Embedded Applications
   [Figure: bar chart of Speedup (y-axis 0 to 15) per benchmark. x-axis, Benchmarks: brev, g3fax, url, rocm, pktflow, canrdr, bitmnp, tblook, ttsprk, matrix, idct, g721, mpeg2, fir, matmul, Average, Median.]


         Average speedup of 6.3x
                  Achieved completely transparently
         Also, energy savings of 66%

                                                                     20/26
         Thread Warping - Overview
   Multi-core platforms, multi-threaded apps
                for (i = 0; i < 10; i++) {
                    thread_create( f, i );
                }
   Compiler produces the binary; OS schedules threads onto available µPs
       Remaining threads added to queue
   Thread warping: use one core to create accelerator for waiting threads
       OS invokes on-chip CAD tools to create accelerators for f()
   OS schedules threads onto accelerators (possibly dozens), in addition to µPs
   Very large speedups possible – parallelism at bit, arithmetic, and now thread level too
   [Figure: platform diagram with µPs, OS, FPGA, on-chip CAD, accelerator library (Acc. Lib), and f() instances; performance chart comparing x86, VLIW, and TWrp]
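
The scheduling idea on this slide can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the scheduler used by thread warping: the function name `makespan`, the `MAX_UNITS` limit, and the fixed per-thread costs `cpu_time` and `accel_time` are all hypothetical. Every queued thread is placed on whichever unit (µP or accelerator) would finish it earliest.

```c
#include <assert.h>

/* Hypothetical model: every thread runs the same function f(), taking
 * cpu_time units on a uP and accel_time units on an FPGA accelerator. */
#define MAX_UNITS 128

/* Returns the time at which the last of n_threads finishes when each queued
 * thread is placed on the unit that would complete it earliest.
 * Units 0..n_cpus-1 are uPs; the remaining n_accels units are accelerators. */
unsigned makespan(unsigned n_threads, unsigned n_cpus, unsigned n_accels,
                  unsigned cpu_time, unsigned accel_time)
{
    unsigned busy_until[MAX_UNITS] = {0};
    unsigned n_units = n_cpus + n_accels;
    unsigned finish = 0;

    assert(n_units > 0 && n_units <= MAX_UNITS);

    for (unsigned t = 0; t < n_threads; t++) {
        unsigned best = 0, best_done = 0;
        for (unsigned u = 0; u < n_units; u++) {
            unsigned cost = (u < n_cpus) ? cpu_time : accel_time;
            unsigned done = busy_until[u] + cost;
            if (u == 0 || done < best_done) {
                best = u;
                best_done = done;
            }
        }
        busy_until[best] = best_done;   /* thread t runs on unit 'best' */
        if (best_done > finish)
            finish = best_done;
    }
    return finish;
}
```

With made-up costs of 8 time units per thread on a µP and 1 on an accelerator, `makespan(10, 4, 0, 8, 1)` gives 24 for four µPs alone, while `makespan(10, 3, 1, 8, 1)` gives 8 for three µPs plus one accelerator. The numbers are illustrative only and do not come from the slides.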
                                                                                                 21/26
       Speedup from Thread Warping

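    [Figure: bar chart of speedup (y-axis 0 to 50) for benchmarks Fir, Prewitt, Linear, Moravec, Wavelet, Maxfilter, 3DTrans, N-body, plus Avg. and Geo. Mean; series: 4-uP, TW, 8-uP, 16-uP, 32-uP, 64-uP; labels on off-scale bars, left to right: 308, 130, 502, 63, 130, 38]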



    Average 130x speedup
    But, FPGA uses additional area
        So we also compare to systems with 8 to 64 ARM11 uPs – FPGA size = ~36 ARM11s
    11x faster than 64-core system
    Simulation pessimistic, actual results likely better
                                                                                          22/26
    Dynamic enables Custom Communication
    NoC – Network on a Chip provides communication between multiple cores
    Problem: Best topology is application dependent
    [Figure: four µPs connected through an FPGA; charts for App1 and App2, each with Bus and Mesh entries]
    Warp processing can dynamically choose topology
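
Why the best topology depends on the application can be seen in a deliberately simple cost model. The sketch below is an assumption-laden illustration, not the method used by warp processing: the names `bus_cycles`, `mesh_cycles`, and `choose_topology`, the 2x2 mesh layout, and both cost formulas are hypothetical. The bus serializes every transfer, while the mesh lets cores send in parallel but charges one cycle per hop.

```c
#include <assert.h>
#include <stdlib.h>

#define CORES 4        /* four uPs, as in the slide's diagram */
#define MESH_WIDTH 2   /* assumed 2x2 mesh layout */

typedef enum { TOPOLOGY_BUS, TOPOLOGY_MESH } topology_t;

/* Toy bus model: the shared bus moves one word per cycle,
 * so all transfers are serialized. */
unsigned long bus_cycles(unsigned traffic[CORES][CORES])
{
    unsigned long total = 0;
    for (int src = 0; src < CORES; src++)
        for (int dst = 0; dst < CORES; dst++)
            if (src != dst)
                total += traffic[src][dst];
    return total;
}

/* Toy mesh model: cores send in parallel and each word costs one cycle
 * per hop (Manhattan distance); the busiest sender sets the total.
 * Link contention is ignored. */
unsigned long mesh_cycles(unsigned traffic[CORES][CORES])
{
    unsigned long worst = 0;
    for (int src = 0; src < CORES; src++) {
        unsigned long mine = 0;
        for (int dst = 0; dst < CORES; dst++) {
            int hops = abs(src % MESH_WIDTH - dst % MESH_WIDTH)
                     + abs(src / MESH_WIDTH - dst / MESH_WIDTH);
            mine += (unsigned long)traffic[src][dst] * (unsigned long)hops;
        }
        if (mine > worst)
            worst = mine;
    }
    return worst;
}

/* Picks the topology with the lower estimated cycle count for the
 * observed traffic (words sent from core src to core dst). */
topology_t choose_topology(unsigned traffic[CORES][CORES])
{
    assert(traffic != NULL);
    return mesh_cycles(traffic) < bus_cycles(traffic) ? TOPOLOGY_MESH
                                                      : TOPOLOGY_BUS;
}
```

Under this model, a workload where one core sends 100 words to the diagonally opposite core favors the bus (100 cycles versus 200), while a workload where neighboring pairs exchange 50 words each favors the mesh (50 cycles versus 200). A real system would need measured traffic and a more faithful cost model.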
                                                                   24/26
    Summary
   Warp processors
       Achieve performance advantages of FPGAs without any extra effort
       “Invisible” synthesis
            Allows designers to use existing tools/languages
       Enable dynamic hardware optimization
   Thread warping
       Dynamic synthesis of thread accelerators for multi-cores
   Custom communication
       Warp processing can adapt communication topology to needs of application or a particular workload



                                                                   25/26
     References
        Patent
                Warp Processor for Dynamic Hardware/Software Partitioning. F. Vahid, R. Lysecky, G. Stitt. Patent Pending, 2004.
1.       Hardware/Software Partitioning of Software Binaries
         G. Stitt and F. Vahid
         IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2002, pp. 164-170.
2.       Warp Processors
         R. Lysecky, G. Stitt, and F. Vahid.
         ACM Transactions on Design Automation of Electronic Systems (TODAES), 2006, Volume 11, Number 3, pp. 659-681.
3.       Binary Synthesis
         G. Stitt and F. Vahid
         Accepted for publication in ACM Transactions on Design Automation of Electronic Systems (TODAES)
4.       Expandable Logic
         G. Stitt, F. Vahid
         Submitted to IEEE/ACM Conference on Design Automation (DAC), 2007.
5.       New Decompilation Techniques for Binary-level Co-processor Generation
         G. Stitt, F. Vahid
         IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2005, pp. 547-554.
6.       Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode
         G. Stitt, F. Vahid, G. McGregor, B. Einloth
         IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2005, pp. 285-290.
7.       A Decompilation Approach to Partitioning Software for Microprocessor/FPGA Platforms.
         G. Stitt and F. Vahid
         IEEE/ACM Design Automation and Test in Europe (DATE), 2005, pp. 396-397.
8.       Dynamic Hardware/Software Partitioning: A First Approach
         G. Stitt, R. Lysecky and F. Vahid
         IEEE/ACM Conference on Design Automation (DAC), 2003, pp. 250-255.




                                                                                                                         26/26

				