Flextream: Adaptive Compilation of Streaming Applications for Heterogeneous Architectures

Amir Hormati¹, Yoonseo Choi¹, Manjunath Kudlur³,
Rodric Rabbah², Trevor Mudge¹, and Scott Mahlke¹

¹ University of Michigan
² IBM T.J. Watson Research Lab.
³ NVIDIA Corp.

                                                            University of Michigan
                                     Electrical Engineering and Computer Science
                                    Cores are the New Gates
(Shekhar Borkar, Intel)

[Figure: number of cores per chip (1 to 512) plotted against year (1975 to 2010) for unicore, homogeneous multicore, and heterogeneous multicore processors, ranging from the 4004 to Larrabee, with programming models listed alongside (C/C++/Java, stream programming, CUDA, X10, Peakstream, Fortress, Accelerator, Ct, CTM, Rstream, Rapidmind). Courtesy: Gordon’06]
    Streaming Computing Is Everywhere!

• Prevalent computing domain with applications in
  embedded systems, desktops and high-end servers




                          StreamIt

• Main Constructs:
   – Filter: Encapsulates computation.
      • Stateful
      • Stateless
   – Pipeline: Expresses pipeline parallelism.
   – Splitjoin: Expresses task/data-level parallelism.

• Exposes different types of parallelism
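
The three constructs can be mimicked in a few lines of Python. This is only an illustrative sketch, not StreamIt syntax: the helper names (`make_filter`, `pipeline`, `splitjoin`) and the round-robin splitter and joiner are assumptions made for the example.

```python
from typing import Callable, List

Stream = List[float]
Stage = Callable[[Stream], Stream]


def make_filter(fn: Callable[[float], float]) -> Stage:
    """Stateless filter: pops one item and pushes one item per firing."""
    return lambda items: [fn(x) for x in items]


def pipeline(*stages: Stage) -> Stage:
    """Pipeline: the output of each stage feeds the next one."""
    def run(items: Stream) -> Stream:
        for stage in stages:
            items = stage(items)
        return items
    return run


def splitjoin(*branches: Stage) -> Stage:
    """Splitjoin: round-robin split, one branch per lane, round-robin join."""
    def run(items: Stream) -> Stream:
        n = len(branches)
        lanes = [branch(items[i::n]) for i, branch in enumerate(branches)]
        out: Stream = []
        # Interleave the lanes back in round-robin order.
        for i in range(max(len(lane) for lane in lanes)):
            for lane in lanes:
                if i < len(lane):
                    out.append(lane[i])
        return out
    return run


double = make_filter(lambda x: 2 * x)
increment = make_filter(lambda x: x + 1)

# A splitjoin of two copies of `double`, followed by `increment`.
graph = pipeline(splitjoin(double, double), increment)
result = graph([1, 2, 3, 4])
```

Because `double` is stateless, running two copies of it inside the splitjoin produces the same stream as running it once, which is why stateless filters can be replicated freely.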

          StreamIt Graph Tuning
• Parallelism can be tuned in streaming programs
  – Horizontal Replication

  – Horizontal Fusion

  – Vertical Fusion




                               StreamIt Example
[Figure: the same stream graph in three forms, annotated with work estimates. Left: A (10), splitter (6), B1 and B2 (43 each), joiner (6), C (246), D (326), E (566), F (10). Middle: the B stage replicated into B1 to B4 (21.5 each). Right: the B stage fused into a single actor B (86), and C, D, and E fused into CDE (1138).]
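
The work estimates in the StreamIt example follow simple arithmetic, sketched below. The helper names are hypothetical, and the sketch assumes that replication divides an actor's work evenly among the copies and that fusion adds up the work of the fused actors, ignoring splitter and joiner overhead.

```python
from typing import List


def replicate(work: float, copies: int) -> List[float]:
    """Horizontal replication: split one actor's work evenly across copies."""
    return [work / copies] * copies


def fuse(*works: float) -> float:
    """Fusion (horizontal or vertical): the fused actor does all the work."""
    return sum(works)


# B (86 units of work in total) replicated two ways and four ways.
b_two_way = replicate(86, 2)   # B1, B2
b_four_way = replicate(86, 4)  # B1 to B4

# Fusing B1 and B2 back together, and fusing C, D, and E vertically.
b_fused = fuse(*b_two_way)
cde_fused = fuse(246, 326, 566)
```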
                     What Are We Solving?
[Figure: a stream graph (A, a splitjoin of B1 to B4, C, a splitjoin of D1 and D2, E, F) that has to be mapped onto four cores, each with its own memory.]

• Graph modulo scheduling is performed on a stream graph statically.

• What happens when resources change dynamically?
              Target Architecture
• Master processor acts as a controller.

• Each slave processor has its own local store and DMA engine.

• An interconnect network connects all the components together.

[Figure: a master processor and memory connected through an interconnect to several slave processors, each with a local store and a DMA engine.]
              Overview of Flextream
Goal: To perform Adaptive Stream Graph Modulo Scheduling.

Input: Streaming Application

• Static: Find an optimal schedule for a virtualized member of a family of processors.
   – Prepass Replication: Adjust the amount of parallelism for the target system by replicating actors.
   – Work Partitioning: Find optimal modulo schedule for a virtualized member of a family of processors.

• Dynamic: Performs light-weight adaptation of the schedule for the current configuration of the target hardware.
   – Partition Refinement: Tunes actor-processor mapping to the real configuration of the underlying hardware. (Load balance)
   – Stage Assignment: Specifies how actors execute in time in the new actor-processor mapping.
   – Buffer Allocation: Tries to efficiently allocate the storage requirements of the new schedule into available memory units.

Output: MSL Commands
  MSL: Multi-Core Streaming Layer
• Instruction set for heterogeneous multi-core systems

• A set of high-level commands for:
   – Actor commands (loading/unloading)
   – Buffer commands (allocating local/global buffers)
   – Data transfer commands (managing DMAs)


• Flextream’s online layer uses these commands to adapt the
  static schedule

          Overall Execution Flow
• Every application may see multiple iterations of:

[Figure: execution flow diagram; its contents did not survive extraction.]
      Prepass Replication [static 1]
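[Figure: the stream graph A (10), B (86), C (246), D (326), E (566), F (10) before and after prepass replication. C is replicated into C0 to C3 (61.5 each), D into D0 and D1 (163 each), and E into E0 to E3 (141.5 each), with splitters S0 to S2 and joiners J0 to J2 (6 each). The actors are mapped onto eight processors, P0 to P7. Before replication the per-processor work is 10, 86, 246, 326, 566, 10, 0, 0; after replication the mapping is P0: A, E0 (151.5); P1: B, C0 (147.5); P2: C1, C2, C3 (184.5); P3: D0 (163); P4: E1 (141.5); P5: F, E2 (151.5); P6: E3 (141.5); P7: D1 (163).]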
          Work Partitioning [static 2]
• Finds an optimal actor-to-processor mapping, considering:
   – Actors’ work estimates
   – Communication cost
   – DMA cost
   – Memory requirements


• At the end, each actor is assigned to exactly one
  processor.
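
As a rough illustration of the quantity being balanced, the sketch below evaluates one actor-to-processor mapping by its most heavily loaded processor. It is not Flextream's partitioner: it only sums the work estimates and mapping read off the prepass replication example, and it leaves out communication cost, DMA cost, and memory requirements. The function name `processor_loads` is made up for the example.

```python
from typing import Dict, List

# Work estimates taken from the prepass replication example.
WORK: Dict[str, float] = {
    "A": 10, "B": 86, "F": 10,
    "C0": 61.5, "C1": 61.5, "C2": 61.5, "C3": 61.5,
    "D0": 163, "D1": 163,
    "E0": 141.5, "E1": 141.5, "E2": 141.5, "E3": 141.5,
}

# One mapping in which each actor is assigned to exactly one processor.
MAPPING: Dict[str, List[str]] = {
    "P0": ["A", "E0"], "P1": ["B", "C0"], "P2": ["C1", "C2", "C3"],
    "P3": ["D0"], "P4": ["E1"], "P5": ["F", "E2"],
    "P6": ["E3"], "P7": ["D1"],
}


def processor_loads(mapping: Dict[str, List[str]],
                    work: Dict[str, float]) -> Dict[str, float]:
    """Sum the work estimates of the actors placed on each processor."""
    return {p: sum(work[a] for a in actors) for p, actors in mapping.items()}


loads = processor_loads(MAPPING, WORK)
# The most heavily loaded processor bounds the steady-state throughput.
bottleneck = max(loads.values())
```

A real partitioner would add the communication and DMA terms to each processor's load and reject mappings that exceed the local memory of a processor.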
   Partition Refinement [dynamic 1]
• The resources available at runtime can be more limited than
  the resources of the static target architecture.

• Partition refinement tunes the actor-to-processor
  mapping for the active configuration.

• A greedy iterative algorithm is used to achieve this
  goal.
        Partition Refinement Example
• Pick the processors with the largest number of actors.

• Sort the actors.

• Find the processor with the maximum work.

• Assign the minimum actors until the threshold is reached.

[Figure: animation of the replicated graph's actors being remapped across processors P0 to P7, with per-processor work totals; the sorted actor list shown is E2, B, C0, C1, C2, C3, S2, S1, J1, J2, S0, J0.]
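
A greedy refinement of this kind can be sketched as follows. This is a simplified stand-in, not the exact procedure from the slides: when processors become unavailable, it reassigns their actors, heaviest first, to whichever remaining processor currently has the least work. The function name `refine` and the sample mapping are assumptions for illustration.

```python
from typing import Dict, List, Set


def refine(mapping: Dict[str, List[str]], work: Dict[str, float],
           removed: Set[str]) -> Dict[str, List[str]]:
    """Greedily move actors off removed processors onto the remaining ones."""
    remaining = {p: list(actors) for p, actors in mapping.items()
                 if p not in removed}
    if not remaining:
        raise ValueError("at least one processor must remain")
    loads = {p: sum(work[a] for a in actors)
             for p, actors in remaining.items()}

    # Actors that lost their processor, heaviest first.
    orphans = sorted((a for p in removed for a in mapping[p]),
                     key=lambda a: work[a], reverse=True)
    for actor in orphans:
        target = min(loads, key=loads.get)  # least-loaded remaining processor
        remaining[target].append(actor)
        loads[target] += work[actor]
    return remaining


work = {"A": 10, "B": 86, "C": 246, "D": 326}
static_mapping = {"P0": ["A"], "P1": ["B"], "P2": ["C"], "P3": ["D"]}

# P2 and P3 become unavailable at run time.
new_mapping = refine(static_mapping, work, removed={"P2", "P3"})
```

The pass is linear in the number of moved actors apart from the sort, which is the kind of cost that makes a run-time adjustment cheaper than repartitioning the whole graph.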
      Stage Assignment [dynamic 2]
• Processor assignment only specifies how actors are
  overlapped across processors.

• Stage assignment finds how actors are overlapped in time.

• The relative start time of each actor is determined by its stage number.

• DMA operations are assigned a separate stage.
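
One way to realize these rules is sketched below. It assumes (the slides do not spell this out) that a consumer on the same processor as its producer may share the producer's stage, while a consumer on a different processor starts two stages later, so that the DMA transfer gets the stage in between. The function name `assign_stages` and the small example graph are hypothetical.

```python
from typing import Dict, List, Tuple


def assign_stages(order: List[str], edges: List[Tuple[str, str]],
                  proc: Dict[str, str]) -> Dict[str, int]:
    """Assign a stage to each actor; `order` must be a topological order."""
    stage = {actor: 0 for actor in order}
    for actor in order:
        for src, dst in edges:
            if dst != actor:
                continue
            # Same processor: no DMA needed. Otherwise leave one DMA stage.
            gap = 0 if proc[src] == proc[dst] else 2
            stage[actor] = max(stage[actor], stage[src] + gap)
    return stage


edges = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "E"), ("D", "E")]
proc = {"A": "P0", "B": "P1", "C": "P1", "D": "P2", "E": "P2"}
stages = assign_stages(["A", "B", "C", "D", "E"], edges, proc)
```

In this example B and C share a stage because they sit on the same processor, while D is pushed two stages past B to make room for the transfer between P1 and P2.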

    Stage Assignment Example
[Figure: stage numbers assigned to the replicated stream graph, shown next to the actor-to-processor mapping: A (0), B (2), S0 (4), C0 to C3 (6), J0 and S1 (8), D0 and D1 (10), J1 and S2 (12), E0 to E3 (14), J2 (16), F (18).]
      Buffer Allocation [dynamic 3]
• Slave processors have limited local store.

• Local store is faster than main memory.

• Utilize local stores first, then spill to main
  memory.

• When buffers spill, the DMA operations have to be adjusted.
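
The policy on this slide can be sketched as a greedy pass over the buffers. The largest-first order, the function name `allocate_buffers`, and the sample buffer sizes are assumptions made for illustration; the slides only state that local stores are used first and that the rest spills to main memory.

```python
from typing import Dict, List, Tuple


def allocate_buffers(buffers: Dict[str, int],
                     local_capacity: int) -> Tuple[List[str], List[str]]:
    """Place buffers in the local store while they fit; spill the rest."""
    local: List[str] = []
    spilled: List[str] = []
    free = local_capacity
    by_size = sorted(buffers.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in by_size:
        if size <= free:
            local.append(name)
            free -= size
        else:
            # Lives in main memory, so its DMA transfers must be redirected.
            spilled.append(name)
    return local, spilled


# Hypothetical buffer sizes (in K) for one slave with a 128K local store.
buffers = {"A->B": 64, "B->C": 48, "C->D": 32, "D->E": 16}
local, spilled = allocate_buffers(buffers, local_capacity=128)
```

Here the 32K buffer no longer fits after the two larger ones are placed, so it spills, while the 16K buffer still fits in the space that is left.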
                     Methodology
• StreamIt Compiler

• Metis for graph partitioning

• 32-core heterogeneous distributed-memory multi-core
  system

• Each slave core has a DMA engine and a 128K local store

• System simulator to simulate the interconnect traffic
                        Performance Comparison (DES)
[Figure: relative speedup (0 to 35) versus number of cores (2 to 32) for DES, comparing Full Static, Graph Partitioning, and Flextream.]
                                Performance Comparison
[Figure: slowdown (%, 0 to 25) of the Graph Partitioning approach and the Flextream approach for bitonic, dct, des, fft, filter bank, fm, matrix mult., mpeg2, serpent, tde, and the average.]
                 Dynamic Approach Time Comparison
[Figure: time in ms (0 to 12) of Flextream Refinement… versus the Graph Partitioner Approach for bitonic, dct, des, fft, filter bank, fm, matrix mult., mpeg2, serpent, tde, and the average.]
                                              Overhead Comparison
[Figure: fraction of time allocated (axis from 0.90 to 1.00) to Prepass Replication, Work Refinement Time, Stage Assignment Time, and Buffer Allocation Time for bitonic, dct, des, fft, filter bank, fm, matrix mult., mpeg2, serpent, tde, and the average. The numeric data labels cannot be reliably matched to bars in the extracted text.]
                  Conclusion
• Static scheduling approaches are promising but not
  enough.

• Dynamic adaptation is necessary for future
  systems.

• Flextream provides a hybrid static/dynamic
  approach to improve efficiency.
                                                          Effect of Buffer Allocation on Performance
[Figure: relative performance (0 to 1) for bitonic, dct, des, fft, filter bank, fm, matrix mult., mpeg2, serpent, tde, and the average, under six memory sizes: Min Mem, Min Mem + (Max Mem - Min Mem)/5, Min Mem + 2(Max Mem - Min Mem)/5, Min Mem + 3(Max Mem - Min Mem)/5, Min Mem + 4(Max Mem - Min Mem)/5, and Max Mem.]
          Prepass Replication
[Figure: the stream graph before and after prepass replication, followed by the mapping of actors onto processors P0 to P7 (same example as in "Prepass Replication [static 1]").]
                    Outline
• Streaming Background

• Flextream’s Approach
  – Static phase
  – Dynamic phase


• Evaluation

• Conclusion
                Introduction
• Single-core performance
  has stopped scaling.

• Multi-core and many-core
  systems are everywhere.

• These systems have different
  configurations.

• Resource management is a
  challenging problem.

[Figure: Cell processor and Intel Larrabee]

				