Compilation for Scalable, Paged Virtual Hardware

Eylon Caspi
Qualifying Exam
3/6/01
University of California, Berkeley

  The Compilation Problem
 Programming Model
    • Communicating EFSM operators
      - unrestricted size, # IOs, timing
 Execution Model
    • Communicating page configs
      - fixed size, # IOs, timing
    • Paged virtual hardware
 Compilation is a resource-binding transform on state machines + data-paths
 [Figure: TDF operators, streams, and memory segments are compiled into compute pages, streams, and memory segments]
Overview

 Motivation
          Paged virtual hardware – software survival + scalability
          SCORE programming model
 Compilation methodology
          New page partitioning techniques
          Automatic synthesis & partitioning of communicating FSMs
 Evaluation + Architectural Studies
 Timeline


3/6/01                     Eylon Caspi – Qualifying Exam              3
Reconfigurable Computing
 Programmable logic +
  Programmable interconnect (e.g. FPGA)
 10x-100x gain vs. microprocessors in:
          Performance
          Functional density (work per area-time)
 Spatial Computing
          Parallelism; custom data paths
 Programmability
          Custom execution sequence; specialization
 BUT current models expose resource constraints
  to the programmer
          Programmer has to target a specific device
          Limits software longevity
Solution: Virtual Hardware
 Compute model with unbounded resources
    Programmer no longer targets a specific device
            Enables software longevity, scalability
 Requires efficient hardware virtualization
    Large device → concurrent spatial execution
    Small device → time multiplexing
    Paging model
Previous Approaches to Paging
 WASMII: Register IO
          [Ling+Amano, FCCM '93]
          Page IO via registers
          Evaluate each page for a cycle, then reconfigure
          Reconfiguration time dominates execution

 DPGA: Configuration Cache
          [DeHon, FPGA '94], TM-FPGA [Xilinx, FCCM '97]
          Fast reconfiguration → area, power
          Reconfiguration power dominates execution

 PipeRench: Stripes
          [CMU, FPGA '98]
          Pipelined reconfiguration
          Feed-forward computation only
Paging + Streaming
 Streaming allows efficient, useful virtualization
          Amortizes reconfiguration cost over a larger epoch
          Exploits program structure
          Less restrictive communication topology
 Compiler and scheduler's joint responsibility
 [Figure: pages and buffers, with swaps over time]
SCORE Compute Model
 Program = DFG of compute nodes
    Kahn process network
               blocking read, non-blocking write
 Compute: SFSM (Streaming Finite State Machine)
    Concretely: page + FSM to implement token-flow semantics
    Abstractly: task with local control
 Communication: Stream
    Abstraction of wire, with buffering
 Storage: Memory Segment
 Dynamics:
          Dynamic local behavior in SFSM
          Unbounded resource usage: stream buffer expansion
          Dynamic graph allocation in STM (Streaming Turing Machine)
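
A minimal sketch of the stream discipline named above (blocking read, non-blocking write), assuming Python threads stand in for compute nodes and unbounded queues stand in for streams. The operator names `scale` and `add`, the `END` marker, and `run_graph` are hypothetical and not part of SCORE.

```python
import queue
import threading

END = object()  # hypothetical end-of-stream marker, not part of SCORE


def scale(inp, out, k):
    """Compute node: multiply every token by k."""
    while True:
        tok = inp.get()           # blocking read: waits until a token arrives
        if tok is END:
            out.put(END)
            return
        out.put(tok * k)          # write never blocks: the queue is unbounded


def add(in_a, in_b, out):
    """Compute node: add tokens pairwise from two streams."""
    while True:
        a = in_a.get()
        b = in_b.get()
        if a is END or b is END:
            out.put(END)
            return
        out.put(a + b)


def run_graph(xs, ys):
    """Build the graph  xs -> scale(2) -> add <- ys  and collect the output."""
    sx, sy, sx2, so = (queue.Queue() for _ in range(4))
    threads = [
        threading.Thread(target=scale, args=(sx, sx2, 2)),
        threading.Thread(target=add, args=(sx2, sy, so)),
    ]
    for t in threads:
        t.start()
    for x in xs:
        sx.put(x)
    sx.put(END)
    for y in ys:
        sy.put(y)
    sy.put(END)
    result = []
    while True:
        tok = so.get()
        if tok is END:
            break
        result.append(tok)
    for t in threads:
        t.join()
    return result
```

Because every read blocks and no node can test a stream for emptiness, the output does not depend on thread timing, which is the property that lets a scheduler run the nodes spatially or time-multiplexed.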
SCORE Programming Model: TDF

 TDF = intermediate, behavioral language for:
    • EFSM Operators
    • Static operator graphs
 State machine for:
    • Firing signatures
    • Control flow (branching)
 Firing semantics:
    When in state X, wait for X's inputs, then fire (consume, act)
    select ( input boolean     s,
             input unsigned[8] t,
             input unsigned[8] f,
            output unsigned[8] o )
    {
      state S (s) :
        if (s) goto T; else goto F;
      state T (t) :
        o=t; goto S;
      state F (f) :
        o=f; goto S;
    }
 [Figure: select operator with input streams s, t, f and output stream o]
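
The firing semantics of the select operator can be illustrated with a small simulation. This is a sketch in Python rather than TDF, assuming finite lists stand in for streams; `SIGNATURE` and `run_select` are hypothetical names. A state fires only when every stream in its input signature holds a token, and otherwise the operator stalls in that state.

```python
from collections import deque

# Input signature of each state: the streams that must hold a token to fire.
SIGNATURE = {"S": ("s",), "T": ("t",), "F": ("f",)}


def run_select(s, t, f):
    """Run the select operator until the current state can no longer fire.

    Returns the emitted output tokens and the state in which it stalled.
    """
    streams = {"s": deque(s), "t": deque(t), "f": deque(f)}
    out = []
    state = "S"
    # Fire only while every stream in the current signature has a token.
    while all(streams[name] for name in SIGNATURE[state]):
        if state == "S":
            sel = streams["s"].popleft()      # consume the selector token
            state = "T" if sel else "F"
        elif state == "T":
            out.append(streams["t"].popleft())
            state = "S"
        else:  # state F
            out.append(streams["f"].popleft())
            state = "S"
    return out, state
```

For example, `run_select([True, False, True], [1, 2], [9])` emits `[1, 9, 2]` and stalls in state S once the selector stream is empty.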
SCORE Hardware Model
 Paged FPGA
    Compute Page (CP)
          Fixed-size slice of RC hardware
          Fixed number of I/O ports
    Distributed, on-chip memory
          Configurable Memory Block (CMB)
          Stream access
    High-level interconnect
 Microprocessor
    Run-time support + user code
SCORE Software Infrastructure
 Device Simulator
          Cycle-accurate behavioral simulation
          Parameterized (e.g. #pages)
          Interact with concurrent user processes (STMs) via stream API
 Page Scheduler
          Version 1: dynamic, list-based scheduling (by input availability)
          Version 2: static, precedence-based
 TDF Compiler
          Compiles to working C++ simulation code
          No partitioning (page = 1 TDF operator)
 Applications
          Wavelet, JPEG, MPEG, IIR
 [Figure: run time vs. device size]
Communication is King

 With virtualization,
  Inter-page delay is unknown, sensitive to:
            Placement
            Interconnect implementation
            Page schedule
            Technology – wire delay is growing


 Inter-page feedback is SLOW
          Partition to contain FB loops in page
          Schedule to contain FB loops on device




Structural Partitioning is Not Enough

 Structural partitioning does not address
  feedback loops
    Wire min-cut
               FM, flow-based
          Minimum wire length
               Spectral
          Delay-optimal DAG mapping
               DAGON, FlowMap, Wong


 Structural partitioning does not address
  communication rates, dynamics
    All loops are NOT created equal

FSM Decomposition is not enough
 Ashar+Devadas+Newton (ICCAD '89)
          Minimize logic
 Kuo+Liu+Cheng (ISCAS '95)
          Minimize wires
 Benini+DeMicheli+Vermeulen (ISCAS '98)
          Minimize power
 [Figures: sub-machines Ma, Mb for each approach; Fa, Fb additionally for the low-power approach]
 None consider inter-page delay
 None consider cutting / scheduling data-path separately from FSM
Outline

 Motivation
 Compilation Methodology
 Evaluation + Architectural Studies
 Time Line




Compilation – Scope
 Synthesis + Partitioning of SFSMs
          TDF → Pages
          Resource binding
 [Figure: TDF operators, streams, and memory segments are compiled into compute pages, streams, and memory segments]

 Target
          Parameterized hardware model / simulation

 Constrained optimization problem
          Constraints
               page area, IO, timing
          Optimality Criteria
             Primary:           Communication delay
             Secondary:         Communication bandwidth, Area

Compilation Flow Overview
(1) Optimizations
(2) Data path timing + scheduling
(3) Partitioning

 Ignore:
    Place / route / retime in page
          Known solutions in the community
    Page scheduling
          Responsibility of separate scheduler
Synthesis + Partitioning Flow
 [Figure: flow diagram, steps in order; "p" marks partitioning steps]
    Compiler Optimizations
    Pipeline Extraction (p)
    Data Path Mapping
    Schedule DF into States
    Partition Large States (p)
    Cluster States (p)
    Page Packing (p)
    Synthesize Page FSMs
 Side labels in figure: Optimization, Data-path, Partitioning, Preliminary Code
 How Big is an Operator?
 [Figure: Area for 47 Operators (Before Pipeline Extraction). Stacked bars: FSM Area, DF Area. Y-axis: Area (4-LUTs), 0 to 3500. X-axis: Operator (sorted by area). Legend: JPEG Encode, JPEG Decode, MPEG (I), MPEG (P), Wavelet Encode, Wavelet Decode, IIR]
Partitioning Tasks
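 (1) Decompose / shrink SFSMs
          Pipeline Extraction (p)
          Partition Large States (p)
          Cluster States (p)
 (2) Pack SFSMs onto page
          Page Packing (p)
 [Figure: compilation flow with partitioning steps marked "p": Compiler Optimizations, Pipeline Extraction, Data Path Mapping, Schedule DF into States, Partition Large States, Cluster States, Page Packing, Synthesize Page FSMs]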
 Pipeline Extraction
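  Hoist uncontrolled feed-forward (FF) data-flow out of FSMD
  Benefits:
     Shrink FSM cyclic core
     Extracted pipeline has more freedom for scheduling and partitioning
  [Figure: before extraction, the state reads x and evaluates x==0 in its own data flow (DF) and control flow (CF); after extraction, a pipeline computes xz from x and the state reads xz]

  Before extraction:                        After extraction:
state foo(x):                             state foo(xz):
  if (x==0)...                              if (xz) ...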
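
A minimal sketch of why the extraction preserves behavior, assuming lists stand in for streams. The branch labels and function names are hypothetical; only the `x==0` test and the `xz` stream come from the slide.

```python
def fsm_before(xs):
    """state foo(x): the comparison x==0 is evaluated inside the state."""
    trace = []
    for x in xs:
        trace.append("zero_branch" if x == 0 else "nonzero_branch")
    return trace


def pipeline_xz(xs):
    """Extracted feed-forward pipeline: maps stream x to boolean stream xz.

    It has no state and no feedback, so it can be scheduled and
    partitioned independently of the residual FSM.
    """
    return [x == 0 for x in xs]


def fsm_after(xzs):
    """state foo(xz): the residual FSM only branches on the precomputed bit."""
    return ["zero_branch" if xz else "nonzero_branch" for xz in xzs]


def equivalent(xs):
    """Check that hoisting the comparison leaves the control trace unchanged."""
    return fsm_before(xs) == fsm_after(pipeline_xz(xs))
```

The residual state machine now consumes a 1-bit stream instead of the full-width `x`, which is where the area reduction in the FSM's cyclic core comes from.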
 Pipeline Extraction – Extractable Area
 [Figure: Extractable Data-Path Area for 47 Operators. Stacked bars: Extracted DF Area, Residual DF Area. Y-axis: Area (4-LUTs), 0 to 3500. X-axis: Operator (sorted by data-path area). Legend: JPEG Encode, JPEG Decode, MPEG (I), MPEG (P), Wavelet Encode, IIR]
 Pipeline Extraction – Residual SFSM

 [Figure: Area for 47 Operators (After Pipeline Extraction). Stacked bars: FSM Area, Residual DF Area. Y-axis: Area (4-LUTs), 0 to 3000. X-axis: Operator (sorted by area). Legend: JPEG Encode, JPEG Decode, MPEG (I), MPEG (P), Wavelet Encode, IIR]
Data-path Mapping / Scheduling

 Task:
    Bind technology-specific area/time to data-path primitives
    Schedule data-path primitives in state machine
 Fixed-frequency target
    Decompose primitives into multi-cycle operations
    Data-path module library / tree matching
 Pipeline linearized sequences / loops
    DAG mapping state logic is insufficient
 Compiler technology
    Code motion
    Software pipelining

Delay-Oriented State Clustering
 Indivisible unit: state (CF+DF)
          Spatial locality in state logic
 Cluster states into page-size
   sub-machines
          Inter-page communication for
            data flow, state flow
 Sequential delay is in
   inter-page state transfer
          Cluster to maintain local control
          Cluster to contain state loops
 Similar to:
    VLIW trace scheduling               [Fisher '81]
    FSM decomp. for low power           [Benini/DeMicheli ISCAS '98]
    VM/cache code placement
    GarpCC HW/SW partitioning           [Callahan '00]
State Clustering Formulation
 Min-cut transition probabilities in state flow graph
          Probabilities from profiling
 Area-constrained
          Balanced min-cut partitioning [Yang+Wong, ACM '94]
          Iterate to desired partition area: $(1-\epsilon)A \le a(X) \le (1+\epsilon)A$
 IO-constrained
          Add wire edges
          Mix edge weights: $c \, w_{wire} + (1-c) \, w_{SF}$
          Use smallest IO-feasible $c$
 Requires all states to be smaller than page
 [Figure: state flow graph labeled with a1 to a4, p1 to p5, w1 to w9]
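
The formulation above can be sketched on a toy graph. This example uses exhaustive search in place of the flow-based balanced min-cut of Yang+Wong, so it only illustrates the objective and the area constraint, and it does not perform the IO-feasibility check that selects `c`. All function names, and the tolerance name `eps`, are assumptions.

```python
from itertools import combinations


def cut_weight(edges, part_x):
    """Sum the weights of edges with exactly one endpoint in part_x."""
    return sum(w for (u, v), w in edges.items()
               if (u in part_x) != (v in part_x))


def mixed_edges(w_sf, w_wire, c):
    """Mix edge weights as c*w_wire + (1-c)*w_SF over the union of edges."""
    keys = set(w_sf) | set(w_wire)
    return {k: c * w_wire.get(k, 0.0) + (1 - c) * w_sf.get(k, 0.0)
            for k in keys}


def best_bipartition(area, edges, target, eps):
    """Exhaustively find the min-cut set X with
    (1-eps)*target <= a(X) <= (1+eps)*target. Returns (cost, X) or None."""
    states = sorted(area)
    best = None
    for k in range(1, len(states)):
        for subset in combinations(states, k):
            x = set(subset)
            a_x = sum(area[s] for s in x)
            if not (1 - eps) * target <= a_x <= (1 + eps) * target:
                continue                     # violates the area balance
            cost = cut_weight(edges, x)
            if best is None or cost < best[0]:
                best = (cost, x)
    return best
```

With four states of area 100 each, transition probabilities `{("A","B"): 0.9, ("B","C"): 0.1, ("C","D"): 0.8, ("D","A"): 0.1}`, `target=200`, and `eps=0.1`, the search keeps the frequently taken transitions A-B and C-D inside a partition and cuts only the two rare ones.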

Page Packing
 Cluster SFSMs + pipelines
    Avoid page fragmentation
 Min-cut streams of top-level DFG
    Allow cutting pipelines, not SFSMs
    Area and IO constrained (Wong balanced min-cut partition)
    Disallow certain topologies
            No dynamic-rate streams in page
      Data-flow feedback?




Outline

 Motivation
 Compilation Methodology
 Evaluation + Architectural Studies
 Time Line




Evaluating Paging Overhead
 Applications
    Must be rewritten in TDF
    Existing: Wavelet, JPEG, MPEG, IIR
    To do: ADPCM, BABAR particle detector
 Metrics
    Circuit area (#pages x page-size)
    Page delay (LUT depth per firing)
    Performance (total run-time, “makespan”)
 Baseline comparison
    “Unpartitioned”: page = 1 TDF operator
            Ideal virtualization with zero partitioning cost – cannot do better
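
As a worked illustration of the area metric and the baseline comparison, the sketch below computes circuit area as #pages x page-size and expresses it as overhead relative to the unpartitioned baseline. The operator areas are invented for the example, and the function names are hypothetical. The page count assumes each operator is split on its own with no page packing and no IO or timing limits.

```python
import math


def circuit_area(num_pages, page_size):
    """Circuit area metric: number of pages times page size (in 4-LUTs)."""
    return num_pages * page_size


def pages_if_split_separately(operator_areas, page_size):
    """Pages needed when each operator is split on its own (no packing)."""
    return sum(math.ceil(a / page_size) for a in operator_areas)


def overhead(measured, baseline):
    """Relative overhead of a measured value over the baseline value."""
    return measured / baseline - 1.0


# Hypothetical operator areas in 4-LUTs; not measurements from this work.
example_areas = [300, 900, 1400]
example_pages = pages_if_split_separately(example_areas, 512)
example_overhead = overhead(circuit_area(example_pages, 512),
                            sum(example_areas))
```

Here three operators totalling 2600 4-LUTs occupy 6 pages of 512 4-LUTs, or 3072 4-LUTs, an area overhead of roughly 18% caused by page fragmentation alone.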


Page Size Studies
 Paging overhead varies with:
    • Application
    • Page size, IO
    • Match thereof
 Is paging overhead robust to a mismatch?
 Vary page parameters, measure:
    (1) Pure area overhead
    (2) Pure performance overhead
          Execute spatially in expanded hardware
    (3) Virtualized performance overhead
          Execute in fixed device size
 [Figure: three panels, (1), (2), (3)]
Outline

 Motivation
 Compilation Methodology
 Evaluation + Architectural Studies
 Time Line




Status

 SCORE compiler / simulator / scheduler
    Compile+execute unpartitioned (page = 1 TDF op)
 Preliminary synthesis + partitioning work
    Pipeline extraction
    FSM synthesis to SIS
    Area-constrained state clustering
 To do
    Complete initial implementation
    Evaluate
    Improve – secondary implementation

To Complete Initial Implementation

 IO-constrained state clustering
 Decompose large states
 Page packing
 Data-path scheduling in states
 Synthesize partitioned SFSMs




Secondary Implementation – Possibilities


 Optimizations
    SW pipelining
    Use SUIF
 State clustering with replication
 Unified state clustering + page packing
    Cluster states of all operators simultaneously
 Finer-grained clustering
    Recast as BDF, min-cut stream rates


Time Line

 [Figure: timeline from month 3 of 2001 through month 8 of 2002, in order: Impl. 1, Eval, Impl. 2, Eval, Thesis writing]
Summary

 Partitioning and paging enable
    Software survival / scaling
    Efficient use of small HW for dynamic apps

 My Contributions
    Methodology for page synthesis + partitioning
          Necessary for efficient virtualization
    Evaluation framework
          Verify that paging can be efficient
    Architectural studies
Supplemental Material
   SFSMs + transforms
   SCORE simulation + scaling results
   Page hardware model
   Synthesis observations
   Architectural studies




TDF  Dataflow Process Network

 Dataflow Process Network [Parks+Lee, IEEE May '95]
      Process enabled by set of firing rules: $R = \{R_1, R_2, \ldots, R_N\}$
      Firing rule = set of patterns:          $R_i = \{R_{i,1}, R_{i,2}, \ldots, R_{i,p}\}$

 DF process for a TDF operator:
      Feedback arc for state
      [Figure: process with a feedback arc labeled "state"]
      One firing rule per state
          Patterns match state value + presence of desired inputs
          E.g. for state i: $R_i = \{R_{i,1}, R_{i,2}, \ldots, [i]\}$
          Patterns: $R_{i,j} = [*]$    if input j is in state i's input signature
                    $R_{i,j} = \bot$   if input j is not in state i's input signature
                    $R_{i,p} = [i]$    for final input, representing state arc
      These are sequential firing rules
      Partitioned SFSM adds "wait" state
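
The construction of one firing rule per state can be sketched as follows, using the select operator's states and signatures as the example. The string encodings of the patterns and the helper names `firing_rules` and `enabled` are assumptions made for illustration.

```python
STAR = "[*]"       # pattern: one token of any value must be present
BOTTOM = "bottom"  # pattern: input not in the signature, no token required


def firing_rules(inputs, signatures):
    """Build one firing rule per state.

    inputs:     ordered list of input stream names.
    signatures: dict mapping state name -> set of inputs read in that state.
    Each rule holds one pattern per input plus a final pattern that
    matches the state value carried on the feedback (state) arc.
    """
    rules = {}
    for state, sig in signatures.items():
        patterns = [STAR if name in sig else BOTTOM for name in inputs]
        patterns.append(f"[{state}]")
        rules[state] = patterns
    return rules


def enabled(rule, available, current_state):
    """A rule is enabled when the state arc matches and every [*] input
    has at least one token available."""
    *input_patterns, state_pattern = rule
    if state_pattern != f"[{current_state}]":
        return False
    return all(n > 0
               for pat, n in zip(input_patterns, available)
               if pat == STAR)
```

For select, `firing_rules(["s", "t", "f"], {"S": {"s"}, "T": {"t"}, "F": {"f"}})` gives state T the rule `["bottom", "[*]", "bottom", "[T]"]`, so T is enabled by a token on `t` alone, whatever is waiting on `s` and `f`.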
SFSM Partitioning Transform
   Only 1 partition active at a time
          Transform to activate via streams
   New state in each partition: "wait"
          Used when not active
          Waits for activation from other partition(s)
          Has one input signature (firing rule) per activator
   Firing rules are not sequential, but determinism guaranteed
          Only 1 possible activator
   Activation streams from given source to given dest. partitions can be merged + binary-encoded
   [Figure: SFSM with states A, B, C, D split into partitions {A,B} and {C,D}; each partition gains a wait state (Wait AB, Wait CD) and activation streams labeled {A,B} and {C,D}]
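
A minimal simulation of the partitioning transform, assuming a simplified SFSM whose next state is fixed per state (a real SFSM's transitions depend on input tokens). Each partition has a "wait" state and one merged activation stream whose tokens name the state to resume in. The function and argument names are hypothetical.

```python
from collections import deque


def run_partitioned(next_state, partition_of, start, steps):
    """Simulate a partitioned SFSM; return the sequence of active states."""
    parts = sorted(set(partition_of.values()))
    local = {p: "wait" for p in parts}         # every partition starts idle
    local[partition_of[start]] = start         # except the one holding start
    activation = {p: deque() for p in parts}   # merged activation streams
    trace = []
    for _ in range(steps):
        # A waiting partition fires only when an activation token arrives.
        for p in parts:
            if local[p] == "wait" and activation[p]:
                local[p] = activation[p].popleft()
        active = [p for p in parts if local[p] != "wait"]
        assert len(active) == 1, "exactly one partition may be active"
        p = active[0]
        state = local[p]
        trace.append(state)
        nxt = next_state[state]
        if partition_of[nxt] == p:
            local[p] = nxt                     # local transition
        else:
            # Inter-page state transfer: the token encodes the target state.
            activation[partition_of[nxt]].append(nxt)
            local[p] = "wait"
    return trace
```

With the cycle A → B → C → D → A split into partitions {A,B} and {C,D}, the partitioned machine visits the states in the same order as the original, and control crosses pages only on the B → C and D → A transitions.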
Distributing/Collecting Shared Streams

   Requires inter-page synchronization for ordering
   Two schemes for input distribution
          (1) send token to all pages
                  – Inactive pages must discard tokens, must know how many to discard
          (2) send token only to active page
                  – Distributor must know state
                  – (a) present state requests token OR
                  – (b) previous state pre-fetches token

   One scheme for output collection
                  – Collector must know state

   How to cluster distributors / collectors?
          Distributor scheme (1) and collector incur no sequential delay (wire min-cut ok)
          Distributor scheme (2)(a) can be cast into delay-optimal state clustering:
                  – Decompose reading states into sequences of single-read states
                  – Pre-cluster states that read same stream – this forms distributors
                  – Sequential delay of read request is now modeled as state transfer to distributor
   [Figure: input stream i distributed to states A, B, C, D; outputs collected into stream o]
Decomposing Large States

 A state may be larger than a page


 Decomposing into a sequence
  of page-size states leads to
  excessive inter-page transfer


 Better: delay-optimal DAG-
  mapping into parallel pages



SFSM Optimizations
 Many traditional compiler optimization
  techniques apply to TDF
    State flow ~ basic block flow
    Different cost model
            “Unlimited” registers and functional units
 E.g. work-reducing optimizations
   Constant folding / propagation
   Common subexpression elimination
   Hoist loop invariants
   Strength reduction
SCORE Functional Simulation
 FPGA based on HSRA [Berkeley, FPGA '99]
    CP:  512 4-LUTs
    CMB: 2Mbit DRAM
    Area for CP-CMB pair:
          0.25 µm: 12.9 mm^2 (1/9 of PII-450)
          0.18 µm:  6.7 mm^2 (1/16 of PIII-600)
    Page reconfiguration: 5000 cycles (from CMB)
    Synchronous operation (same clock speed as processor)
 x86 microprocessor
 Page Scheduler task
    Swap on timer interrupt (every 250,000 cycles)
    Fully dynamic scheduling
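
The simulation parameters above allow a quick estimate of how well a long epoch amortizes reconfiguration. The sketch assumes a page is reconfigured at most once per timer period and that the 5000 reconfiguration cycles are counted inside the 250,000-cycle period. Both are assumptions of this example, and `reconfig_fraction` is a hypothetical helper.

```python
def reconfig_fraction(reconfig_cycles, useful_cycles):
    """Fraction of total time spent reconfiguring rather than computing."""
    return reconfig_cycles / (reconfig_cycles + useful_cycles)


# One reconfiguration per 250,000-cycle timer period (assumed).
per_epoch = reconfig_fraction(5000, 250_000 - 5000)

# Contrast: reconfiguring after every single evaluated cycle.
per_cycle = reconfig_fraction(5000, 1)
```

Under these assumptions reconfiguration costs about 2% of run time per page, whereas reconfiguring after every cycle would spend more than 99.9% of the time reconfiguring.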
Application: JPEG Encode




Scaling Results: JPEG Encode
 [Figure: Total Time (makespan in millions of cycles) vs. Physical Compute Pages]
Page Hardware Model
 Page = fixed-size slice of resources + stream interface
 FSM for:
    • Firing
    • Output emission
    • Data-path control
    • Branching
 [Figure: page containing an FSM; legend: Reconfigurable, Fixed logic]
Page Firing Logic
 Sample firing logic
          3 inputs (A,B,C)
          3 outputs (X,Y,Z)
          Single signature




  How Large is a State?
 [Figure: Histogram of Data-Path Area Per State (1404 States from 5 Applications). X-axis: Data-Path Area (4-LUTs). Y-axis: Count. Legend: JPEG Encode, JPEG Decode, MPEG (I), MPEG (P), IIR]
             SFSM Firing Delay
                Complex SFSM may require ≥1 cycle just for control
                   Evaluate firing rule, generate control signals, compute next state
                Should we partition SFSM to minimize FSM logic?
                No – incurring inter-page communication latency is worse!

[Figure: Histogram of FSM Delay for 47 Operators (unpartitioned). x-axis: Delay (4-LUT depth), 0 to 7; y-axis: Count]
[Figure: Histogram of FSM Inputs for 47 Operators (unpartitioned). x-axis: Number of Inputs, 0 to 70; y-axis: Count]
Applications: JPEG Encode, JPEG Decode, MPEG (I), MPEG (P), Wavelet Encode, IIR
Scaling the Hardware Resources

 • A simplified scaling model for architectural studies
 • Scaling page size (LUTs) induces scaling of other resources, e.g.:
     - Scaling memory: constant CP-to-CMB ratio
     - Scaling page IO: Rent's Rule, IO = C·A^p (0 ≤ p ≤ 1)
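
As a worked illustration of the Rent's Rule relation above, the sketch below computes page IO from page area. The function name `rent_io` and the sample values of C and p are assumptions chosen for illustration; the slides do not give the constants used in the study.

```python
def rent_io(area_luts: int, c: float, p: float) -> int:
    """Page IO count predicted by Rent's Rule, IO = C * A**p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("Rent exponent p must lie in [0, 1]")
    return round(c * area_luts ** p)

# Hypothetical constants: C = 2.0, p = 0.5.
# With p = 0.5, quadrupling the page area doubles the page IO.
io_small = rent_io(256, c=2.0, p=0.5)
io_large = rent_io(1024, c=2.0, p=0.5)
```

The exponent sets the extremes: p = 0 gives constant IO regardless of page size, while p = 1 makes IO grow linearly with page area.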

