					                              Parallelizing Applications into Silicon
                       Jonathan Babb, Martin Rinard, Csaba Andras Moritz, Walter Lee,
                           Matthew Frank, Rajeev Barua, and Saman Amarasinghe
                                      Laboratory for Computer Science
                                   Massachusetts Institute of Technology
                                           Cambridge, MA 02139

Abstract

The next decade of computing will be dominated by embedded systems, information appliances and application-specific computers. In order to build these systems, designers will need high-level compilation and CAD tools that generate architectures that effectively meet the needs of each application. In this paper we present a novel compilation system that allows sequential programs, written in C or FORTRAN, to be compiled directly into custom silicon or reconfigurable architectures. This capability is also interesting because trends in computer architecture are moving towards more reconfigurable hardware-like substrates, such as FPGA-based systems. Our system works by successfully combining two resource-efficient computing disciplines: Small Memories and Virtual Wires.
    For a given application, the compiler first analyzes the memory access patterns of pointers and arrays in the program and constructs a partitioned memory system made up of many small memories. The computation is implemented by active computing elements that are spatially distributed within the memory array. A space-time scheduler assigns instructions to the computing elements in a way that maximizes locality and minimizes physical communication distance. It also generates an efficient static schedule for the interconnect. Finally, specialized hardware for the resulting schedule of memory accesses, wires, and computation is generated as a multi-process state machine in synthesizable Verilog.
    With this system, implemented as a set of SUIF compiler passes, we have successfully compiled programs into hardware and achieved specialization performance enhancements of up to an order of magnitude versus a single general-purpose processor. We also achieve additional parallelization speedups similar to those obtainable using a tightly-interconnected multiprocessor.

1 Introduction

When the performance, cost, or power requirements of an application cannot be met with a commercial off-the-shelf processor, engineers typically turn to custom and/or reconfigurable hardware. Performance-oriented hardware projects have generated excellent results: IBM's Deep Blue chess computer recently defeated Garry Kasparov, the World Chess Champion; the Electronic Frontier Foundation's DES cracking machine recently cracked the encryption standard of our banking systems in less than a week. Low-power and low-cost embedded applications in cars and hand-held devices have indirectly become an everyday part of human existence. Similarly, information appliances promise to be one of the dominating applications of the next decade.
    All of these hardware systems perform well, in part, because they exploit large amounts of concurrency and specialization. Currently, hardware architects exploit these aspects by hand. But as more and more complex applications are mapped into hardware, the difficulty of exploiting concurrency and specialization by hand makes it more and more difficult to design, develop and debug hardware. This paper demonstrates a parallelizing compiler that automatically maps applications directly into a hardware implementation. We expect that the combination of such high-level design tools with the added capability of reconfigurable hardware substrates (e.g. FPGA-based systems) will enable hardware designers to develop application-specific systems that can achieve unprecedented price/performance and energy efficiency. This paper explains how such a high-level compilation system works.
    For this system to generate efficient computing structures, we need to make two fundamental attitude changes: 1) a shift of focus from computation alone to also including memory and wires, and 2) a shift of focus from a low-level, hardware-centric design process to a high-level, software-based development process.
    The first shift is driven by the key performance trend in VLSI: the performance of basic arithmetic units, such as adders and multipliers, is increasing much more rapidly than the performance of memories and wires. Wire lengths affect the performance in that the clock cycle time must be long enough for signals to propagate the length of the longest wire. As designers exploit feature size decreases to put more functionality on a chip, the wire lengths become the limiting factor in determining cycle time. It is therefore crucial to deliver designs sensitive to wire length. We therefore choose to organize the compiler to optimize the memories and wires, as well as distributing the computation, in order to achieve good overall performance.
    The second shift is driven by both VLSI and the complexity of applications we can now implement in hardware. In the past decade, hardware design languages such as Verilog and VHDL enabled a dramatic increase in the sophistication of the circuits that designers were able to build. But these languages still require designers to specify the low-level hardware components and schedule the operations in the hardware on a cycle-by-cycle basis. We believe that these
languages present such low-level abstractions that they are no longer adequate for the large, sophisticated systems that are now possible to implement in hardware. We believe that the next revolution in hardware design will come from compilers that are capable of implementing programs written in a high-level language, such as C, directly into hardware.
    While these shifts may at first appear to require a radically different approach, our results indicate that the solution lies in the extension of known techniques from parallel computing and parallelizing compilers. Specifically, we believe that performance gains will be made possible by splitting large, relatively slow memories into arrays of small, fast memories, and by statically scheduling the interconnection between these memories. In this paper computation is implemented by custom logic interspersed among these memories. However, our approach for managing memory and wires is applicable whether the interspersed computation is implemented in reconfigurable hardware, configurable arithmetic arrays, or even small microprocessors.
    The direct application of the research presented in this paper will allow programmers to design complex distributed-memory hardware which is correct-by-construction. That is, if the applications can be successfully executed and tested on a sequential processor, our compiler will generate functionally correct hardware.
    In the remainder of this paper we will show, step-by-step, the basic compiler transformations needed to generate a distributed-memory hardware design from a sequential program. This design will contain hardware and memory structures carefully matched to the program requirements. At the end, we will present initial favorable results from a working compiler which automatically implements the set of transformations we describe.
    The organization is as follows. Section 2 motivates our compilation strategy. Section 3 overviews the compilation process and introduces a running example subsequently used for illustration. Section 4 then outlines two analyses we use for decomposing the application data into multiple clusters of small memories. Next, Section 5 describes how application instructions are assigned to memory clusters and how a space-time scheduler creates virtual interconnections between clusters. Section 6 shows how we generate hardware from the resulting schedule. Section 7 then presents both specialization and parallelization results for several benchmark applications. Finally, Section 8 describes related work in the field, and Section 9 makes concluding remarks.

2 Motivation

When compiling applications to hardware, we consider three basic primitives: memory, wires, and logic. The following sections describe and motivate our compilation approach for each primitive.

2.1 Small Memories

Our efficient memory architecture contains many small memories, with the computation distributed among the memories to maximize parallelism while minimizing the physical communication distance. This organization has significant benefits when compared with a standard architecture, which segregates the memory from the computation. First, a system of small memories has shorter effective memory access latencies. A small memory can produce a value quicker than a large memory can; once the value is produced, it is closer to the point where it will be consumed. Multiple small memories also have a higher aggregate memory bandwidth than a single large memory, and each reference takes less power because the wire capacitance in small memories is less than the capacitance in a large memory.
    Compiler technology is an important element in exploiting the potential of small memories. If the compiler can statically determine the memory location that will satisfy each data reference, it can then generate more efficient computation structures. Also, the compiler can leverage the correspondence between computation and memory to place the computation close to the memory that it accesses. When possible, static disambiguation also allows the compiler to statically schedule the memory and communication wires, eliminating the need to implement an expensive dynamic interconnect capable of delivering data from any memory to any piece of computation. Finally, static disambiguation allows the compiler to simplify address calculations. The bits in the address that would otherwise select the memory bank disappear, replaced by directly wired connections.
    In our hardware implementation of small memories, we cluster one or more memories together into directly-connected tiles arranged in a space-efficient interconnect. Roughly speaking, a tile is a hardware region within which wire delay is less than one clock cycle. For simplicity, we also choose to correlate tiles with sequencing control, with one sequencer per tile.
    Besides memory banks, each tile contains custom logic as well as the ability to schedule communication between neighbors. Unlike traditional memory banks, which serve as sub-units of a centralized memory system, the memory banks in each tile function autonomously and are directly addressable by the custom logic, without going through a layer of arbitration logic. This organization enables memory ports which scale with the hardware size. While we will focus on custom hardware generation throughout this paper, this approach is generally applicable to reconfigurable systems as well.

2.2 Virtual Wires

Our architectures are designed specifically to keep all wire lengths within a fixed small size. This mechanism allows the compiler to tightly control the constraints that wire lengths place on the clock cycle time. To maximize performance, the compiler uses a spatially-aware partitioner that maps the data and computation onto specific physical locations within the memory array. The compiler heuristically minimizes the wire length between communicating components.
    Of course, applications also require communication between remote regions of the chip. The architecture implements such communications by implementing a long "virtual wire" out of multiple short physical wires. Conceptually, a virtual wire is a pipelined connection between the endpoints of the wire. This connection takes several clock cycles to traverse, but allows pipelined communication by occupying only one short physical wire in each clock cycle.
    Virtual wires are a key concept for a compiler that optimizes wire performance. Our compiler generates architectures that connect the small memories using virtual wires; short physical wires are statically scheduled for maximum efficiency and minimum hardware interconnect overhead.
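The pipelined behaviour of a virtual wire described in Section 2.2 can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware the compiler emits: the names `virtual_wire`, `vw_init`, `vw_clock`, `VW_MAX_HOPS` and `VW_EMPTY` are hypothetical, payloads are assumed to be non-negative integers, and each array slot stands for the register at the end of one short physical wire.

```c
#include <assert.h>

#define VW_MAX_HOPS 16   /* hypothetical upper bound on segments per wire */
#define VW_EMPTY   (-1)  /* hypothetical marker: no value on this segment */

/* A virtual wire modeled as a chain of short, registered wire segments. */
typedef struct {
    int hops;                /* number of short physical segments */
    int stage[VW_MAX_HOPS];  /* value latched at the end of each segment */
} virtual_wire;

/* Start with every segment empty. */
void vw_init(virtual_wire *w, int hops) {
    assert(hops >= 1 && hops <= VW_MAX_HOPS);
    w->hops = hops;
    for (int i = 0; i < hops; i++) {
        w->stage[i] = VW_EMPTY;
    }
}

/* Advance one clock cycle: each in-flight value moves exactly one segment.
 * `in` enters the first segment (VW_EMPTY if nothing is sent this cycle);
 * the return value is whatever leaves the last segment. */
int vw_clock(virtual_wire *w, int in) {
    int out = w->stage[w->hops - 1];
    for (int i = w->hops - 1; i > 0; i--) {
        w->stage[i] = w->stage[i - 1];
    }
    w->stage[0] = in;
    return out;
}
```

With a three-segment wire, a value passed to `vw_clock` emerges on the fourth call, and a new value can be injected on every call in between, which is the latency-versus-throughput trade the text describes.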
    C or Fortran Program
      -> Traditional Compiler Optimizations
      -> Small Memory Partitioning
      -> Virtual Wires Scheduling
      -> Custom Logic Generation
      -> Traditional CAD Optimizations

Figure 1: Compiler Flow

2.3 Custom Logic

To maximize the efficiency of the final circuit, our compiler implements the computation using custom logic. This logic consists of a datapath and finite state control. Because this logic is customized for the specific computation at hand, it can be smaller, consume less power, and execute faster than more general-purpose implementations. Examples of specific optimizations include arithmetic units optimized for the specific operations in the program and reductions in datapath widths for data items with small bitwidths. The use of a small finite state control eliminates the area, latency, and power consumption of general-purpose hardware designed to execute a stream of instructions.
    Conceptually, the compiler generates hardware resources such as adders for each instruction in the program. In practice, the compiler reuses resources across clock cycles: if two instructions require the same set of resources but execute in different cycles, the compiler uses the same resource for both instructions. This approach preserves the benefits of specialized hardware while generating efficient circuits.
    While generating custom logic provides the maximum performance, our approach is also effective for other implementation technologies. Specifically, the compiler could also map the computation effectively onto a platform with reconfigurable logic or programmable ALUs instead of custom logic.

3 Compilation Overview

Our compilation system is responsible for translating an application program into a parallel architecture specialized for that application. To make this translation feasible, our compilation system incorporates both the latest code optimization and parallelization techniques as well as state-of-the-art hardware synthesis technology. Figure 1 shows the overall compiler flow. After reading in the program and performing traditional compiler optimizations, the compiler performs three major sets of transformations. Each set of transformations handles one class of primitives: memory, wires, or logic. Following these transformations, traditional computer-aided-design (CAD) optimization can be applied to generate the final distributed-memory hardware.
    Phase ordering is determined by the relative importance of each primitive in determining overall application performance. First, we analyze and partition program data into memories, and assign computation to those memories. We then schedule communication onto wires. Finally, we generate custom logic to perform all the computation. The following sections describe each component in turn.
    Figure 2 introduces a running example which illustrates the steps the smart memory compiler and synthesizer perform. Figure 2(a) shows the initial code and data fed to the compiler. The code is a simple for loop containing affine references to two arrays, A and B. Both data arrays are initially mapped to the same monolithic memory bank. Subsequent sections will discuss how this program is transformed by smart memory compilation, scheduling and hardware generation.

4 Small Memory Formation

The first phase of our compilation system is small memory formation. This phase consists of two main steps. First, the total memory required per application is decomposed into smaller memories based on the application's data access pattern. Second, the computation is assigned to the appropriate memory partitions.

4.1 Memory decomposition

The goal of memory decomposition is to partition the program data across many small memories, yet enforce the attribute that each program data reference can refer to only a single memory bank. Such decomposition enables high aggregate memory bandwidth, keeps the access logic simple, and enables locality between a memory and the computations which access it.
    Our memory decomposition problem is similar to that for Raw architectures [20]. The Raw compiler uses static promotion techniques to divide the program data across the tiles in such a way that the location of the data for each access is known at compile time [2, 3]. Our memory compiler leverages the same techniques. At the coarse level, different objects may be allocated to different memories through equivalence class unification. At the fine level, each array is potentially distributed across several memories through the aid of modulo unrolling. We overview each technique below.

Equivalence Class Unification  Equivalence class unification [2] is a static promotion technique which uses pointer analysis to collect groups of objects, each of which has to be mapped to a single memory. The compiler uses SPAN [15], a sophisticated flow-sensitive, context-sensitive interprocedural pointer analysis package. Pointer analysis is used to provide location set lists for every pointer in the program, where every location set corresponds to an abstract data object in the program. A pointer's location set list
(a) One memory; recovered labels: A[], Code, ...

(b) Memories: A[]  B[]

        for(i=0;i<100;i++)
          A[i]=A[i]*B[i+1]

(c) Memories: A0[] B0[] | A1[] B1[] | A2[] B2[] | A3[] B3[]

        for(i=0;i<100;i+=4) {
          A[i]=A[i]*B[i+1]
          A[i+1]=A[i+1]*B[i+2]
          A[i+2]=A[i+2]*B[i+3]
          A[i+3]=A[i+3]*B[i+4]
        }

(d) Memories: A0[] B0[] | A1[] B1[] | A2[] B2[] | A3[] B3[]

        i'=0;
        for(i=0;i<100;i+=4) {
          A0[i']=A0[i']*B1[i']
          A1[i']=A1[i']*B2[i']
          A2[i']=A2[i']*B3[i']
          A3[i']=A3[i']*B0[i'+1]
          i' = i' + 1
        }

(e) Tile A0[] B0[]:  tmp3=B0[i'+1]   A0[i']=A0[i']*tmp0
    Tile A1[] B1[]:  tmp0=B1[i']     A1[i']=A1[i']*tmp1
    Tile A2[] B2[]:  tmp1=B2[i']     A2[i']=A2[i']*tmp2
    Tile A3[] B3[]:  tmp2=B3[i']     A3[i']=A3[i']*tmp3

(f) Step  Tile A0[] B0[]   Tile A1[] B1[]    Tile A2[] B2[]    Tile A3[] B3[]
    1:    t=i'+1           tmp0=ldB1(i')     tmp1=ldB2(i')     tmp2=ldB3(i')
          tmp3=ldB0(t)     t2=ldA1(i')       t2=ldA2(i')       t2=ldA3(i')
          t2=ldA0(i')
    2:    send(tmp3)       send(tmp0)        send(tmp1)        send(tmp2)
    3:    tmp0=rcv()       tmp1=rcv()        tmp2=rcv()        tmp3=rcv()
    4:    t3=t2*tmp0       t3=t2*tmp1        t3=t2*tmp2        t3=t2*tmp3
    5:    stA0(i')         stA1(i')          stA2(i')          stA3(i')

(g)     switch(pc)
        {
          case 1:
           tmp3=ldB0(t)
          case 2:
           pc=3
          case 3:
           pc=4
        }

(h) Block diagram; recovered labels: Control FSM, Static handshakes, Logic, State, A[], B[], i', 1, +, *, ...

Figure 2: Example. (a) initial program (b) after equivalence class specialization (c) after modulo unrolling step 1 (d) after modulo unrolling step 2 (e) after migration of computation to memories (f) after scheduling of interconnect between memories (g) after state machine generation (h) resulting hardware architecture.
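The step from panel (b) to panel (d) of Figure 2 can be checked with a small program. The sketch below is an illustration under assumptions, not the compiler's generated code: the function names, the initial array values, and the padding of both arrays to `LEN` elements (so that B[100] has a home in bank 0) are all hypothetical. Element k of an array is assumed to live in bank k mod 4 at local index k div 4, which is the low-order interleaving the figure depicts.

```c
#include <assert.h>

#define N     100          /* loop trip count, as in Figure 2 */
#define BANKS 4            /* number of small memories per array */
#define LEN   (N + BANKS)  /* assumption: arrays padded so B[N] exists */
#define LOCAL (LEN / BANKS)

/* Figure 2(b): the original loop over monolithic arrays. */
void loop_original(int *A, const int *B) {
    for (int i = 0; i < N; i++) {
        A[i] = A[i] * B[i + 1];
    }
}

/* Low-order interleaving: element k goes to bank k % BANKS, slot k / BANKS. */
void interleave(const int *src, int bank[BANKS][LOCAL]) {
    for (int k = 0; k < LEN; k++) {
        bank[k % BANKS][k / BANKS] = src[k];
    }
}

/* Figure 2(d): unrolled by four, so each reference names exactly one bank. */
void loop_banked(int A[BANKS][LOCAL], int B[BANKS][LOCAL]) {
    int ip = 0;  /* i' in the figure */
    for (int i = 0; i < N; i += 4) {
        A[0][ip] = A[0][ip] * B[1][ip];
        A[1][ip] = A[1][ip] * B[2][ip];
        A[2][ip] = A[2][ip] * B[3][ip];
        A[3][ip] = A[3][ip] * B[0][ip + 1];
        ip = ip + 1;
    }
}

/* Runs both versions on the same arbitrary data; returns 1 if they agree. */
int modulo_unrolling_agrees(void) {
    int A[LEN], B[LEN];
    int Ab[BANKS][LOCAL], Bb[BANKS][LOCAL];
    for (int k = 0; k < LEN; k++) {
        A[k] = k + 1;
        B[k] = 2 * k + 3;
    }
    interleave(A, Ab);
    interleave(B, Bb);
    loop_original(A, B);
    loop_banked(Ab, Bb);
    for (int k = 0; k < N; k++) {
        if (Ab[k % BANKS][k / BANKS] != A[k]) {
            return 0;
        }
    }
    return 1;
}
```

Every subscript on the first dimension in `loop_banked` is a constant, which is the point of the transformation: the bank for each reference is fixed at compile time, so no bank-select logic is needed at run time.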
is a list of abstract data objects to which it can refer. We use this information to derive alias equivalence classes, which are groups of pointers related through their location set lists. Pointers in the same alias equivalence class can potentially alias to the same object, while pointers in different equivalence classes can never reference the same object.
    Once the alias equivalence classes are determined, equivalence class unification places all objects for an alias class onto a single memory. This placement ensures that all pointers to that alias class refer to one memory. By mapping objects for every alias equivalence class in such a manner, all memory references can be constrained to addressing a single memory bank. By mapping different alias equivalence classes to different banks, memory parallelism can be attained.
    Figure 2(b) shows the results of equivalence class unification on the initial code in Figure 2(a). Since no static reference in the program can address both A[] and B[], pointer analysis determines that A[] and B[] are in different alias equivalence classes. This analysis allows the two arrays to be mapped to different memories, while ensuring that each memory reference only addresses a single memory.

Modulo Unrolling  The major limitation of equivalence class unification is that arrays are treated as single objects that belong to a single alias equivalence class. Mapping an entire array to a single memory sequentializes accesses to that array and destroys the parallelism found in many loops. Therefore, we use a more advanced strategy called modulo unrolling [3] to allow arrays to be distributed across different memories.
    Modulo unrolling applies to array accesses whose indices are affine functions of enclosing loop induction variables. First, arrays are partitioned into smaller memories through low-order interleaving. In this scheme, consecutive elements of the array are interleaved in a round-robin manner across memory banks on successive tiles. Next, modulo unrolling shows that it is always possible to unroll loops by certain factors such that in the resulting code all affine accesses go to only one of the smaller memories. In certain cases, additional transformations may be required. One feature of modulo unrolling is that it does not affect the disambiguation of accesses to any other equivalence class. Details on determining the symbolic derivations of the minimum unroll factors required, and the code generation techniques needed, are given in [3].
    Figure 2(c) shows the result of the low-order interleaving and unrolling phases of modulo unrolling on the code in Figure 2(b) when the number of desired memories is four. Low-order interleaving splits each array into four sub-arrays whose sizes are a quarter of the original. Modulo unrolling uses symbolic formulas to predict that the unroll factor required for static disambiguation is four. Now we can see that each reference always goes to one tile. Specifically, the A[i], A[i+1], A[i+2] and A[i+3] references access sub-arrays 0, 1, 2 and 3. The B[i+1], B[i+2], B[i+3] and B[i+4] references access sub-arrays 1, 2, 3 and 0. Figure 2(d) shows the code after the code generation required by modulo unrolling.

4.2 Computation Assignment

The previous memory partitioning phase has decomposed program data structures into separate memories, with groups of memories clustered together into tiles. We next locate computation close to the most appropriate memory. The load and store instructions that access a memory are always assigned to the tile containing that memory. The remaining computation can be assigned to any tile; its actual assignment is selected based on two factors: minimizing communication between tiles and minimizing the latency of the critical path.
    Our algorithm for assigning computation to memory tiles directly leverages the non-uniform resource architecture (NURA) algorithms developed for the Raw Machine [11]. In this algorithm, instruction-level parallelism within a basic block is orchestrated across multiple tiles. This work is in turn an extension of the original MIT Virtual Wires Project [1] work, in which only circuit-level, or combinational, parallelism within a clock cycle was orchestrated across multiple tiles.
    The compiler performs computation assignment in three steps: clustering, merging, and placement. Clustering groups together instructions, such that instructions within a cluster have no parallelism that can profitably be exploited across tiles given the cost of communication. Merging merges the clusters to reduce the number of clusters down to the number of tiles. Placement performs a bijective mapping from the merged clusters to tiles, taking into account the topology of the interconnect.
    Figure 2(e) shows the results of computation assignment in our example. As we described in the previous section, each tile contains two memories, one for each array. Computation is assigned directly into the small memory array. In the figure, each tile has been assigned a subset of the original instructions.

5 Virtual Wires Scheduling

While the previous phase of the compiler produces the small memories, this next phase of the compiler is responsible for managing inter-tile communication. We term this phase Virtual Wires Scheduling.
    Although moving program instructions into the memory system eliminates long memory communication latencies in comparison to monolithic processor design, new inter-tile communication paths will now be required for the program to execute correctly. Namely, there are data dependencies between the instructions assigned to each tile.
    In Figure 2(e) the data dependencies between tiles are explicitly shown as temporary variables introduced in the code. For example, tmp0 needs to be communicated between the upper-left tile and the upper-right tile. Besides data dependencies, there are also control dependencies introduced. However, this control flow between basic blocks is explicitly orchestrated by the compiler through asynchronous global branching [11], an asynchronous mechanism for implementing branching across all the tiles using static communication
The array references have been converted to references to the          and individual branches on each tile. Thus control depen-
partitioned sub-arrays, with appropriate updated indices.              dencies will be turned into data dependencies as well.
   1 More formally, alias equivalence classes are the connected com-       In the same manner as the original virtual wires schedul-
ponents of a graph whose nodes are pointers and whose edges are        ing algorithm for logic emulation 1], our static scheduler
between nodes whose location set lists have at least one common        will multiplex logical communications such as tmp0 with
element.                                                               other communications between the same source and desti-
nation, sharing the same physical channel. The multiplexing       called pc, which serves a function similar to the program
of this physical channel will be directly synthesized in the      counter in a conventional processor. Each state of the FSM
target hardware technology by a following hardware gen-           corresponds to the work that will be done in a single cycle
eration phase. Note that communication required between           in the resulting hardware. The resulting FSM will generate
non-neighbor tiles must be routed through an intermediate         both the proper control to calculate the next state, including
tile.                                                             any calculation to generate branch destinations, and also the
    In contrast to the Raw space-time scheduler, which tar-       proper control signals to the virtual functional units that will
gets a simple processor and static switch in each tile, our       be generated in the next phase.
compilation target is more exible, allowing multiple in-              Each state of the FSM contains a set of possibly depen-
structions to be executed in parallel, with the only constraint   dent operations. The FSM generator turns the set of oper-
being that each memory bank within each tile can be ac-           ations in each state into combinational logic. Any depen-
cessed at most once per cycle. This additional exibility is       dences between operations in the same state are connected
possible because we are targeting custom hardware instead         by specifying wires between the dependent functions. For
of a xed processing unit.                                         any data values that are live between states (produced in
    In constrast to compilation for a VLIW, which also in-        one state and consumed in a di erent state), the FSM gen-
volves statically scheduling multiple instructions, we are not    erator allocates a register to hold that value.
constrained by an initial set of function units, register les,
and busses. Our scheduler minimizes the execution latency         6.2 Register Transfer Level
of the program by scheduling to virtual function units as
dictated by the application.                                      The nal phase of compilation involves generating a Register
    As a nal note, during scheduling we have essentially re-      Transfer Level (RTL) model from the nite state machine.
laxed the resource allocation problem for both registers and      The resulting RTL speci cation for each tile is technology-
function units, while retaining the constraints for memory        independent and application-speci c. Each tile contains a
accesses and inter-tile wires. This strategy is similar to vir-   datapath of inter-connected functional units, several memo-
tual register allocation strategies for sequential processors,    ries, and control elements. Figure 2(h) shows a logic diagram
except that we relax all computation resource constraints.        for the circuit that would be generated from the FSM. The
Because we are parallelizing sequential code and thus pre-        datapath is synthesized out of registers (represented by solid
dominately limited by memory and communication costs,             black bars in the gure), higher order functional units (such
this relaxation is feasible. Resource allocation is delayed       as adders and multipliers) and multiplexers. A controller is
until the later custom logic generation phase.                    speci ed in RTL to control the actions of the datapath and
    Figure 2(f) shows the results of communication schedul-       the inter-processor I/O.
ing. The send instruction in each tile represents multiplexed         The next step synthesizes logic for the address, data,
communications across the inter-tile channels. The following      and enable signals for each memory reference in the FSM.
custom logic generation phase will convert these instructions     For example the address signals are shown as arrows going
to pipeline registers and multiplexers controlled by bits in      into the A ] and B ] memories in Figure 2(h). Whenever a
the state machine in each tile.                                   reference has been statically mapped to a speci c memory,
                                                                  the address calculation is specialized to eliminate the bits
6 Custom Logic Generation                                         that select that memory as the target, as described further
                                                                  in Section 6.3. This specialization simpli es the address
By the time the custom logic generation phase executes, the       calculation hardware, so it consumes less power and less
previous phases have mapped the data onto the memories,           area.
extracted the concurrency, and generated a parallel thread            The compiler-generated architecture must include com-
of instructions for each tile. They have also scheduled the ex-   munication channels between each FSM for inter-tile com-
ecution: each memory access, communication operation and          munication. The communication synthesis step creates these
instruction has been mapped to a speci c execution cycle.         channels from the inter-tile communication dependencies
Finally, the space-time scheduler has mapped the compu-           and synthesizes the appropriate control logic into the FSM.
tation and communication to a speci c location within the         This logic is shown as the multiplexors on the bottom and
memory array.                                                     right hand side of Figure 2(h). In addition, the control
    The custom logic generation phase is responsible for gen-     logic includes static handshaking signals between each pair
erating a hardware implementation of the communication            of communicating FSMs as shown in the upper left corner
and computation schedules. First the system generates a           of Figure 2(h).
  nite state machine that produces the cycle-by-cycle con-            Finally, the compiler outputs technology independent
trols for all the functional units, registers and memories that   Verilog RTL. This step is a modi cation of the Stanford
make up the tile. Then the nite state machine is synthe-          VeriSUIF Verilog output passes. SUIF data structures are
sized into a technology independent register-transfer-level         rst converted into Verilog data structures and then written
speci cation of the nal circuit. We discuss each of these         as Verilog RTL to an output le. The output fully describes
activities below.                                                 the cycle-level behavior of the circuit, including all state
                                                                  transitions and actions to be take every cycle.
                                                                      We have described the steps that are required to compile
6.1 Finite State Machine                                          a program into hardware. In the next section we describe
As shown in Figure 2(g), the nite state machine (FSM)             some of the additional optimization opportunities that are
generator takes the thread of scheduled instructions that has     available when producing hardware.
been generated for each tile, and produces a state machine
for each thread. This state machine contains a state register,
Figure 3: Example of address specialization. The address calculation for the
array reference A[i][j] can be entirely eliminated. Instead, the values of i
and j are placed on the corresponding address wires.

6.3 Other Custom Logic

Along with the parallelism that is available in custom logic, an additional
advantage is that when the bit-level operations are exposed to the compiler,
the compiler can perform additional optimizations beyond those found by
traditional software optimizers. These include the possibility of folding
constants directly into the circuitry and using other compile-time knowledge
to completely eliminate gates.

Address Calculation. The primary example of hardware specialization involves
simplification of address calculation. As an example, consider what would
happen if the code in Figure 2 had referenced a two-dimensional array, like
A[i][j]. The naive code for this address calculation would be
A + (i * Y_DIM + j) * 4. In the case that Y_DIM was a power of 2, most
compilers would strength-reduce the multiplication operations to shifts to
produce A + (((i << LOG2_Y_DIM) + j) << 2).
    In most software systems this is the minimal calculation to reference a
two-dimensional array. In hardware systems, on the other hand, we can
specialize even further. First, since we perform equivalence class
unification, as described in Section 4.1, the A array will live in its own
private memory. Then every reference will lie at an offset from location 0 of
the memory for array A, so the first addition operation can be eliminated.
Furthermore, the memory for A can be built to be of the same width as the data
word, so that the final multiplication by 4 can be completely eliminated,
leaving ((i << LOG2_Y_DIM) + j).
    While in a conventional RISC processor logical operations require the same
number of cycles as an addition operation, in hardware a logical or operation
is both faster and smaller than an adder. Our compiler performs bit-width
optimization to transform these add operations into or operations. The
compiler calculates the maximum bit-width of j, and finds that since
j < Y_DIM, the value of j never consumes more than LOG2_Y_DIM bits, so the
bits of j and the bits of (i << LOG2_Y_DIM) never overlap. The addition in
this expression can then be optimized away to produce
((i << LOG2_Y_DIM) | j). Finally, since oring anything with 0 produces that
value, the or gates can be eliminated and replaced with wires. The result is
that the two values, i and j, can be concatenated to form the final address.
This final transformation is shown in Figure 3.

Floating Point. For each floating point operation in the program the compiler
replaces the operation with a set of simpler integer and bit-level
micro-operations using a technique similar to [4]. These resulting
micro-operations can then be optimized and scheduled by the datapath
generator. Because the constituent exponent and mantissa micro-operations are
exposed to the compiler, the compiler can exploit the parallelism both inside
individual floating point operations and also between different floating point
operations.
    As an example of the specialization that can occur, consider dividing a
floating point number by a constant that is a power of two, e.g. y = x/2.0.
Executing this division operation would take many cycles on a traditional
floating-point execution unit.
    However, by exposing the micro-operations required to do the
floating-point computation in terms of operations on the exponent and mantissa
and generating specialized hardware for the computation, the cycle count of
this operation can be reduced by an order of magnitude. The required
micro-operation to perform such a division is to subtract 1 from the exponent
of x. The time to execute the floating-point division operation reduces to the
latency of a single fixed-point subtraction.

7 Experimental Results

We have implemented a compiler that is capable of accepting a sequential
application, written in either C or Fortran, and automatically generating a
specialized architecture for that application. The compiler contains all of
the functionality described in the previous sections of this paper.
    This section presents experimental results for an initial set of
applications that we have compiled to hardware. For each application, our
compiler produces an architecture description in RTL Verilog. We further
synthesize this architecture to logic gates with a commercial CAD tool. The
current target technology in which we report gate counts is the reference
technology provided with the IKOS VirtuaLogic System [9]. This system is a
logic emulator that can be used to validate designs of up to one million
gates, as well as additional custom memories.
    Our current timing model, enforced during scheduling, limits the amount of
computation that may be assigned to any one clock cycle to be less than the
equivalent delay of a 32-bit addition or the latency of a small memory access.
The exact clock period of each circuit will be technology-dependent, but due
to this constraint will be similar to that of a simple processor implemented
in the same technology. We report execution times as total cycle counts for
each application.
    Table 1 gives the characteristics of the benchmarks used for the
evaluation. These applications include one traditional scientific program and
three multimedia applications. The input programs are all sequential. Jacobi
is an integer version of an iterative relaxation algorithm. Adpcm-encode is
the compression part of the compression/decompression pair in Adpcm.
MPEG-kernel is the portion of MPEG which takes up 70% of the total run-time.
SHA implements a secure hash algorithm.
    For reference, we also compare our results with previously published
simulation results from a parallel Raw machine with one to 16 processors.
Because our compiler is based on the same frontend infrastructure for
parallelization as the Raw compiler, this comparison allows us to isolate the
effect of customizing the tile for an application.
Benchmark     Type        Source       Lines    Seq. RT   Primary     Description
                                       of code  (cycles)  array size
Jacobi        Dense Mat.  Rawbench        59    2.38M     32x32       Jacobi Relaxation
Adpcm-encode  Multimedia  Mediabench     133    7.1M      1000        Speech compression
MPEG-kernel   Multimedia  UC Berkeley     86    14.6K     32x32       MPEG-1 Video Software Encoder kernel
SHA           Multimedia  Perl Oasis     608    1.0M      512x16      Secure Hash Algorithm

Table 1: Benchmark characteristics. Column Seq. RT shows the run-time for the uniprocessor code generated by the Machsuif MIPS

Figure 4: Base comparison. (Bar chart comparing mips R2000 and customized for
Adpcm-encode, Jacobi, MPEG-kernel and SHA.)

Benchmark      Logic  Registers  Memory  Total
Jacobi          3830       5064   65536  74430
Adpcm-encode    3786       3704    3210  10700
MPEG-kernel     1833       2312    3584   7729
SHA            35100      12680   16384  64164

Table 2: Gate counts for one tile.

7.1 Base Comparison

Figure 4 presents execution times for each application in comparison with
execution time on a single MIPS R2000 processor (the basic component of one
Raw tile). Roughly speaking, customization results in an order of magnitude
fewer clock cycles (5 to 10 times fewer) required to perform the same
computation. As we expect the clock periods to be similar, this reduction will
translate directly into proportional performance increases for custom
hardware.
    Table 2 presents the gate counts for the single-tile compilation. We
assume that one gate is approximately equivalent to one byte of memory for
SRAM comparisons. Note that the logic and register gate counts for each
application except SHA are smaller than the size of a simple processor (about
20K gates).

7.2 Parallelized Comparison

For the parallelized case, we report results for two compilation strategies: a
hardwired system and a virtual wired system. The hardwired system allows
direct, dedicated wires between arbitrary memories and assumes that signals
can propagate the full length of each wire in one clock cycle. The total cycle
count is always smaller for the hardwired system because there are no cycles
dedicated to scheduling the wires. Additionally, the hardwired systems do not
have additional hardware synthesized for multiplexing the interconnect. The
hardwired system therefore provides a good comparison point that allows us to
isolate the costs of virtual wire scheduling. Bear in mind, however, that for
large circuits the hardwired system would contain long wires with large
propagation delays. These delays slow down the clock, degrading the overall
performance of the application. The scaling factors in future VLSI systems
will only exacerbate this phenomenon, making virtual wires scheduling a
necessity.
    Figure 5 presents the resulting speedups for Jacobi and MPEG, the two
applications that are parallelizable, as we increase the number of tiles. Note
that in both cases each tile already has multiple memory banks and takes
advantage of ILP even in the single-tile case. For both the hardwired and the
virtual wired case, performance continues to increase as we add more hardware.
The virtual wired case must pay the penalty of communication costs, but
nevertheless, absolute performance remains well above even a 16-processor Raw
machine. Raw achieves better scalability for Jacobi because it does not take
advantage of the initial instruction-level parallelism in the one-processor
case.
    Figure 6 reports the increase in hardware area, including memory, as the
number of tiles is increased. Note that for the virtual wires case, the
hardware area grows more rapidly because of the increasing amount of
communication logic that is required. In Figure 7, we show the hardware
composition, in percentages, of each architecture. Notice how the memory
starts out taking a fairly large percentage of the area, but as we decrease
the granularity of the memory system by adding more tiles, the system becomes
more balanced. As we totally dominate the memory area with additional
hardware, the speedup curves tail off.

Implications for Configurable Hardware

Note that while we have not directly considered reconfigurable hardware, such
as an FPGA-based system described in the related work, if a reconfigurable
system were to implement virtual wires and small memories directly in the
fabric, we believe similar performance gains might be achievable. However, we
should note that in reconfigurable logic-based architectures the potential
clock speeds will most likely not be on an equal basis and an appropriate
speed penalty for the custom logic will need to be taken into account.

8 Related Work

Silicon compilation has existed in one form or another since the early days of
designing chips. Even before schematic capture, designers wrote programs which
directly generated the geometry and layout for a particular design. In the
past twenty years various research has proposed to compile PASCAL, FORTRAN, C,
and Scheme into hardware. We must be careful to distinguish between compiling
a program into hardware, and describing hardware with a higher-level

Figure 5: Speedup for jacobi and mpeg. (Two panels, "Speedup scalability for
jacobi" and "Speedup scalability for mpeg"; horizontal axis Ntiles, 0 to 16;
series Custom-hard-wires and Custom-virtual-wires.)

Figure 6: Hardware Size. (Two panels, "Gate count for jacobi" and "Gate count
for mpeg"; horizontal axis Ntiles, 0 to 16; series Custom-virtual-wires and
Custom-hard-wires.)

Figure 7: Hardware composition. (Percentage of Logic, Registers and Memory for
1, 2, 4, 8 and 16 tiles, shown for Virtual wires and Hardwires, for Jacobi and
Mpeg.)

language - they are not the same. In our work, we compile sequential
programs that describe an algorithm in a manner easy for the programmer
to understand.

Recent work in RTL and behavioral synthesis [18] involves synthesizing
higher-level algorithms into hardware. Where possible, we leverage this
work in our compilation process. For example, we leave resource
allocation to be performed during the following synthesis pass. However,
to our knowledge this is the first system which can manage memory,
multiplex wires, and transform general high-level programming languages
directly into custom hardware.

We continue by discussing two additional areas of related work: parallel
architectures and reconfigurable architectures.

8.1 Parallel Architectures

Other researchers have parallelized some of the benchmarks in this
paper. Automatic parallelization has been demonstrated to work well for
dense matrix scientific codes [8]. In contrast to this work, our approach
to generating parallelism stems from an ability to exploit fine-grain
ILP, rather than the coarse-grain parallelism targeted by [8].
Multiprocessors are mostly restricted to such coarse-grain parallelism
because of their high costs of communication and synchronization.
Unfortunately, finding such coarse-grain parallelism often requires
whole-program analysis by the compiler, which works well only in
restricted domains. In a custom architecture, we can successfully exploit
ILP because of the register and wire-level latencies in hardware. Of
course, hardware can exploit coarse-grain parallelism as well.

Software distributed shared memory schemes on multiprocessors (DSMs)
[5, 16] are similar in spirit to our approach for managing memory. They
emulate in software the task of cache coherence, one which is
traditionally performed by complex hardware. In contrast, this work turns
sequential accesses from a single memory image into decentralized
accesses across multiple small memories. This technique enables the
parallelization of sequential programs into a high-bandwidth memory
array.

Both the Berkeley IRAM research project [10] and Stanford's new Smart
Memories Project [12] focus on building future-generation computing
systems that are more tightly coupled with memory. The IRAM approach is
to improve performance of the memory system by fitting more data on a
chip. They achieve this goal by using high-density dynamic RAM (DRAM)
memory instead of lower-density SRAM caches and treating the on-chip DRAM
memory as the main memory instead of a redundant copy. The Smart Memories
Project's stated purpose is to build a future-generation computing system
that provides efficiency, generality, and programmability in a single
system.

8.1.1 Reconfigurable Architectures

Reconfigurable computing systems, comprising arrays of interconnected
Field Programmable Gate Array (FPGA) devices, are common hardware targets
for application-specific computing. Splash [6] and PAM [19] are the first
substantial reconfigurable computing systems. As part of the Splash
project, a team led by Maya Gokhale ported data-parallel C [14] to the
Splash reconfigurable architecture. This effort was one of the first to
actually compile programs, rather than design hardware, for a
reconfigurable architecture. Designs reported to take months could be
written in a day. While data-parallel C extended the language to handle
bit-level operations and systolic communication, all control flow is
managed by the host. Hardware compilation was only concerned with basic
blocks of parallel instructions. This approach has been ported to
National Semiconductor's new processor/FPGA chip based on the CLAy
architecture [13].

Programmable Active Memories [19], designed at Compaq Paris Research Lab,
interface to a host processor via memory-mapped I/O. The programming
model is to treat the reconfigurable logic as a memory capable of
performing computation. The actual design of the configuration for each
PAM application was specified in a C-syntax hardware description
language.

In the PRISM project [17], functions derived from a subset of C are
compiled into an FPGA. The PRISM-I subset included if-then-else as well
as for loops of fixed count. The PRISM-II subset included variable-length
for loops, while, do-while, switch-case, break, and continue.

Other projects include compilation of Ruby [7], a language of functions
and relations, and compilation of vectorizable loops in Modula-2 for
reconfigurable architectures [21].

9 Conclusion

In this paper we have described and evaluated a novel compilation system
that allows sequential programs written in C and Fortran to be compiled
directly into application-specific hardware substrates.

Our approach extends known techniques from parallelizing compilers, such
as memory disambiguation, static scheduling and data partitioning. We
focus on memory and wires first, and then computation. We start by
partitioning the data structures in the program into an array of small,
fast memories. In our model, computation is performed by custom logic
computing elements interspersed among these memories. The compiler
leverages the correspondence between memory and computation to place the
computation close to the memory that it accesses, such that communication
costs are minimized. Similarly, a static schedule is generated for the
interconnect that optimizes wire utilization and minimizes interconnect
latency. Finally, the specialized hardware for smart memories, virtual
wires, and other custom logic is compiled to hardware in the form of a
multi-process state machine in synthesizable Verilog.

With this compilation system we have obtained specialization performance
improvements of up to an order of magnitude versus a single
general-purpose processor and additional parallelization speedups similar
to those obtainable using a tightly interconnected multiprocessor.

References

[1]  J. Babb, R. Tessier, M. Dahl, S. Hanono, D. Hoki, and A. Agarwal.
     Logic Emulation with Virtual Wires. IEEE Transactions on Computer
     Aided Design, 16(6):609-626, June 1997.

[2]  Rajeev Barua, Walter Lee, Saman Amarasinghe, and Anant Agarwal.
     Maps: A Compiler-Managed Memory System for Raw Machines. Technical
     report, M.I.T. LCS-TM-583, July 1998. Also
     http://www.cag.lcs.mit.edu/raw/.

[3]  Rajeev Barua, Walter Lee, Saman Amarasinghe, and Anant Agarwal.
     Memory Bank Disambiguation using Modulo Unrolling for Raw Machines.
     In Proceedings of the ACM/IEEE Fifth Int'l Conference on High
     Performance Computing (HIPC), Dec 1998. Also
     http://www.cag.lcs.mit.edu/raw/.

[4]  William J. Dally. Micro-optimization of floating-point operations.
     In Proceedings of the Third International Conference on
     Architectural Support for Programming Languages and Operating
     Systems, pages 283-289, Boston, Massachusetts, April 3-6, 1989.

[5]  Sandhya Dwarkadas, Alan L. Cox, and Willy Zwaenepoel. An integrated
     compile-time/run-time software distributed shared memory system. In
     Proceedings of the Seventh International Conference on Architectural
     Support for Programming Languages and Operating Systems, pages
     186-197, Cambridge, Massachusetts, October 1-5, 1996.

[6]  Maya Gokhale, William Holmes, Andrew Kopser, Sara Lucas, Ronald
     Minnich, Douglas Sweeney, and Daniel Lopresti. Building and using a
     highly parallel programmable logic array. Computer, 24(1), January
     1991.

[7]  Shaori Guo and Wayne Luk. Compiling Ruby into FPGAs. In Field
     Programmable Logic and Applications, August 1995.

[8]  M. W. Hall, J. M. Anderson, S. P. Amarasinghe, B. R. Murphy,
     S.-W. Liao, E. Bugnion, and M. S. Lam. Maximizing multiprocessor
     performance with the SUIF compiler. Computer, 29(12):84-89,
     December 1996.

[9]  IKOS Systems, Inc. VirtuaLogic Emulation System Documentation, 1996.
     Version 1.2.

[10] Christoforos E. Kozyrakis, Stylianos Perissakis, David Patterson,
     Thomas Anderson, Krste Asanovic, Neal Cardwell, Richard Fromm, Jason
     Golbus, Benjamin Gribstad, Kimberly Keeton, Randi Thomas, Noah
     Treuhaft, and Katherine Yelick. Scalable processors in the billion
     transistors era: IRAM. IEEE Computer, pages 75-78, September 1997.

[11] Walter Lee, Rajeev Barua, Matthew Frank, Devabhaktuni Srikrishna,
     Jonathan Babb, Vivek Sarkar, and Saman Amarasinghe. Space-Time
     Scheduling of Instruction-Level Parallelism on a Raw Machine. In
     Proceedings of the Eighth ACM Conference on Architectural Support
     for Programming Languages and Operating Systems, pages 46-57, San
     Jose, CA, October

[12] Mark Horowitz, personal communications. Stanford University Smart
     Memories Project. http://velox.stanford.edu/smart memories.

[13] Maya B. Gokhale, Janice M. Stone, Matthew Frank, Sarnoff
     Corporation. NAPA C: Compiling for a Hybrid RISC/FPGA Architecture.
     In FCCM98, Napa Valley, California, April 1998.

[14] Maya Gokhale and Brian Schott. Data-Parallel C on a Reconfigurable
     Logic Array. Journal of Supercomputing, September 1995.

[15] Radu Rugina and Martin Rinard. Span: A shape and pointer analysis
     package. Technical report, M.I.T. LCS-TM-581, June 1998.

[16] Daniel J. Scales, Kourosh Gharachorloo, and Chandramohan
     A. Thekkath. Shasta: A low overhead, software-only approach for
     supporting fine-grain shared memory. In Proceedings of the Seventh
     International Conference on Architectural Support for Programming
     Languages and Operating Systems, pages 174-185, Cambridge,
     Massachusetts, October 1-5, 1996.

[17] A. Smith, M. Wazlowski, L. Agarwal, T. Lee, E. Lam, P. Athans,
     H. Silverman, and S. Ghosh. PRISM II Compiler and Architecture. In
     Proceedings IEEE Workshop on FPGA-based Custom Computing Machines,
     pages 9-16, Napa, CA, April 1993. IEEE.

[18] Synopsys, Inc. Behavioral Compiler User Guide, V 1997.08, August
     1997.

[19] J.E. Vuillemin, P. Bertin, D. Roncin, M. Shand, H. H. Touati, and
     P. Boucard. Programmable Active Memories: Reconfigurable Systems
     Come of Age. IEEE Transactions on VLSI Systems, 4(1), March 1996.

[20] Elliot Waingold, Michael Taylor, Devabhaktuni Srikrishna, Vivek
     Sarkar, Walter Lee, Victor Lee, Jang Kim, Matthew Frank, Peter
     Finch, Rajeev Barua, Jonathan Babb, Saman Amarasinghe, and Anant
     Agarwal. Baring It All to Software: Raw Machines. IEEE Computer,
     30(9):86-93, September 1997. Also available as MIT-LCS-TR-709.

[21] M. Weinhardt. Compilation and Pipeline Synthesis for Reconfigurable
     Architectures - High Performance by Configware. In Reconfigurable
     Architecture Workshop, 1997.