

Parallelizing Applications into Silicon
Jonathan Babb, Martin Rinard, Csaba Andras Moritz, Walter Lee,
Matthew Frank, Rajeev Barua, and Saman Amarasinghe
Laboratory for Computer Science
Massachusetts Institute of Technology
Cambridge, MA 02139
jbabb@lcs.mit.edu
Abstract

The next decade of computing will be dominated by embedded systems, information appliances and application-specific computers. In order to build these systems, designers will need high-level compilation and CAD tools that generate architectures that effectively meet the needs of each application. In this paper we present a novel compilation system that allows sequential programs, written in C or FORTRAN, to be compiled directly into custom silicon or reconfigurable architectures. This capability is also interesting because trends in computer architecture are moving towards more reconfigurable hardware-like substrates, such as FPGA-based systems. Our system works by successfully combining two resource-efficient computing disciplines: Small Memories and Virtual Wires.

For a given application, the compiler first analyzes the memory access patterns of pointers and arrays in the program and constructs a partitioned memory system made up of many small memories. The computation is implemented by active computing elements that are spatially distributed within the memory array. A space-time scheduler assigns instructions to the computing elements in a way that maximizes locality and minimizes physical communication distance. It also generates an efficient static schedule for the interconnect. Finally, specialized hardware for the resulting schedule of memory accesses, wires, and computation is generated as a multi-process state machine in synthesizable Verilog.

With this system, implemented as a set of SUIF compiler passes, we have successfully compiled programs into hardware and achieved specialization performance enhancements of up to an order of magnitude versus a single general-purpose processor. We also achieve additional parallelization speedups similar to those obtainable using a tightly-interconnected multiprocessor.

1 Introduction

When the performance, cost, or power requirements of an application cannot be met with a commercial off-the-shelf processor, engineers typically turn to custom and/or reconfigurable hardware. Performance-oriented hardware projects have generated excellent results: IBM's Deep Blue chess computer recently defeated Garry Kasparov, the World Chess Champion; the Electronic Frontier Foundation's DES cracking machine recently cracked the encryption standard of our banking systems in less than a week. Low-power and low-cost embedded applications in cars and hand-held devices have indirectly become an everyday part of human existence. Similarly, information appliances promise to be one of the dominating applications of the next decade.

All of these hardware systems perform well, in part, because they exploit large amounts of concurrency and specialization. Currently, hardware architects exploit these aspects by hand. But as more and more complex applications are mapped into hardware, the difficulty of exploiting concurrency and specialization by hand makes it more and more difficult to design, develop and debug hardware. This paper demonstrates a parallelizing compiler that automatically maps applications directly into a hardware implementation. We expect that the combination of such high-level design tools with the added capability of reconfigurable hardware substrates (e.g. FPGA-based systems) will enable hardware designers to develop application-specific systems that can achieve unprecedented price/performance and energy efficiency. This paper explains how such a high-level compilation system works.

For this system to generate efficient computing structures, we need to make two fundamental attitude changes: 1) a shift of focus from computation alone to also including memory and wires, and 2) a shift of focus from a low-level, hardware-centric design process to a high-level, software-based development process.

The first shift is driven by the key performance trend in VLSI: the performance of basic arithmetic units, such as adders and multipliers, is increasing much more rapidly than the performance of memories and wires. Wire lengths affect the performance in that the clock cycle time must be long enough for signals to propagate the length of the longest wire. As designers exploit feature size decreases to put more functionality on a chip, the wire lengths become the limiting factor in determining cycle time. It is therefore crucial to deliver designs sensitive to wire length. We therefore choose to organize the compiler to optimize the memories and wires, as well as distributing the computation, in order to achieve good overall performance.

The second shift is driven by both VLSI and the complexity of applications we can now implement in hardware. In the past decade, hardware design languages such as Verilog and VHDL enabled a dramatic increase in the sophistication of the circuits that designers were able to build. But these languages still require designers to specify the low-level hardware components and schedule the operations in the hardware on a cycle-by-cycle basis. We believe that these
languages present such low-level abstractions that they are no longer adequate for the large, sophisticated systems that are now possible to implement in hardware. We believe that the next revolution in hardware design will come from compilers that are capable of implementing programs written in a high-level language, such as C, directly into hardware.

While these shifts may at first appear to require a radically different approach, our results indicate that the solution lies in the extension of known techniques from parallel computing and parallelizing compilers. Specifically, we believe that performance gains will be made possible by splitting large, relatively slow memories into arrays of small, fast memories, and by statically scheduling the interconnection between these memories. In this paper computation is implemented by custom logic interspersed among these memories. However, our approach for managing memory and wires is applicable whether the interspersed computation is implemented in reconfigurable hardware, configurable arithmetic arrays, or even small microprocessors.

The direct application of the research presented in this paper will allow programmers to design complex distributed-memory hardware which is correct-by-construction. That is, if the applications can be successfully executed and tested on a sequential processor, our compiler will generate functionally correct hardware.

In the remainder of this paper we will show, step-by-step, the basic compiler transformations needed to generate a distributed-memory hardware design from a sequential program. This design will contain hardware and memory structures carefully matched to the program requirements. At the end, we will present initial favorable results from a working compiler which automatically implements the set of transformations we describe.

The organization is as follows. Section 2 motivates our compilation strategy. Section 3 overviews the compilation process and introduces a running example subsequently used for illustration. Section 4 then outlines two analyses we use for decomposing the application data into multiple clusters of small memories. Next, Section 5 describes how application instructions are assigned to memory clusters and how a space-time scheduler creates virtual interconnections between clusters. Section 6 shows how we generate hardware from the resulting schedule. Section 7 then presents both specialization and parallelization results for several benchmark applications. Finally, Section 8 describes related work in the field, and Section 9 makes concluding remarks.

2 Motivation

When compiling applications to hardware, we consider three basic primitives: memory, wires, and logic. The following sections describe and motivate our compilation approach for each primitive.

2.1 Small Memories

Our efficient memory architecture contains many small memories, with the computation distributed among the memories to maximize parallelism while minimizing the physical communication distance. This organization has significant benefits when compared with a standard architecture, which segregates the memory from the computation. First, a system of small memories has shorter effective memory access latencies. A small memory can produce a value quicker than a large memory can; once the value is produced, it is closer to the point where it will be consumed. Multiple small memories also have a higher aggregate memory bandwidth than a single large memory, and each reference takes less power because the wire capacitance in small memories is less than the capacitance in a large memory.

Compiler technology is an important element in exploiting the potential of small memories. If the compiler can statically determine the memory location that will satisfy each data reference, it can then generate more efficient computation structures. Also, the compiler can leverage the correspondence between computation and memory to place the computation close to the memory that it accesses. When possible, static disambiguation also allows the compiler to statically schedule the memory and communication wires, eliminating the need to implement an expensive dynamic interconnect capable of delivering data from any memory to any piece of computation. Finally, static disambiguation allows the compiler to simplify address calculations. The bits in the address that would otherwise select the memory bank disappear, replaced by directly wired connections.

In our hardware implementation of small memories, we cluster one or more memories together into directly-connected tiles arranged in a space-efficient interconnect. Roughly speaking, a tile is a hardware region within which wire delay is less than one clock cycle. For simplicity, we also choose to correlate tiles with sequencing control, with one sequencer per tile.

Besides memory banks, each tile contains custom logic as well as the ability to schedule communication between neighbors. Unlike traditional memory banks, which serve as sub-units of a centralized memory system, the memory banks in each tile function autonomously and are directly addressable by the custom logic, without going through a layer of arbitration logic. This organization enables memory ports which scale with the hardware size. While we will focus on custom hardware generation throughout this paper, this approach is generally applicable to reconfigurable systems as well.

2.2 Virtual Wires

Our architectures are designed specifically to keep all wire lengths within a fixed small size. This mechanism allows the compiler to tightly control the constraints that wire lengths place on the clock cycle time. To maximize performance, the compiler uses a spatially-aware partitioner that maps the data and computation onto specific physical locations within the memory array. The compiler heuristically minimizes the wire length between communicating components.

Of course, applications also require communication between remote regions of the chip. The architecture implements such communications by implementing a long "virtual wire" out of multiple short physical wires. Conceptually, a virtual wire is a pipelined connection between the endpoints of the wire. This connection takes several clock cycles to traverse, but allows pipelined communication by occupying only one short physical wire in each clock cycle.

Virtual wires are a key concept for a compiler that optimizes wire performance. Our compiler generates architectures that connect the small memories using virtual wires; short physical wires are statically scheduled for maximum efficiency and minimum hardware interconnect overhead.
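
As a rough illustration of the virtual wire idea in Section 2.2, the following C sketch models a long connection as a chain of pipeline registers, one per short physical wire segment. It is a software model only, not output of the compiler described in this paper; the names virtual_wire, vw_init and vw_clock and the bound VW_MAX_HOPS are assumptions introduced for the example.

```c
#include <stdbool.h>

#define VW_MAX_HOPS 8  /* assumed upper bound on segments, for this sketch only */

/* One pipeline register per short physical wire segment. */
typedef struct {
    int  hops;                  /* number of short segments in use */
    int  data[VW_MAX_HOPS];     /* value held at each stage */
    bool valid[VW_MAX_HOPS];    /* whether that stage holds a value */
} virtual_wire;

/* Reset the wire; hops is clamped to the range 1..VW_MAX_HOPS. */
void vw_init(virtual_wire *w, int hops)
{
    if (hops < 1) hops = 1;
    if (hops > VW_MAX_HOPS) hops = VW_MAX_HOPS;
    w->hops = hops;
    for (int i = 0; i < VW_MAX_HOPS; i++) {
        w->data[i] = 0;
        w->valid[i] = false;
    }
}

/* Advance one clock cycle.
 * If in_valid is true, `in` enters the first segment this cycle.
 * Returns true when a value leaves the last segment, storing it in *out. */
bool vw_clock(virtual_wire *w, bool in_valid, int in, int *out)
{
    int last = w->hops - 1;
    bool arrived = w->valid[last];
    if (arrived)
        *out = w->data[last];

    /* Every value moves forward by exactly one short segment. */
    for (int i = last; i > 0; i--) {
        w->data[i] = w->data[i - 1];
        w->valid[i] = w->valid[i - 1];
    }
    w->data[0] = in;
    w->valid[0] = in_valid;
    return arrived;
}
```

In this model a value injected on one cycle emerges `hops` cycles later, yet a new value can be injected on every cycle, which mirrors the pipelined behavior described above.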
2.3 Custom Logic

To maximize the efficiency of the final circuit, our compiler implements the computation using custom logic. This logic consists of a datapath and finite state control. Because this logic is customized for the specific computation at hand, it can be smaller, consume less power, and execute faster than more general purpose implementations. Examples of specific optimizations include arithmetic units optimized for the specific operations in the program and reductions in datapath widths for data items with small bitwidths. The use of a small finite state control eliminates the area, latency, and power consumption of general-purpose hardware designed to execute a stream of instructions.

Conceptually, the compiler generates hardware resources such as adders for each instruction in the program. In practice, the compiler reuses resources across clock cycles: if two instructions require the same set of resources but execute in different cycles, the compiler uses the same resource for both instructions. This approach preserves the benefits of specialized hardware while generating efficient circuits.

While generating custom logic provides the maximum performance, our approach is also effective for other implementation technologies. Specifically, the compiler could also map the computation effectively onto a platform with reconfigurable logic or programmable ALUs instead of custom logic.

3 Compilation Overview

Our compilation system is responsible for translating an application program into a parallel architecture specialized for that application. To make this translation feasible, our compilation system incorporates both the latest code optimization and parallelization techniques as well as state-of-the-art hardware synthesis technology. Figure 1 shows the overall compiler flow. After reading in the program and performing traditional compiler optimizations, the compiler performs three major sets of transformations. Each set of transformations handles one class of primitives: memory, wires, or logic. Following these transformations, traditional computer-aided-design (CAD) optimization can be applied to generate the final distributed-memory hardware.

[Figure 1 flow: C or Fortran Program -> Traditional Compiler Optimizations -> Small Memory Partitioning -> Virtual Wires Scheduling -> Custom Logic Generation -> Traditional CAD Optimizations -> Hardware]
Figure 1: Compiler Flow

Phase ordering is determined by the relative importance of each primitive in determining overall application performance. First, we analyze and partition program data into memories, and assign computation to those memories. We then schedule communication onto wires. Finally, we generate custom logic to perform all the computation. The following sections describe each component in turn.

Figure 2 introduces a running example which illustrates the steps the smart memory compiler and synthesizer perform. Figure 2(a) shows the initial code and data fed to the compiler. The code is a simple for loop containing affine references to two arrays, A and B. Both data arrays are initially mapped to the same monolithic memory bank. Subsequent sections will discuss how this program is transformed by smart memory compilation, scheduling and hardware generation.

4 Small Memory Formation

The first phase of our compilation system is small memory formation. This phase consists of two main steps. First, the total memory required per application is decomposed into smaller memories based on the application's data access pattern. Second, the computation is assigned to the appropriate memory partitions.

4.1 Memory decomposition

The goal of memory decomposition is to partition the program data across many small memories, yet enforce the attribute that each program data reference can refer to only a single memory bank. Such decomposition enables high aggregate memory bandwidth, keeps the access logic simple, and enables locality between a memory and the computations which access it.

Our memory decomposition problem is similar to that for Raw architectures [20]. The Raw compiler uses static promotion techniques to divide the program data across the tiles in such a way that the location of the data for each access is known at compile time [2, 3]. Our memory compiler leverages the same techniques. At the coarse level, different objects may be allocated to different memories through equivalence class unification. At the fine level, each array is potentially distributed across several memories through the aid of modulo unrolling. We overview each technique below.

Equivalence Class Unification. Equivalence class unification [2] is a static promotion technique which uses pointer analysis to collect groups of objects, each of which has to be mapped to a single memory. The compiler uses SPAN [15], a sophisticated flow-sensitive, context-sensitive interprocedural pointer analysis package. Pointer analysis is used to provide location set lists for every pointer in the program, where every location set corresponds to an abstract data object in the program. A pointer's location set list
[Figure 2 panels, flattened to text from the original diagram]

(a) Data: A[] and B[] in one memory. Code:
    for(i=0;i<100;i++)
      A[i]=A[i]*B[i+1]

(b) Memories: A[] and B[] in separate memories. Code:
    for(i=0;i<100;i++)
      A[i]=A[i]*B[i+1]

(c) Four tiles holding A0[] B0[], A1[] B1[], A2[] B2[], A3[] B3[]. Code:
    for(i=0;i<100;i+=4) {
      A[i]=A[i]*B[i+1]
      A[i+1]=A[i+1]*B[i+2]
      A[i+2]=A[i+2]*B[i+3]
      A[i+3]=A[i+3]*B[i+4]
    }

(d) Same four tiles. Code:
    i'=0;
    for(i=0;i<100;i+=4) {
      A0[i']=A0[i']*B1[i']
      A1[i']=A1[i']*B2[i']
      A2[i']=A2[i']*B3[i']
      A3[i']=A3[i']*B0[i'+1]
      i' = i' + 1
    }

(e) Code assigned to each tile:
    Tile A0/B0:  tmp3=B0[i'+1]   A0[i']=A0[i']*tmp0
    Tile A1/B1:  tmp0=B1[i']     A1[i']=A1[i']*tmp1
    Tile A2/B2:  tmp1=B2[i']     A2[i']=A2[i']*tmp2
    Tile A3/B3:  tmp2=B3[i']     A3[i']=A3[i']*tmp3

(f) Schedule for each tile:
    Tile A0/B0:  1: t=i'+1, tmp3=ldB0(t), t2=ldA0(i')  2: send(tmp3)  3: tmp0=rcv()  4: t3=t2*tmp0  5: stA0(i')
    Tile A1/B1:  1: tmp0=ldB1(i'), t2=ldA1(i')         2: send(tmp0)  3: tmp1=rcv()  4: t3=t2*tmp1  5: stA1(i')
    Tile A2/B2:  1: tmp1=ldB2(i'), t2=ldA2(i')         2: send(tmp1)  3: tmp2=rcv()  4: t3=t2*tmp2  5: stA2(i')
    Tile A3/B3:  1: tmp2=ldB3(i'), t2=ldA3(i')         2: send(tmp2)  3: tmp3=rcv()  4: t3=t2*tmp3  5: stA3(i')

(g) State machine for the upper-left tile:
    switch(pc)
    {
      case 1:
        t=i'+1
        tmp3=ldB0(t)
        t2=ldA0(i')
        pc=2
        break
      case 2:
        send(tmp3)
        pc=3
        break
      case 3:
        tmp0=rcv()
        pc=4
        break
      .....
    }

(h) Hardware diagram with labels: Control FSM, static handshakes, logic, state, memories A[] and B[], i', 1, +, *.

Figure 2: Example. (a) initial program (b) after equivalence class specialization (c) after modulo unrolling step 1 (d) after modulo unrolling step 2 (e) after migration of computation to memories (f) after scheduling of interconnect between memories (g) after state machine generation (h) resulting hardware architecture.
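
The transformation from Figure 2(a) to Figure 2(d) can be sketched in plain C. This is an illustrative software model rather than compiler output: the function names (kernel_monolithic, interleave, deinterleave, kernel_banked), the choice of int elements, and the common bank depth DEPTH are assumptions made for the example. Low-order interleaving places element i in bank i % 4 at offset i / 4, and unrolling by four then makes every reference name exactly one bank.

```c
#define N      100                    /* loop trip count from Figure 2(a) */
#define BANKS  4                      /* number of small memories, as in Figure 2(c) */
#define DEPTH  ((N + BANKS) / BANKS)  /* 26 entries: enough for B[0..100] */

/* Figure 2(a): original loop. B must hold N + 1 elements. */
void kernel_monolithic(int *A, const int *B)
{
    for (int i = 0; i < N; i++)
        A[i] = A[i] * B[i + 1];
}

/* Low-order interleaving: element i goes to bank i % BANKS, offset i / BANKS. */
void interleave(const int *src, int n, int bank[BANKS][DEPTH])
{
    for (int i = 0; i < n; i++)
        bank[i % BANKS][i / BANKS] = src[i];
}

/* Inverse mapping, used here only to compare results. */
void deinterleave(int bank[BANKS][DEPTH], int n, int *dst)
{
    for (int i = 0; i < n; i++)
        dst[i] = bank[i % BANKS][i / BANKS];
}

/* Figure 2(d): after unrolling by 4, each reference touches one fixed bank. */
void kernel_banked(int A[BANKS][DEPTH], int B[BANKS][DEPTH])
{
    int ip = 0;  /* i' in the figure: offset within each bank */
    for (int i = 0; i < N; i += BANKS) {
        A[0][ip] = A[0][ip] * B[1][ip];      /* A[i]   * B[i+1] */
        A[1][ip] = A[1][ip] * B[2][ip];      /* A[i+1] * B[i+2] */
        A[2][ip] = A[2][ip] * B[3][ip];      /* A[i+2] * B[i+3] */
        A[3][ip] = A[3][ip] * B[0][ip + 1];  /* A[i+3] * B[i+4] wraps to bank 0 */
        ip = ip + 1;
    }
}
```

Because the bank index in kernel_banked is a compile-time constant in every reference, the bank-select part of each address needs no run-time computation, which is the property the memory decomposition phase relies on.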
is a list of abstract data objects that it can reference. We use this information to derive alias equivalence classes, which are groups of pointers related through their location set lists (see footnote 1). Pointers in the same alias equivalence class can potentially alias to the same object, while pointers in different equivalence classes can never reference the same object.

Footnote 1: More formally, alias equivalence classes are the connected components of a graph whose nodes are pointers and whose edges are between nodes whose location set lists have at least one common element.

Once the alias equivalence classes are determined, equivalence class unification places all objects for an alias class onto a single memory. This placement ensures that all pointers to that alias class refer to one memory. By mapping objects for every alias equivalence class in such a manner, all memory references can be constrained to addressing a single memory bank. By mapping different alias equivalence classes to different banks, memory parallelism can be attained.

Figure 2(b) shows the results of equivalence class unification on the initial code in Figure 2(a). Since no static reference in the program can address both A[] and B[], pointer analysis determines that A[] and B[] are in different alias equivalence classes. This analysis allows the two arrays to be mapped to different memories, while ensuring that each memory reference only addresses a single memory.

Modulo Unrolling. The major limitation of equivalence class unification is that arrays are treated as single objects that belong to a single alias equivalence class. Mapping an entire array to a single memory sequentializes accesses to that array and destroys the parallelism found in many loops. Therefore, we use a more advanced strategy called modulo unrolling [3] to allow arrays to be distributed across different memories.

Modulo unrolling applies to array accesses whose indices are affine functions of enclosing loop induction variables. First, arrays are partitioned into smaller memories through low-order interleaving. In this scheme, consecutive elements of the array are interleaved in a round-robin manner across memory banks on successive tiles. Next, modulo unrolling shows that it is always possible to unroll loops by certain factors such that in the resulting code all affine accesses go to only one of the smaller memories. In certain cases, additional transformations may be required. One feature of modulo unrolling is that it does not affect the disambiguation of accesses to any other equivalence class. Details on determining the symbolic derivations of the minimum unroll factors required, and the code generation techniques needed, are given in [3].

Figure 2(c) shows the result of the low-order interleaving and unrolling phases of modulo unrolling on the code in Figure 2(b) when the number of desired memories is four. Low-order interleaving splits each array into four sub-arrays whose sizes are a quarter of the original. Modulo unrolling uses symbolic formulas to predict that the unroll factor required for static disambiguation is four. Now we can see that each reference always goes to one tile. Specifically, the A[i], A[i+1], A[i+2] and A[i+3] references access sub-arrays 0, 1, 2 and 3. The B[i+1], B[i+2], B[i+3] and B[i+4] references access sub-arrays 1, 2, 3 and 0. Figure 2(d) shows the code after the code generation required by modulo unrolling. The array references have been converted to references to the partitioned sub-arrays, with appropriately updated indices.

4.2 Computation Assignment

The previous memory partitioning phase has decomposed program data structures into separate memories, with groups of memories clustered together into tiles. We next locate computation close to the most appropriate memory. The load and store instructions that access a memory are always assigned to the tile containing that memory. The remaining computation can be assigned to any tile; its actual assignment is selected based on two factors: minimizing communication between tiles and minimizing the latency of the critical path.

Our algorithm for assigning computation to memory tiles directly leverages the non-uniform resource architecture (NURA) algorithms developed for the Raw Machine [11]. In this algorithm, instruction-level parallelism within a basic block is orchestrated across multiple tiles. This work is in turn an extension of the original MIT Virtual Wires Project [1] work, in which only circuit-level, or combinational, parallelism within a clock cycle was orchestrated across multiple tiles.

The compiler performs computation assignment in three steps: clustering, merging, and placement. Clustering groups together instructions, such that instructions within a cluster have no parallelism that can profitably be exploited across tiles given the cost of communication. Merging combines clusters to reduce the number of clusters down to the number of tiles. Placement performs a bijective mapping from the merged clusters to tiles, taking into account the topology of the interconnect.

Figure 2(e) shows the results of computation assignment in our example. As we described in the previous section, each tile contains two memories, one for each array. Computation is assigned directly into the small memory array. In the figure, each tile has been assigned a subset of the original instructions.

5 Virtual Wires Scheduling

While the previous phase of the compiler produces the small memories, this next phase of the compiler is responsible for managing inter-tile communication. We term this phase Virtual Wires Scheduling.

Although moving program instructions into the memory system eliminates long memory communication latencies in comparison to monolithic processor design, new inter-tile communication paths will now be required for the program to execute correctly. Namely, there are data dependencies between the instructions assigned to each tile.

In Figure 2(e) the data dependences between tiles are explicitly shown as temporary variables introduced in the code. For example, tmp0 needs to be communicated between the upper-left tile and the upper-right tile. Besides data dependencies, there are also control dependencies introduced. However, this control flow between basic blocks is explicitly orchestrated by the compiler through asynchronous global branching [11], an asynchronous mechanism for implementing branching across all the tiles using static communication and individual branches on each tile. Thus control dependencies will be turned into data dependencies as well.

In the same manner as the original virtual wires scheduling algorithm for logic emulation [1], our static scheduler will multiplex logical communications such as tmp0 with other communications between the same source and destination, sharing the same physical channel. The multiplexing of this physical channel will be directly synthesized in the target hardware technology by a following hardware generation phase. Note that communication required between non-neighbor tiles must be routed through an intermediate tile.

In contrast to the Raw space-time scheduler, which targets a simple processor and static switch in each tile, our compilation target is more flexible, allowing multiple instructions to be executed in parallel, with the only constraint being that each memory bank within each tile can be accessed at most once per cycle. This additional flexibility is possible because we are targeting custom hardware instead of a fixed processing unit.

In contrast to compilation for a VLIW, which also involves statically scheduling multiple instructions, we are not constrained by an initial set of function units, register files, and busses. Our scheduler minimizes the execution latency of the program by scheduling to virtual function units as dictated by the application.

As a final note, during scheduling we have essentially relaxed the resource allocation problem for both registers and function units, while retaining the constraints for memory accesses and inter-tile wires. This strategy is similar to virtual register allocation strategies for sequential processors, except that we relax all computation resource constraints. Because we are parallelizing sequential code and are thus predominantly limited by memory and communication costs, this relaxation is feasible. Resource allocation is delayed until the later custom logic generation phase.

Figure 2(f) shows the results of communication scheduling. The send instruction in each tile represents multiplexed communications across the inter-tile channels. The following custom logic generation phase will convert these instructions to pipeline registers and multiplexers controlled by bits in the state machine in each tile.

6 Custom Logic Generation

By the time the custom logic generation phase executes, the previous phases have mapped the data onto the memories, extracted the concurrency, and generated a parallel thread of instructions for each tile. They have also scheduled the execution: each memory access, communication operation and instruction has been mapped to a specific execution cycle. Finally, the space-time scheduler has mapped the computation and communication to a specific location within the memory array.

The custom logic generation phase is responsible for generating a hardware implementation of the communication and computation schedules. First the system generates a finite state machine that produces the cycle-by-cycle controls for all the functional units, registers and memories that make up the tile. Then the finite state machine is synthesized into a technology-independent register-transfer-level specification of the final circuit. We discuss each of these activities below.

6.1 Finite State Machine

As shown in Figure 2(g), the finite state machine (FSM) generator takes the thread of scheduled instructions that has been generated for each tile, and produces a state machine for each thread. This state machine contains a state register, called pc, which serves a function similar to the program counter in a conventional processor. Each state of the FSM corresponds to the work that will be done in a single cycle in the resulting hardware. The resulting FSM will generate both the proper control to calculate the next state, including any calculation to generate branch destinations, and also the proper control signals to the virtual functional units that will be generated in the next phase.

Each state of the FSM contains a set of possibly dependent operations. The FSM generator turns the set of operations in each state into combinational logic. Any dependences between operations in the same state are connected by specifying wires between the dependent functions. For any data values that are live between states (produced in one state and consumed in a different state), the FSM generator allocates a register to hold that value.

6.2 Register Transfer Level

The final phase of compilation involves generating a Register Transfer Level (RTL) model from the finite state machine. The resulting RTL specification for each tile is technology-independent and application-specific. Each tile contains a datapath of inter-connected functional units, several memories, and control elements. Figure 2(h) shows a logic diagram for the circuit that would be generated from the FSM. The datapath is synthesized out of registers (represented by solid black bars in the figure), higher order functional units (such as adders and multipliers) and multiplexers. A controller is specified in RTL to control the actions of the datapath and the inter-processor I/O.

The next step synthesizes logic for the address, data, and enable signals for each memory reference in the FSM. For example, the address signals are shown as arrows going into the A[] and B[] memories in Figure 2(h). Whenever a reference has been statically mapped to a specific memory, the address calculation is specialized to eliminate the bits that select that memory as the target, as described further in Section 6.3. This specialization simplifies the address calculation hardware, so it consumes less power and less area.

The compiler-generated architecture must include communication channels between each FSM for inter-tile communication. The communication synthesis step creates these channels from the inter-tile communication dependencies and synthesizes the appropriate control logic into the FSM. This logic is shown as the multiplexers on the bottom and right hand side of Figure 2(h). In addition, the control logic includes static handshaking signals between each pair of communicating FSMs, as shown in the upper left corner of Figure 2(h).

Finally, the compiler outputs technology-independent Verilog RTL. This step is a modification of the Stanford VeriSUIF Verilog output passes. SUIF data structures are first converted into Verilog data structures and then written as Verilog RTL to an output file. The output fully describes the cycle-level behavior of the circuit, including all state transitions and actions to be taken every cycle.

We have described the steps that are required to compile a program into hardware. In the next section we describe some of the additional optimization opportunities that are available when producing hardware.
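
The per-tile state machine of Section 6.1 can be pictured with a small C model of the upper-left tile from Figure 2(g). This is a hedged sketch, not the generated Verilog: the struct tile0, the function tile0_step, the bank depth DEPTH, and the way the channel is passed as arguments are all assumptions for illustration. States 4 and 5 follow the schedule in Figure 2(f); the return to state 1 with an increment of i' is also an assumption, since the figure elides the remaining states.

```c
#include <stdbool.h>

#define DEPTH 26  /* assumed bank depth for this sketch */

typedef struct {
    int pc;                  /* state register, analogous to a program counter */
    int ip;                  /* i' in the figure: offset within the banks */
    int t2, t3, tmp0, tmp3;  /* registers for values live across states */
    int A0[DEPTH];           /* small memory bank A0[] */
    int B0[DEPTH];           /* small memory bank B0[] */
} tile0;

/* Execute one state, i.e. one clock cycle.
 * rx is the value present on the incoming channel this cycle.
 * Returns true when the tile sends a value, which is stored in *tx.
 * The caller must keep ip + 1 below DEPTH. */
bool tile0_step(tile0 *s, int rx, int *tx)
{
    bool sent = false;
    switch (s->pc) {
    case 1: {
        int t = s->ip + 1;         /* used within this state only: a wire */
        s->tmp3 = s->B0[t];        /* one access to bank B0 */
        s->t2 = s->A0[s->ip];      /* one access to bank A0 */
        s->pc = 2;
        break;
    }
    case 2:
        *tx = s->tmp3;             /* send(tmp3) */
        sent = true;
        s->pc = 3;
        break;
    case 3:
        s->tmp0 = rx;              /* tmp0 = rcv() */
        s->pc = 4;
        break;
    case 4:
        s->t3 = s->t2 * s->tmp0;
        s->pc = 5;
        break;
    case 5:
        s->A0[s->ip] = s->t3;      /* stA0(i') */
        s->ip += 1;                /* assumed loop bookkeeping */
        s->pc = 1;
        break;
    default:
        s->pc = 1;                 /* recover from an unknown state */
        break;
    }
    return sent;
}
```

Note how t exists only inside state 1 and so needs no storage, while tmp3, t2, tmp0 and t3 are produced in one state and consumed in another and therefore become fields of the struct, matching the register allocation rule described above.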
[Figure 3 diagram; labels: i, j, address, A, data]

Figure 3: Example of address specialization. The address calculation for
the array reference A[i][j] can be entirely eliminated. Instead, the values
of i and j are placed on the corresponding address wires.

6.3 Other Custom Logic

Along with the parallelism that is available in custom logic, an additional
advantage is that when the bit-level operations are exposed to the
compiler, the compiler can perform additional optimizations beyond those
found by traditional software optimizers. These include the possibility of
folding constants directly into the circuitry and using other compile-time
knowledge to completely eliminate gates.

Address Calculation. The primary example of hardware specialization
involves simplification of address calculation. As an example, consider
what would happen if the code in Figure 2 had referenced a two-dimensional
array, like A[i][j]. The naive code for this address calculation would be
A + (i * Y_DIM + j) * 4. In the case that Y_DIM was a power of 2, most
compilers would strength-reduce the multiplication operations to shifts to
produce A + (((i << LOG2_Y_DIM) + j) << 2).

In most software systems this is the minimal calculation to reference a
two-dimensional array. In hardware systems, on the other hand, we can
specialize even further. First, since we perform equivalence class
unification, as described in Section 4.1, the A array will live in its own
private memory. Then every reference will lie at an offset from location 0
of the memory for array A, so the first addition operation can be
eliminated. Furthermore, the memory for A can be built to be of the same
width as the data word, so that the final multiplication by 4 can be
completely eliminated, leaving ((i << LOG2_Y_DIM) + j).

While in a conventional RISC processor logical operations require the same
number of cycles as an addition operation, in hardware a logical or
operation is both faster and smaller than an adder. Our compiler performs
bit-width optimization to transform these add operations into or
operations. The compiler calculates the maximum bit-width of j, and finds
that since j < Y_DIM, the value of j never consumes more than LOG2_Y_DIM
bits, so the bits of j and the bits of (i << LOG2_Y_DIM) never overlap. The
addition in this expression can then be optimized away to produce
((i << LOG2_Y_DIM) | j). Finally, since oring anything with 0 produces that
value, the or gates can be eliminated and replaced with wires. The result
is that the two values, i and j, can be concatenated to form the final
address. This final transformation is shown in Figure 3.

Floating Point. For each floating point operation in the program the
compiler replaces the operation with a set of simpler integer and bit-level
micro-operations using a technique similar to [4]. These resulting
micro-operations can then be optimized and scheduled by the datapath
generator. Because the constituent exponent and mantissa micro-operations
are exposed to the compiler, the compiler can exploit the parallelism both
inside individual floating point operations and also between different
floating point operations.

As an example of the specialization that can occur, consider dividing a
floating point number by a constant that is a power of two, e.g.
y = x/2.0. Executing this division operation would take many cycles on a
traditional floating-point execution unit.

However, by exposing the micro-operations required to do the
floating-point computation in terms of operations on the exponent and
mantissa and generating specialized hardware for the computation, the cycle
count of this operation can be reduced by an order of magnitude. The
required micro-operation to perform such a division is to subtract 1 from
the exponent of x. The time to execute the floating-point division
operation reduces to the latency of a single fixed-point subtraction.

7 Experimental Results

We have implemented a compiler that is capable of accepting a sequential
application, written in either C or Fortran, and automatically generating a
specialized architecture for that application. The compiler contains all of
the functionality described in the previous sections of this paper.

This section presents experimental results for an initial set of
applications that we have compiled to hardware. For each application, our
compiler produces an architecture description in RTL Verilog. We further
synthesize this architecture to logic gates with a commercial CAD tool. The
target technology for which we currently report gate counts is the
reference technology provided with the IKOS VirtuaLogic System [9]. This
system is a logic emulator that can be used to validate designs of up to
one million gates, as well as additional custom memories.

Our current timing model, enforced during scheduling, limits the amount of
computation that may be assigned to any one clock cycle to be less than the
equivalent delay of a 32-bit addition or the latency of a small memory
access. The exact clock period of each circuit will be
technology-dependent, but due to this constraint will be similar to that of
a simple processor implemented in the same technology. We report execution
times as total cycle counts for each application.

Table 1 gives the characteristics of the benchmarks used for the
evaluation. These applications include one traditional scientific program
and three multimedia applications. The input programs are all sequential.
Jacobi is an integer version of an iterative relaxation algorithm.
Adpcm-encode is the compression part of the compression/decompression pair
in Adpcm. MPEG-kernel is the portion of MPEG which takes up 70% of the
total run-time. SHA implements a secure hash algorithm.

For reference, we also compare our results with previously published
simulation results from a parallel Raw machine with one to 16 processors.
Because our compiler is based on the same frontend infrastructure for
parallelization as the Raw compiler, this comparison allows us to isolate
the effect of customizing the tile for an application.
Benchmark      Type         Source        Lines     Seq. RT    Primary      Description
                                          of code   (cycles)   array size
Jacobi         Dense Mat.   Rawbench         59     2.38M      32x32        Jacobi Relaxation
Adpcm-encode   Multimedia   Mediabench      133     7.1M       1000         Speech compression
MPEG-kernel    Multimedia   UC Berkeley      86     14.6K      32x32        MPEG-1 Video Software Encoder kernel
SHA            Multimedia   Perl Oasis      608     1.0M       512x16       Secure Hash Algorithm

Table 1: Benchmark characteristics. Column Seq. RT shows the run-time for
the uniprocessor code generated by the Machsuif MIPS compiler.
Benchmark      Logic   Registers   Memory   Total
Jacobi          3830        5064    65536   74430
Adpcm-encode    3786        3704     3210   10700
MPEG-kernel     1833        2312     3584    7729
SHA            35100       12680    16384   64164

Table 2: Gate counts for one tile.

[Bar chart: Speedup (y-axis, 0 to 16) for Adpcm-encode, Jacobi, MPEG-kernel and SHA; legend: mips R2000, customized]
Figure 4: Base comparison

7.1 Base Comparison

Figure 4 presents execution times for each application in comparison with
execution time on a single MIPS R2000 processor (the basic component of one
Raw tile). Roughly speaking, customization results in an order of magnitude
fewer clock cycles (a factor of 5 to 10) required to perform the same
computation. As we expect the clock cycle times to be similar, this
reduction will translate directly into proportional performance increases
for custom hardware.

Table 2 presents the gate counts for the single-tile compilation. We assume
that one gate is approximately equivalent to one byte of memory for SRAM
comparisons. Note that the logic and register gate counts for each
application except SHA are smaller than the size of a simple processor
(about 20K gates).

7.2 Parallelized Comparison

For the parallelized case, we report results for two compilation
strategies: a hardwired system and a virtual wired system. The hardwired
system allows direct, dedicated wires between arbitrary memories and
assumes that signals can propagate the full length of each wire in one
clock cycle. The total cycle count is always smaller for the hardwired
system because there are no cycles dedicated to scheduling the wires.
Additionally, the hardwired systems do have additional hardware synthesized
for multiplexing the interconnect. It therefore provides a good comparison
point that allows us to isolate the costs of virtual wire scheduling. Bear
in mind, however, that for large circuits the hardwired system would
contain long wires with large propagation delays. These delays slow down
the clock, degrading the overall performance of the application. The
scaling factors in future VLSI systems will only exacerbate this
phenomenon, making virtual wires scheduling a necessity.

Figure 5 presents the resulting speedups for Jacobi and MPEG, the two
applications that are parallelizable, as we increase the number of tiles.
Note that for both cases each tile already has multiple memory banks and
takes advantage of ILP even for the single-tile case. For both the
hardwired and the virtual wired case, performance continues to increase as
we add more hardware. The virtual wired case must pay the penalty of
communication costs, but nevertheless, absolute performance remains well
above even a 16-processor Raw machine. Raw achieves better scalability for
Jacobi because it does not take advantage of the initial amount of
instruction-level parallelism in the one-processor case.

Figure 6 reports the increase in hardware area, including memory, as the
number of tiles is increased. Note that for the virtual wires case, the
hardware area grows more rapidly because of the increasing amount of
communication logic that is required. In Figure 7, we show the hardware
composition, in percentages, of each architecture. Notice how the memory
starts out taking a fairly large percentage of the area, but as we decrease
the granularity of the memory system by adding more tiles, the system
becomes more balanced. As we totally dominate the memory area with
additional hardware, the speedup curves tail off.

Implications for Configurable Hardware

Note that while we have not directly considered reconfigurable hardware,
such as the FPGA-based systems described in the related work, if a
reconfigurable system were to implement virtual wires and small memories
directly in the fabric, we believe similar performance gains might be
achievable. However, we should note that in reconfigurable logic-based
architectures the potential clock speeds will most likely not be on an
equal basis, and an appropriate speed penalty for the custom logic will
need to be taken into account.

8 Related Work

Silicon compilation has existed in one form or another since the early days
of designing chips. Even before schematic capture, designers wrote programs
which directly generated the geometry and layout for a particular design.
In the past twenty years various research has proposed to compile PASCAL,
FORTRAN, C, and Scheme into hardware. We must be careful to distinguish
between compiling a program into hardware, and describing hardware with a
higher-level
[Two line plots of Speedup versus Ntiles (0 to 16): "Speedup scalability for jacobi" (y-axis up to 128) and "Speedup scalability for mpeg" (y-axis up to 64); series: Custom-hard-wires, Custom-virtual-wires, Raw]
Figure 5: Speedup for jacobi and mpeg.
[Two line plots of Gates versus Ntiles (0 to 16): "Gate count for jacobi" (y-axis up to 300000) and "Gate count for mpeg" (y-axis up to 100000); series: Custom-virtual-wires, Custom-hard-wires]
Figure 6: Hardware Size
[Bar charts of hardware composition in percent (0 to 100) for Jacobi and Mpeg, each shown for Virtual wires and Hardwires at 1, 2, 4, 8 and 16 tiles; components: Logic, Registers, Memory]
Figure 7: Hardware composition.
language; they are not the same. In our work, we compile sequential
programs that describe an algorithm in a manner easy for the programmer to
understand.

Recent work in RTL and behavioral synthesis [18] involves synthesizing
higher-level algorithms into hardware. Where possible, we leverage this
work in our compilation process. For example, we leave resource allocation
to be performed during the following synthesis pass. However, to our
knowledge this is the first system which can manage memory, multiplex
wires, and transform general high-level programming languages directly into
custom hardware.

We continue by discussing two additional areas of related work: parallel
architectures and reconfigurable architectures.

8.1 Parallel Architectures

Other researchers have parallelized some of the benchmarks in this paper.
Automatic parallelization has been demonstrated to work well for dense
matrix scientific codes [8]. In contrast to this work, our approach to
generating parallelism stems from an ability to exploit fine-grain ILP,
rather than the coarse-grain parallelism targeted by [8]. Multiprocessors
are mostly restricted to such coarse-grain parallelism because of their
high costs of communication and synchronization. Unfortunately, finding
such coarse-grain parallelism often requires whole-program analysis by the
compiler, which works well only in restricted domains. In a custom
architecture, we can successfully exploit ILP because of the register and
wire-level latencies in hardware. Of course, hardware can exploit
coarse-grain parallelism as well.

Software distributed shared memory schemes on multiprocessors (DSMs)
[16] [5] are similar in spirit to our approach for managing memory. They
emulate in software the task of cache coherence, one which is traditionally
performed by complex hardware. In contrast, this work turns sequential
accesses from a single memory image into decentralized accesses across
multiple small memories. This technique enables the parallelization of
sequential programs into a high-bandwidth memory array.

Both the Berkeley IRAM research project [10] and Stanford's new Smart
Memories Project [12] focus on building future-generation computing systems
that are more tightly coupled with memory. The IRAM approach is to improve
performance of the memory system by fitting more data on a chip. They
achieve this goal by using high-density dynamic RAM (DRAM) memory instead
of lower-density SRAM caches and treating the on-chip DRAM memory as the
main memory instead of a redundant copy. The Smart Memories Project's
stated purpose is to build a future-generation computing system that
provides efficiency, generality, and programmability in a single system.

8.1.1 Reconfigurable Architectures

Reconfigurable computing systems, each comprising an array of
interconnected Field Programmable Gate Array (FPGA) devices, are common
hardware targets for application-specific computing. Splash [6] and PAM
[19] are the first substantial reconfigurable computing systems. As part of
the Splash project, a team led by Maya Gokhale ported data-parallel C [14]
to the Splash reconfigurable architecture. This effort was one of the first
to actually compile programs, rather than design hardware, for a
reconfigurable architecture. Designs reported to take months to design
could be written in a day. While data-parallel C extended the language to
handle bit-level operations and systolic communication, all control flow is
managed by the host. Hardware compilation was only concerned with basic
blocks of parallel instructions. This approach has been ported to National
Semiconductor's new processor/FPGA chip based on the CLAy architecture
[13].

Programmable Active Memories [19], designed at Compaq Paris Research Lab,
interface to a host processor via memory-mapped I/O. The programming model
is to treat the reconfigurable logic as a memory capable of performing
computation. The actual design of the configuration for each PAM
application was specified in a C-syntax hardware description language.

In the PRISM project [17], functions derived from a subset of C are
compiled into an FPGA. The PRISM-I subset included if-then-else as well as
for loops of fixed count. The PRISM-II subset included variable-length for
loops, while, do-while, switch-case, break, and continue.

Other projects include compilation of Ruby [7], a language of functions and
relations, and compilation of vectorizable loops in Modula-2 for
reconfigurable architectures [21].

9 Conclusion

In this paper we have described and evaluated a novel compilation system
that allows sequential programs written in C and Fortran to be compiled
directly into application-specific hardware substrates.

Our approach extends known techniques from parallelizing compilers, such as
memory disambiguation, static scheduling and data partitioning. We focus on
memory and wires first, and then computation. We start by partitioning the
data structures in the program into an array of small, fast memories. In
our model, computation is performed by custom logic computing elements
interspersed among these memories. The compiler leverages the
correspondence between memory and computation to place the computation
close to the memory that it accesses, such that communication costs are
minimized. Similarly, a static schedule is generated for the interconnect
that optimizes wire utilization and minimizes interconnect latency.
Finally, the specialized hardware for smart memories, virtual wires, and
other custom logic is compiled to hardware in the form of a multi-process
state machine in synthesizable Verilog.

With this compilation system we have obtained specialization performance
improvements of up to an order of magnitude versus a single general-purpose
processor and additional parallelization speedups similar to those
obtainable using a tightly interconnected multiprocessor.

References

[1] J. Babb, R. Tessier, M. Dahl, S. Hanono, D. Hoki, and A. Agarwal.
    Logic Emulation with Virtual Wires. IEEE Transactions on Computer
    Aided Design, 16(6):609-626, June 1997.

[2] Rajeev Barua, Walter Lee, Saman Amarasinghe, and Anant Agarwal. Maps:
    A Compiler-Managed Memory System for Raw Machines. Technical
    report, M.I.T. LCS-TM-583, July 1998. Also
    http://www.cag.lcs.mit.edu/raw/.

[3] Rajeev Barua, Walter Lee, Saman Amarasinghe, and Anant Agarwal. Memory
    Bank Disambiguation using Modulo Unrolling for Raw Machines. In
    Proceedings of the ACM/IEEE Fifth Int'l Conference on High Performance
    Computing (HIPC), Dec 1998. Also http://www.cag.lcs.mit.edu/raw/.

[4] William J. Dally. Micro-optimization of floating-point operations. In
    Proceedings of the Third International Conference on Architectural
    Support for Programming Languages and Operating Systems, pages
    283-289, Boston, Massachusetts, April 3-6, 1989.

[5] Sandhya Dwarkadas, Alan L. Cox, and Willy Zwaenepoel. An integrated
    compile-time/run-time software distributed shared memory system. In
    Proceedings of the Seventh International Conference on Architectural
    Support for Programming Languages and Operating Systems, pages
    186-197, Cambridge, Massachusetts, October 1-5, 1996.

[6] Maya Gokhale, William Holmes, Andrew Kopser, Sara Lucas, Ronald
    Minnich, Douglas Sweeney, and Daniel Lopresti. Building and using a
    highly parallel programmable logic array. Computer, 24(1), January
    1991.

[7] Shaori Guo and Wayne Luk. Compiling Ruby into FPGAs. In Field
    Programmable Logic and Applications, August 1995.

[8] M. W. Hall, J. M. Anderson, S. P. Amarasinghe, B. R. Murphy, S.-W.
    Liao, E. Bugnion, and M. S. Lam. Maximizing multiprocessor performance
    with the suif compiler. COMPUTER, 29(12):84-89, December 1996.

[9] IKOS Systems, Inc. VirtuaLogic Emulation System Documentation, 1996.
    Version 1.2.

[10] Christoforos E. Kozyrakis, Stylianos Perissakis, David Patterson,
    Thomas Anderson, Krste Asanovic, Neal Cardwell, Richard Fromm, Jason
    Golbus, Benjamin Gribstad, Kimberly Keeton, Randi Thomas, Noah
    Treuhaft, and Katherine Yelick. Scalable processors in the billion
    transistors era: IRAM. IEEE Computer, pages 75-78, September 1997.

[11] Walter Lee, Rajeev Barua, Matthew Frank, Devabhaktuni Srikrishna,
    Jonathan Babb, Vivek Sarkar, and Saman Amarasinghe. Space-Time
    Scheduling of Instruction-Level Parallelism on a Raw Machine. In
    Proceedings of the Eighth ACM Conference on Architectural Support for
    Programming Languages and Operating Systems, pages 46-57, San Jose,
    CA, October 1998.

[12] Mark Horowitz, personal communications. Stanford University Smart
    Memories Project. http://velox.stanford.edu/smart_memories.

[13] Maya B. Gokhale, Janice M. Stone, Matthew Frank, Sarnoff Corporation.
    NAPA C: Compiling for a Hybrid RISC/FPGA Architecture. In FCCM98,
    Napa Valley, California, April 1998.

[14] Maya Gokhale and Brian Schott. Data-Parallel C on a Reconfigurable
    Logic Array. Journal of Supercomputing, September 1995.

[15] Radu Rugina and Martin Rinard. Span: A shape and pointer analysis
    package. Technical report, M.I.T. LCS-TM-581, June 1998.

[16] Daniel J. Scales, Kourosh Gharachorloo, and Chandramohan A. Thekkath.
    Shasta: A low overhead, software-only approach for supporting
    fine-grain shared memory. In Proceedings of the Seventh International
    Conference on Architectural Support for Programming Languages and
    Operating Systems, pages 174-185, Cambridge, Massachusetts, October
    1-5, 1996.

[17] A. Smith, M. Wazlowski, L. Agarwal, T. Lee, E. Lam, P. Athans, H.
    Silverman, and S. Ghosh. PRISM II Compiler and Architecture. In
    Proceedings IEEE Workshop on FPGA-based Custom Computing Machines,
    pages 9-16, Napa, CA, April 1993. IEEE.

[18] Synopsys, Inc. Behavioral Compiler User Guide, V 1997.08, August
    1997.

[19] J.E. Vuillemin, P. Bertin, D. Roncin, M. Shand, H. H. Touati, and P.
    Boucard. Programmable Active Memories: Reconfigurable Systems Come of
    Age. IEEE Transactions on VLSI Systems, 4(1), March 1996.

[20] Elliot Waingold, Michael Taylor, Devabhaktuni Srikrishna, Vivek
    Sarkar, Walter Lee, Victor Lee, Jang Kim, Matthew Frank, Peter Finch,
    Rajeev Barua, Jonathan Babb, Saman Amarasinghe, and Anant Agarwal.
    Baring It All to Software: Raw Machines. IEEE Computer, 30(9):86-93,
    September 1997. Also available as MIT-LCS-TR-709.

[21] M. Weinhardt. Compilation and Pipeline Synthesis for Reconfigurable
    Architectures - High Performance by Configware. In Reconfigurable
    Architecture Workshop, 1997.