

IEEE TRANSACTIONS ON COMPUTERS, VOL. 50, NO. 1, JANUARY 2001

Architecture of the Atlas Chip-Multiprocessor:
Dynamically Parallelizing Irregular Applications
                  Lucian Codrescu, Member, IEEE, D. Scott Wills, Senior Member, IEEE, and
                                       James Meindl, Fellow, IEEE

       Abstract—Single-chip multiprocessors are an important research direction for future microprocessors. The stigma of this approach is
       that many important applications cannot be automatically parallelized. This paper presents a single-chip multiprocessor that engages
       aggressive speculation techniques to enable dynamic parallelization of irregular, sequential binaries. Thread speculation and data
       value prediction are combined to enable the processor to execute dependent threads in parallel. The architecture performs a novel
       form of dynamic thread partitioning and includes an aggressive correlated value predictor. Microarchitectural structures manage
       interthread data and control dependencies. On an eight processor system, simulated execution of SPECint95 binaries delivers a
       speedup of 3.4 over a scalar in-order uniprocessor. This improvement is due entirely to the exploitation of dynamically extracted thread
       level parallelism.

       Index Terms—Thread speculation, multithreading, value prediction, multiscalar, chip-multiprocessor, parallelization.



1 INTRODUCTION

As semiconductor technology pushes toward billions of fast transistors on a chip, interconnect delay and design complexity become critical issues [21], [24], [25]. The processor architectures popular today exploit Instruction Level Parallelism (ILP). The fundamental ILP techniques in use today were developed in an era when transistor size and delay dominated [36], [37]. Extending these inherently centralized architectures to future technology may prove infeasible. A single-chip multiprocessor naturally exploits locality and removes long interconnect delay from the processor cycle time. Replication of a single processor design reduces complexity. A primary concern with this architecture is that many important programs cannot be automatically parallelized. Wide-scale parallel programming is still a distant dream and compilers have difficulty parallelizing irregular, nonnumeric programs. In addition, the conventional multiprocessor requires source code recompilation, leading to poor performance on existing "dusty-deck" binaries. What is needed is a way to execute a sequential, irregular binary application in a multiprocessor environment. Toward that goal, this paper presents a detailed evaluation of the Atlas chip-multiprocessor [8].

   The architecture combines thread speculation (multiscalar execution), aggressive data value prediction, and dynamic thread partitioning in a chip-multiprocessor. Thread speculation allows for parallel execution of threads in the presence of ambiguous data and control dependencies. Hardware predicts thread control flow, executes future threads speculatively, and later verifies the speculations. Threads are issued and retired in order, thereby ensuring sequential semantics.

   Predicting the values that flow between threads is needed for two reasons. First, it can be used to remove the need for explicit synchronization. Rather than using a compiler to schedule dependency communication, the hardware will predict the values and later verify the predictions. Second, data value prediction is used to increase parallelism. A value produced late in a thread and consumed early in the next thread will sequentialize thread execution. Data value prediction [18] has been recognized as a way to break true data dependencies by predicting the value that an instruction will consume and executing it in parallel with the producing instruction. The Atlas architecture has novel mechanisms to coordinate data value speculations in the context of a chip-multiprocessor.

   The architecture extracts threads dynamically from a sequential program. This is accomplished with a new partitioning algorithm called MEM-slicing [7]. In fixed-interval partitioning, threads are sliced at regular intervals in the dynamic instruction stream. Atlas improves upon fixed-interval partitioning by forcing intervals to end at memory instructions (loads and stores). Because memory instructions often begin or end data dependency chains, natural thread boundaries are discovered dynamically. In irregular applications, this technique provides substantial improvement over loop-based, procedure-based, or fixed-interval partitioning.

. The authors are with the School of Electrical and Computer Engineering, Georgia Institute of Technology, 777 Atlantic Dr., Atlanta, GA 30332-0250. E-mail: {lucian, scott.wills, james.meindl}
Manuscript received 9 Sept. 1999; revised 14 Sept. 2000; accepted 25 Oct. For information on obtaining reprints of this article, please send e-mail to: , and reference IEEECS Log Number 110557.
0018-9340/01/$10.00 © 2001 IEEE

   SPECint95 benchmarks are executed on a detailed execution-driven simulator that carefully models the architecture. Sophisticated speculation techniques and proper thread partitioning are shown to be important tools to extract thread-level parallelism. Using unmodified sequential SPECint95 binaries, speedup due to thread parallelism averages 3.4 on eight processors. By using in-order processors, both as a comparison point and as the elements of the multiprocessors, the effect of the reported

technique can be isolated so that future research can study
the interaction of the latest out-of-order ILP techniques
with TLP. The contribution of this paper is to present and to
evaluate the architecture of a TLP chip-multiprocessor that
dynamically parallelizes sequential binary applications.
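The MEM-slicing partitioning policy described above can be sketched in a few lines. This is an illustrative model only, not the Atlas hardware algorithm: the opcode-string stream encoding, the interval length, and the function names are hypothetical choices made for the example.

```python
# Sketch of the MEM-slicing idea: fixed-interval partitioning cuts the
# dynamic instruction stream every `interval` instructions, while MEM-slicing
# delays each cut until the next memory instruction (load or store), so that
# thread boundaries fall where data-dependency chains tend to begin or end.
# The stream encoding and interval length are hypothetical, for illustration.

def fixed_interval_partition(stream, interval):
    """Slice the dynamic stream into threads of exactly `interval` instructions."""
    return [stream[i:i + interval] for i in range(0, len(stream), interval)]

def mem_slice_partition(stream, interval):
    """Cut at the first load/store reached at or after each interval boundary."""
    threads, start = [], 0
    while start < len(stream):
        cut = start + interval
        # Extend the thread until its last instruction is a memory access.
        while cut < len(stream) and not stream[cut - 1].startswith(("LD", "ST")):
            cut += 1
        threads.append(stream[start:cut])
        start = cut
    return threads

# Toy dynamic instruction stream (opcode mnemonics only).
stream = ["ADD", "LD", "MUL", "SUB", "ST", "ADD", "LD", "XOR", "ADD", "ST"]
print(fixed_interval_partition(stream, 3))  # cuts blindly every 3 instructions
print(mem_slice_partition(stream, 3))       # every thread ends at a LD or ST
```

With this toy stream, fixed-interval partitioning produces threads that end at arbitrary instructions, while MEM-slicing stretches each thread until a load or store closes it.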

1.1 Related Work
Speculative multithreading was introduced by the Multi-
scalar [12], [33] architecture. This design uses the compiler
to divide the program into threads and schedule interthread
register communication. Hardware is responsible for thread
control predictions, speculative buffering, memory disam-
biguation, synchronizing register communication, and
misspeculation recovery. The architectures presented in
[10], [11], [17], [26], [34], [35], [40] also perform speculative
multithreading. None of these architectures speculate on
the values that flow between threads and all require source
code recompilation.
   The most relevant research to this work is the trace processor [29], the DMT processor [3], and the SM processor [19], which attempt to predict values between dynamically generated threads. There are three important differences between those architectures and the one presented here. The first concerns how threads are generated. Those architectures partition threads on cache line boundaries (trace processor), loop iterations (SM), or the code following procedures and loops (DMT). The Atlas multiprocessor uses a new automatic partitioning policy that provides fine-grain, data- and control-predictable threads by selectively starting threads at memory instructions [7]. The second feature that differentiates this design is a focus on extremely aggressive correlated value prediction. Other architectures use simpler value predictors. It will be shown that the performance of the value predictor is crucial. Finally, this architecture emphasizes enhancing a conventional shared-memory multiprocessor. The trace and SM processors are dedicated designs not intended to execute explicit parallelism (provided by the user or compiler). The DMT design is based on a simultaneous multithreaded [38] design, which is in turn based on a wide-issue superscalar. Superscalar processors have been questioned as the right direction for future technologies from a hardware (interconnect and complexity) viewpoint [25], [27].

   This paper is divided into five sections. Section 2 describes the execution model with an example. Section 3 details the microarchitecture, Section 4 presents simulation results, and Section 5 concludes.

2 EXECUTION MODEL

The execution model, shown in Fig. 1, combines thread speculation (multiscalar execution) with interthread data value prediction. A program is first divided into threads. A thread is defined as any contiguous sequence of dynamic execution and could consist of basic blocks, loop iterations, or procedures, for example.

Fig. 1. Execution model: thread and value speculation (procedure-level threads shown).

   In this execution model, threads need not be data independent (most SPECint95 threads are, in fact, data dependent on previous threads). The first thread is assigned to a processor for execution (thread A in Fig. 1). The next thread is predicted using control speculation and is then assigned to another processor. Control speculations continue until all processors are busy (in Fig. 1, thread B is allocated, followed by thread D).

   Data value prediction is used to ease data dependency stalls on speculative processors (in Fig. 1, the value of x is predicted in D rather than waiting for A to produce the value). Threads are issued and retired in order. When all speculative predictions are verified, a processor is allowed to update architectural machine state (in Fig. 1, D updates architectural memory and registers once it is determined that B and D are the correct control path, the actual value of x is known, and A and B have retired).

   The SPECint95 application m88ksim will be used as a motivating example. The program is an instruction-level simulator of the m88000 CPU. This benchmark contains a large procedure, datapath(), which is repeatedly called to simulate the execution of an instruction. High-level pseudocode of this program is shown in Fig. 2.

Fig. 2. m88ksim pseudocode.

   A typical iteration of this loop executes 900 dynamic instructions. The loop contains many iteration-carried true data dependencies. Often, these dependencies are produced late in the loop body and consumed early in the next iteration.

   Fig. 3 shows conventional execution of three iterations of this loop. The variable PC forms a RAW dependency that is produced at the end of the loop body and consumed at the beginning of the next iteration. The assignment of the PC variable is control dependent on earlier work in the loop body, such as checking the scoreboard. It would therefore be difficult for a compiler to ease this data dependency with a more favorable code schedule.

Fig. 3. m88ksim conventional execution.

   Fig. 4 shows the same code executing under thread speculation with interthread data value prediction. All three iterations may execute in parallel if the live input variables (PC in iterations 2 and 3 and R2 in iteration 3) can be correctly predicted. Sophisticated value prediction mechanisms are engaged for this purpose. Section 3.2 discusses a new hybrid value predictor that is capable of achieving 97 percent accuracy on the prediction of PC in m88ksim.

Fig. 4. m88ksim execution with thread and value speculation.

3 PROCESSOR OVERVIEW

This section details the processor microarchitecture. The general philosophy is to start with a single-chip multiprocessor and add the mechanisms necessary to support thread and value speculation. Much of the functionality needed could be implemented in hardware or software. For example, the compiler could create recovery code to handle value mispredictions or even generate code to create value predictions [13]. This work focuses on fully dynamic techniques in order to achieve parallel processing on unmodified, sequential binaries.

   Fig. 5 shows a top-level view of the envisioned architecture. Eight processors are integrated together with a global L2 cache and a global control and value predictor. The processors communicate with the global structures through a shared bus. Two buses are shown, one for accessing L2 data values and another for communicating control and data predictions. Processors communicate with one another via a pipelined, bidirectional ring interconnect. This is a natural interconnect for thread speculation, as processors primarily communicate with their two nearest neighbors (produced values are sent to the more speculative processor and consumed values come from the less speculative processor).

Fig. 5. Architecture block diagram.

   The processors could be any conventional design, such as pipelined scalar, superscalar, VLIW, etc. The processors are assumed to be simple in-order pipelined designs. This design point was chosen to help focus the research on speculative multithreading. With simple nodes, interactions between TLP and intranode ILP are removed. While this does exaggerate the impact of TLP over designs with more complex nodes, it also helps to isolate the execution behaviors due to TLP.

3.1 Node Microarchitecture
Fig. 6 shows the architecture of a single node, which is loosely based on the Alpha 21164 [1]. Shaded regions are structures that are new or modified for thread and value speculation. Their functionality is described briefly below and then in detail in the following sections.

Fig. 6. Node microarchitecture.

   Interthread data dependencies are handled using two mechanisms, data dependence speculation [22] and data value prediction [18]. Live register inputs are always value predicted. A hybrid global value predictor makes aggressive, 2-level, correlated value predictions. All value predictions are tracked within the processors via a small, fully associative queue called the Active Ins Queue. For memory dependencies, a dependency predictor is first consulted to decide if a communicating STORE is likely. If it is not likely, the processor LOADs the value from cache and proceeds. The primary data cache is modified to track these speculations, in a manner similar to other designs [15], [17], [26], [34]. Recovery from this type of misspeculation is achieved by squashing the thread and restarting. Should the dependency predictor indicate a communicating STORE is likely, the data value predictor is accessed. These predictions are tracked in a similar manner to register value predictions.

   A recovery queue [3] is added to each processor to provide fine-grained value misspeculation recovery. If a register or memory data value prediction is incorrect, only instructions dependent on the incorrect value reexecute.

3.1.1 Dependency Predictor
A dynamic Dependency Predictor is used to speculate on the existence of data dependencies through memory. It is important to establish whether or not a memory input is likely to be live. Many LOADs reference data that has not been written in a long time. Often, programs initialize data structures at the beginning of their execution that are subsequently read-only. Performing value prediction on these accesses is harmful and unnecessary. When a speculative processor encounters a LOAD whose address has not been seen locally, the processor has no a priori knowledge of whether or not a STORE to the same address will occur on a less speculative processor. Often, the existence of data dependencies is control dependent.

   Low-order bits from the LOAD instruction address are used to index a simple table. The LOAD instruction address is used, rather than the LOAD's effective address, as this has been found to produce more stable results. Many programs communicate from the same LOAD-STORE instruction pair through different memory locations. For example, memory accesses through arrays will change the effective address while the LOAD and STORE instruction addresses remain constant.

   The dependency predictor is heavily skewed toward predicting live dependencies. It is designed to capture a sequence of 64 consecutive no-dependence events. Each table entry contains six bits. In the lookup process, if a table entry returns an all-1s bit pattern (111111), then the dependency prediction is dead (a STORE on a less speculative processor to the same address is unlikely). Any other bit pattern is interpreted as live (a communicating STORE on a less speculative processor is likely). When a LOAD is discovered to communicate with an active STORE, the corresponding table entry is cleared; otherwise, a saturating increment is performed on the table entry. The structure is shown in Fig. 7.
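The table lookup and training policy just described can be sketched behaviorally. This is a software model of the mechanism, not the hardware itself: the table size, the plain low-order-bit index, and the Python class shape are illustrative assumptions, while the six-bit saturating entries, the all-1s "dead" pattern, and the clear-on-communication bias follow the text.

```python
# Behavioral sketch of the live-biased dependency predictor: a table of 6-bit
# saturating counters indexed by low-order bits of the LOAD's instruction
# address. A LOAD is predicted dead only when its counter has saturated at
# 0b111111 (a long run of consecutive no-dependence events); any other value
# predicts live. Table size and indexing are illustrative, not the Atlas ones.

TABLE_BITS = 10          # hypothetical table size (1,024 entries)
SATURATED = 0b111111     # six bits, all ones -> predict "dead"

class DependencyPredictor:
    def __init__(self):
        self.table = [0] * (1 << TABLE_BITS)

    def _index(self, load_pc):
        return load_pc & ((1 << TABLE_BITS) - 1)  # low-order PC bits

    def predict(self, load_pc):
        """Predict 'dead' only for a fully saturated entry; otherwise 'live'."""
        return "dead" if self.table[self._index(load_pc)] == SATURATED else "live"

    def update(self, load_pc, communicated):
        i = self._index(load_pc)
        if communicated:          # LOAD received data from an in-flight STORE:
            self.table[i] = 0     # clear the entry, training hard toward "live"
        elif self.table[i] < SATURATED:
            self.table[i] += 1    # saturating increment toward "dead"

p = DependencyPredictor()
pc = 0x400120
for _ in range(62):
    p.update(pc, communicated=False)
print(p.predict(pc))   # "live": counter not yet saturated
p.update(pc, communicated=False)
print(p.predict(pc))   # "dead": counter saturated at 0b111111
p.update(pc, communicated=True)
print(p.predict(pc))   # "live": one communicating STORE clears the entry
```

Note the asymmetry the paper motivates next: many consecutive no-dependence events are needed before a LOAD is predicted dead, but a single observed communication immediately restores the cheap "live" prediction.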

Fig. 7. Live-biased dependency predictor.

   The reason this dependency predictor is biased toward predicting live is the relative penalty associated with mispredictions. If a dependency is predicted dead and subsequently is found to be live, the penalty is severe. The entire thread must be discarded and reexecuted. On the other hand, if the prediction is for a live dependency and it is found to be dead, the penalty varies. A value prediction will be performed in this case. Should the value prediction match the actual value, there is no penalty. If the value prediction is incorrect, fine-grained recovery is performed. As described in Section 3.1.5, this can be very inexpensive. The net result is that live mispredictions are, on average, much cheaper than dead mispredictions; hence, the bias toward predicting live.

3.1.2 Speculative Data Cache
In addition to the normal caching functions, the Speculative Data Cache tracks memory addresses that are currently being predicted dead. When the dependency predictor predicts dead, a bit in the speculative data cache is set, indicating that the cache line is under speculation. The Speculative Data Cache then snoops STOREs from the nonspeculative processor to verify LOADs that have been predicted dead. Should a STORE be snooped to a cache line with a speculation bit set, a violation has occurred and the thread must be recovered. This structure is similar to others described in the literature [15], [17], [26], [34].

   A line in the speculative data cache with an active speculation bit cannot be evicted while the thread is speculative. Should the cache need to replace a line, for example, and only lines with active speculations are available for replacement, the processor must stall until it becomes nonspeculative. This is because the active speculation bits in the line are the tracking mechanism for data dependency speculation. If they are evicted, the processor has no means of tracking these speculations.

   Should a dead dependency speculation go wrong, the dependency predictor needs to be trained toward live (the table entry cleared). This is not immediately possible because the speculative data cache uses the effective address to reference data, while the dependency predictor uses the instruction address. At the time a STORE snoop occurs to a line marked under speculation, only the effective address is known, not the LOAD instruction address. The situation is handled by using a special register and delaying the predictor training event. The thread must be reexecuted because of the misprediction. After restarting, the processor will eventually perform a LOAD to the effective address that caused the original problem. At that point, the instruction address is known. When the snoop causes a thread restart, the effective address is saved in a special register. As the processor performs LOADs after restart, the effective addresses are compared against the offending address. When a match occurs, the instruction address is known and is then used to clear the appropriate entry in the dependency predictor.

3.1.3 Value Cache
When the dependency predictor indicates a LOAD is likely live, a data value prediction is made. The data value prediction can come from the global value predictor, or a current prediction may already reside in the Value Cache. The Value Cache is a small associative structure used by the global value predictor to manage striding and pattern-based predictions. When a value prediction for a variable is requested on any processor, the global value predictor makes the corresponding prediction for all speculative processors. Predictions are made starting with the least speculative processor and then advancing to the most speculative processor. The predictions are stored in the value cache, which is a table that stores the prediction and a bit indicating whether the prediction is current or not. Predictions are made "in bulk" to ensure the accuracy of striding and pattern-based predictions. Often, all threads need a prediction for the same variable. But prediction requests to the global value predictor do not necessarily come in order. Request order depends on intrathread control flow, cache hits/misses, etc. When a processor needs a value prediction, it first checks the value cache. Often, the prediction is already there, as another processor has already requested it. This value prediction prefetching helps reduce stalls waiting for value predictions from the global predictor. The global predictor also sets the current bit along with a prediction. The current bits are all cleared when a thread retires.

3.1.4 Active Ins Queue
Whenever a value prediction is performed on a speculative processor, the prediction is tracked in the Active Ins queue. This is a small associative structure that keeps an in-order list of the active value predictions on the node. When a value prediction is made, an entry is allocated at the end of the queue. The Active Ins queue records the effective address or register number being predicted, the value used as a prediction, the address of the instruction that initiated the prediction, and a few control bits. The structure snoops STOREs from the nonspeculative node. When a STORE hits in the Active Ins queue (a STORE is done which communicates with a later LOAD), a comparison is made between the STORE value and the predicted value. A bit in the Active Ins entry is used to indicate a miscompare. In the case of a miscompare, the thread is not immediately recovered. It could be that many STOREs are done to the same effective address. Only the value of the last performed STORE is important. When the thread becomes the nonspeculative thread, if any miscompare bits are set in the Active Ins queue, the thread must recover the misspeculation. When a LOAD is entered in the Active Ins queue, the current stable (nonspeculative) version is loaded from cache and compared against the value prediction. The miscompare bit is initially set to the result of this comparison. This is needed in case another processor never does a STORE to the memory location. This is illustrated in Fig. 8.

Fig. 8. Active input tracking with the Active Ins Queue.

3.1.5 Recovery Queues
Recovery from data value mispredictions is accomplished in a fine-grained manner. Only instructions dependent on the incorrect data value are reexecuted. This is accomplished in a manner similar to the trace buffers described in [3]. Essentially, a recovery queue tracks in-order instruction execution. Each queue entry contains fields that explicitly link producers and consumers (identified by queue index). This linking is used to reexecute only data-dependent instructions. Threads that are larger than the trace queue size cannot be completely tracked. For these large threads, whatever portion of the thread exists beyond the end of the queue must be completely reexecuted. The recovery mechanism is quite complex and cannot be described in detail due to space limitations. Interested readers should see Akkary's work [3].

   Atlas uses the recovery queues for data value mispredictions, not data dependence misspeculations. It is possible to modify the speculative data cache to allow dependence misspeculations to be recovered with the recovery queues. This opportunity is left as future work.

   The fine-grained recovery mechanism is an important contributor to overall performance. Frequently, a value mispredicts and only a small number of instructions are reexecuted. Surprisingly, there are numerous cases where zero instructions are reexecuted on a mispredict! Code that executes dynamically is sometimes dead. For example, procedures save and then restore variables that might be used. Dynamic control flow of the procedure often leads to cases where the variable is not used. The dynamically wasteful saves and restores can create a thread input that is value predicted but then never used. Compiler optimizations for conventional machines can result in code that hurts performance in multithreaded architectures. For example, a compiler might boost a load instruction above conditional control flow to mask load latency. This technique can be problematic for a multithreaded architecture if the load instruction is exposed across threads. The load value can be mispredicted while thread control flow decides that the load should not be done. In this case, a live input was predicted and never used.

   An important observation is that recovery events typically fall into one of two categories. Either a small number of dynamic instructions is dependent on the mispredicted input (usually fewer than eight instructions) or many instructions are dependent (approximating the entire thread). This could be exploited by having multiple, smaller queues track dependencies known to have only a small number of affected instructions. Dependencies that permeate the entire thread require complete reexecution anyway, so the thread can just be restarted. Evaluation of alternative recovery mechanisms is left as future work. The current Atlas microarchitecture uses trace queues similar to those described by Akkary [3].

3.1.6 Write Buffers
It is possible that a STORE on a less speculative node can communicate with a more speculative LOAD and that the STORE has already been executed by the time the LOAD executes. In this case, it may be desirable to receive the data from the other node, rather than using a value prediction. In parallel with generating a value prediction, a speculative node searches the previous node's write buffer for a write to the same effective address. Only one node is searched, as LOAD-STORE communication predominantly occurs on adjacent processors (STORE data is usually consumed by the next most speculative node). Data from the other node's write buffer is not necessarily the correct data. The less speculative node could be performing many STOREs to the same location and only the last STORE is important. Once again, a speculation is needed to decide if the write buffer

Fig. 9. Atlas Multi-Adaptive value predictor.

predictor update can proceed in parallel. Once the thread has finished execution and the write buffer and Active Ins Queue are empty, the thread has finished execution. The
data is relevant to the value prediction. Atlas will usurp the   adjacent more speculative node then becomes the non-
value prediction only if the value prediction is not             speculative node and the processor is free for another
confident. The value predictor maintains a 2-bit confidence      thread allocation. Global thread control logic, together with
on all predictions.                                              the value/control predictor, assigns a new (now most
                                                                 speculative) thread to the free node.
3.1.7 Thread Retirement                                             Note that a node may not have finished executing its
Thread retirement begins the cycle a node becomes                thread at the time it begins retirement. The actions of
nonspeculative. The first activity to be performed is to         register verification, value predictor update, and write
verify register value predictions. Upon becoming nonspe-         buffer flush are often overlapped with the continued
culative, a bit mask is sent to the previous (used-to-be-        execution of the thread.
nonspeculative) node indicating which register values are
needed for verification. This node then pipelines the values     3.2 Value/Control Prediction
out on the node-to-node interconnect, where they are             In the Atlas architecture, one predictor is used for both
consumed and verified by the Active Ins Queue.                   control and value predictions. The performance of this
   After register verification, the correct values for all       predictor is critically important. The Atlas Multi-Adaptive
register and memory data value predictions on the newly          (AMA) correlated predictor was created specifically for
nonspeculative node are known. Recall that memory values         aggressive interthread value and control prediction [9]. The
are broadcast by the nonspeculative node as they are             structure is shown in Fig. 9.
produced and verified at that time. If any value prediction         The AMA predictor works as follows: The PC of the
errors have been detected, they must first be corrected using    instruction to be predicted is used to index a first level table
the fined-gained recovery mechanism. Once this is com-           which stores recent local value history for the instruction
plete, the thread begins two retirement tasks: global value      and a stride. Bits from the instruction address are XORed
prediction update and write buffer flush.                        with the last value to correlate into a second level table.
   The write buffer is flushed to the L2 cache by broad-         Another second level table is accessed in parallel. This
casting the data values out, in-order, to the L2 cache and all   table correlates either global control information or bits
speculative nodes. The value predictor is updated by             from the past seven local values to choose another value
traversing the Active Ins queue and sending updates to           prediction. The first level table keeps bits that dynami-
the global value predictor. Write buffer flush and value         cally adapt the correlation function to either the deep
74                                                                IEEE TRANSACTIONS ON COMPUTERS,   VOL. 50,   NO. 1,   JANUARY 2001

local or deep control function. All table entries contain         compilers have a long history of loop parallelization.
2-bit confidence counters. The highest confidence predic-         Unfortunately, SPECint95 applications are not loop
tion from Last_Value+Stride, Short Local Correlation, and         oriented. Procedure level parallelization is also a good
(Deep Control or Deep Value Correlation) is chosen as             target as, often, the code following a procedure can be
the value prediction. To reduce the size of the second            executed in parallel with the procedure call. Unfortunately,
level table, a combination of intelligent placement,              procedures tend to be load imbalanced and calling patterns
replacement, and table update policies are used [9].              can vary widely between applications. Thread partitioning
   For control predictions, deep local history and deep           at fixed boundaries provides perfect load balance, but
control history are equivalent. In this case, the structure       makes no attempt to solve data or control flow.
behaves very similarly to the Jacobson predictor [16], but           Comparisons were done between fixed interval, loop,
with the addition of associativity to reduce aliasing effects     and procedure partitioning. Fixed interval was found to be
in the second level tables.                                       the most effective due to perfect load balance. However,
                                                                  fixed interval partitioning makes no attempt to favor
3.3 Dynamic Partitioning by MEM-Slicing                           threads with predictable control or data flow. A new
Partitioning refers to the problem of dividing the sequential     partitioning algorithm has been invented that improves
program into threads. Partitioning is known NP-hard [30]          fixed-interval partitioning by being more intelligent about
and solutions are generally heuristic-based. An algorithm         thread boundaries. Rather than ending a thread at a fixed
must balance the often-conflicting needs of control flow,         number of dynamic instructions, the MEM-slicing algo-
data flow, and load balance. Ideally, threads are control         rithm executes a minimum number of instructions and then
predictable, contain no short interthread data dependencies,      searches for a more favorable place to stop. While
and are all approximately the same size. While some work          experimenting with different thread stop policies, it was
has been done in static thread partitioning for speculative       discovered that memory instructions form excellent thread
multithreaded processors [39], very little work exists on         boundaries. The partitioning algorithm was developed
dynamic partitioning algorithms. A new dynamic partition-         around this observation. The dynamic partitioning algo-
ing algorithm has been invented for this architecture [7].        rithm starts by executing a minimum number of instruc-
   The size of a thread is defined as the number of dynamic       tions. The next thread begins at the next encountered
instructions in the thread. It is very important that all         memory instruction. This new dynamic partitioning algo-
threads are close to the same size to handle load balance.        rithm is called MEM-slicing, as memory instructions slice
Otherwise, processors with small threads will idle while          thread boundaries.
processors with large threads complete. Thread coverage is           Memory instructions are very common in integer
a related issue. It is important that the entire program be       programs. After the minimum number of instructions has
divided into threads. If only some portions are threaded,         been executed, it is typically not long until a memory
execution into the unthreaded code will cause severe load         instruction is encountered. This has the nice effect of
balance problems.                                                 making all threads relatively the same size.
   Thread data predictability refers to the number and               Why do memory instructions form better threads? It is
predictability of live data inputs into a speculative thread.     hypothesized that this is due to the nature of the load-store
For fully parallel execution, there should be no live data        architecture and the compiler that are being used. The
inputs into the thread. This is often not possible, in which      compiler strives to allocate variables with short reuse
case, it is desirable to have the minimum number of short,        distances to registers and forces longer-lived variables into
unpredictable inputs. If a data input is predictable, it can be   memory. The resultant code often starts data-dependency
removed with data value prediction. Short, unpredictable          chains with a LOAD, does some computation, and ends
interthread data dependencies should be avoided as these          with a STORE. Breaking threads on memory instructions
will cause the consuming thread to wait for the producing         tends to force short-lived register-based communication
thread, effectively sequentializing thread execution.             within the thread and longer-lived memory-based commu-
   Control predictability refers to the relative difficulty in    nication between threads. Overall, this behavior tends to
determining which thread follows the current thread.              make MEM-sliced threads more control and data predict-
Depending on how threads are partitioned, there can               able than fixed-interval sliced threads. Work is ongoing to
potentially be many possible following threads. Threads           provide a more quantitative explanation of the reasons why
can also be partitioned to take advantage of control              MEM-slicing works.
convergence by encapsulating hard-to-predict branches
within a thread. Large amounts of work may be discarded
as the result of incorrect thread control predictions.            4    PERFORMANCE EVALUATION
   The ideal thread example is a parallel loop iteration.         A detailed execution-driven simulator has been constructed
Each loop iteration is typically the same size as all other       on top of the SimpleScalar simulator, which implements a
loop iterations, has no live inputs, and control proceeds         MIPS-like ISA [1]. The simulator accurately models the
regularly from one iteration to the next. Unfortunately,          cycle-by-cycle behavior of the processor. The simulator
many important applications (and benchmarks) do not have          continues to execute threads under misspeculation, fully
parallel loops.                                                   modeling secondary effects.
   Current dynamic partitioning algorithms [3], [19], [29]           Table 1 summarizes important simulator configurations.
attack either loops, procedures, or cache line boundaries.           The SPEC benchmarks are compiled with gcc 2.6.1 using
Loops are a natural first place to look as parallelizing          optimizations ª-O2 ±unroll-loops.º Each benchmark is

TABLE 1
Simulator Configuration

simulated using the train input set for the first 200 million instructions or until the program completes, whichever comes first. Table 2 summarizes the SPEC benchmarks, including the baseline performance for a single node.

4.1 Execution Bandwidth Distribution
Fig. 10 shows simulation results. The graph shows speedup on the y-axis (eight processors are used in the experiment). The bars show the distribution of cycles for each benchmark. The black section is achieved speedup (perl achieves a speedup of 3.8, for example). The other shaded regions show where lost cycles are going, divided into five regions:

. losses due to control mispredicts (white region),
. losses due to load balance and coverage (lightly shaded region),
. losses due to data value mispredictions (darkly shaded region),
. losses due to nonspeculative data cache misses and speculative write buffer full stalls (vertically graded region), and
. losses due to thread startup and retirement overhead (horizontally graded region).

The data in this and subsequent similar graphs is calculated as follows: First, the simulator is set for perfect control prediction, perfect value prediction, perfect caches, and zero overhead. The only source of performance loss in this case is load balance and coverage, so the difference between the peak speedup (8) and the result of the ideal simulation gives the fraction of execution bandwidth lost to load balance and coverage. Similar simulations are run with the various penalties selectively enabled, allowing the execution bandwidth lost to each to be measured. Finally, a simulation is run without any ideal assumptions, which gives the actual performance. All results are presented as speedup, calculated as the execution time of one processor over the execution time of eight processors. Each processor is a single-issue, in-order design.
   The first important result is that it is possible to achieve good speedups on unmodified sequential binaries using aggressive speculation techniques. Without dynamic speculation, a chip-multiprocessor will not achieve any performance gain on these benchmarks. These techniques make a chip-multiprocessor an attractive direction for future microprocessors.
   Slightly more than half the peak execution bandwidth is lost to control mispredicts, data mispredicts, and overhead. Table 3 further summarizes the simulator results. The average size of all the threads in the benchmarks is 21 dynamic instructions. "Done Speculatively" is the fraction of work performed on speculative processors. Control and data miss rates are the average misprediction rates over all threads in the benchmark. Control miss rates are best for m88ksim and vortex and worst for go and cc1. These results are qualitatively similar to branch predictor performance on these benchmarks.
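The accounting methodology above can be sketched numerically. The peak of 8 and the definition of speedup come from the text; the cycle counts below are invented for illustration and are not measured Atlas data.

```python
# Sketch of the execution-bandwidth accounting of Section 4.1.
# Cycle counts are hypothetical; only the 8-node peak and the
# speedup definition follow the text.

PEAK = 8.0  # eight single-issue, in-order nodes

def speedup(uni_cycles, cmp_cycles):
    """Speedup = uniprocessor execution time / CMP execution time."""
    return uni_cycles / cmp_cycles

def loss(uni_cycles, cycles_ideal, cycles_with_penalty):
    """Fraction of peak bandwidth lost to one penalty source:
    the drop in speedup when that penalty is enabled, as a
    share of the 8x peak."""
    before = speedup(uni_cycles, cycles_ideal)
    after = speedup(uni_cycles, cycles_with_penalty)
    return (before - after) / PEAK

# Hypothetical run: ideal CMP simulation, then the same run
# with control mispredictions enabled.
uni = 1_000_000_000
ideal = 140_000_000       # perfect prediction, perfect caches, zero overhead
with_ctrl = 200_000_000   # control mispredictions enabled

balance_loss = (PEAK - speedup(uni, ideal)) / PEAK  # load balance + coverage
ctrl_loss = loss(uni, ideal, with_ctrl)

print(f"load-balance/coverage loss: {balance_loss:.2%}")
print(f"control-misprediction loss: {ctrl_loss:.2%}")
```

Summing the per-penalty losses with the achieved speedup accounts for the full 8x peak, which is how the stacked bars of Fig. 10 are built.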

                                                         TABLE 2

Fig. 10. Performance and execution bandwidth distribution.
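The MEM-slicing partitioner of Section 3.3 is simple enough to sketch directly: execute at least a minimum number of instructions, then begin the next thread at the next memory instruction. This is an illustrative reconstruction, not the Atlas hardware; the opcode-string trace and the default skip distance are assumptions.

```python
# Sketch of MEM-slicing dynamic partitioning (Section 3.3).
# A dynamic instruction trace is represented as a list of
# opcode strings; LOAD and STORE are the memory instructions
# that slice thread boundaries.

def mem_slice(trace, min_skip=16):
    """Partition a dynamic instruction trace into threads.
    After min_skip instructions, the next memory instruction
    begins a new thread."""
    threads, current = [], []
    for op in trace:
        if len(current) >= min_skip and op in ("LOAD", "STORE"):
            threads.append(current)  # close the current thread
            current = []             # memory op starts the next one
        current.append(op)
    if current:
        threads.append(current)
    return threads

# Hypothetical trace: a LOAD arrives before min_skip is reached,
# so only the later STORE slices a boundary.
trace = ["ADD"] * 3 + ["LOAD"] + ["ADD"] * 14 + ["STORE"] + ["ADD"] * 5
threads = mem_slice(trace, min_skip=4)
# thread sizes: [18, 6]; the second thread begins with the STORE
```

Because memory instructions are frequent in integer code, the cut usually lands soon after the minimum skip, which is what keeps MEM-sliced threads close to the same size.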

4.2 Scalability
Fig. 11 shows the results of varying the number of nodes in the system from two to 16.
   The m88ksim benchmark performs so well because it is the most control and data predictable benchmark, with miss rates less than 3 percent. The other benchmarks show slow but steady improvement until 12 nodes, after which performance diminishes. There is a direct relationship between control and data predictability and speculation depth. With more nodes, deeper (further into the future) speculations are required to keep the nodes busy, and control and data become less predictable as speculation depth increases. Diminishing returns are seen because many of the speculations made deep into the future are incorrect.
   Fig. 12 shows the efficiency of the system, calculated as the achieved speedup over the number of nodes. Efficiency drops steadily as nodes are added. This result does not argue against building a multiprocessor with many nodes. Remember that these results are for a single, sequential program. A multiprocessor has the ability to run multiple processes, support multiple users, and execute explicitly parallel programs. All of these factors should be considered when choosing the number of nodes to include in a system. Also, efficiency will improve as researchers gain a deeper understanding of thread speculation, data value prediction, and dynamic partitioning.

4.3 Dependency Predictor
Fig. 13 shows the impact of the dependency predictor on performance. A simulation is conducted with ideal dependency prediction (white bars): the processor is given perfect knowledge of whether a LOAD is live or dead. Of course, this is not possible, but it establishes the ideal predictor potential. Simulations are also conducted by predicting that a LOAD is always dead; this is the policy used by the original multiscalar work [33]. Finally, simulations are conducted assuming memory dependencies are always live (a value prediction is performed for all LOADs). The graph shows that the live-biased predictor presented in Section 3.1.1 raises performance to nearly the maximum achievable level.
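A live-biased dependency predictor of the kind compared in Fig. 13 could be realized as a table of saturating counters indexed by LOAD PC. This is a plausible sketch, not the documented Atlas design: the table size, counter width, and initial bias are all assumptions.

```python
# Hypothetical live/dead LOAD dependency predictor: a table of
# 2-bit saturating counters indexed by LOAD PC, initialized
# toward "live" (the safe bias, since a live LOAD wrongly
# treated as dead forces a recovery).

class DependencyPredictor:
    def __init__(self, entries=1024, initial=3):
        assert entries & (entries - 1) == 0  # power-of-two table
        self.table = [initial] * entries
        self.mask = entries - 1

    def predict_live(self, pc):
        """Live: the LOAD's value is expected from a less
        speculative thread, so a value prediction is used."""
        return self.table[pc & self.mask] >= 2

    def update(self, pc, was_live):
        """Train the counter with the resolved outcome."""
        i = pc & self.mask
        if was_live:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

p = DependencyPredictor()
assert p.predict_live(0x40001C)      # biased live at start
p.update(0x40001C, was_live=False)
p.update(0x40001C, was_live=False)
assert not p.predict_live(0x40001C)  # now predicted dead
```

The always-dead and always-live policies of the figure correspond to the degenerate cases of never and always consulting the value predictor; the counter table sits between them.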

                                                                 TABLE 3
                                                         Atlas Simulation Results

Fig. 11. Speedup of the Atlas architecture versus number of nodes.

Fig. 12. Speedup efficiency of Atlas architecture versus number of nodes.
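The efficiency metric of Fig. 12 is simply achieved speedup divided by node count. A minimal sketch, with speedup values assumed for illustration (only the 3.4-on-8 figure comes from the paper):

```python
# Efficiency as defined for Fig. 12: achieved speedup / nodes.
# Speedup values below are illustrative except 8 -> 3.4,
# which is the paper's reported SPECint95 average.

def efficiency(speedup, nodes):
    return speedup / nodes

points = {2: 1.6, 4: 2.6, 8: 3.4, 16: 4.0}  # nodes -> assumed speedup
for n, s in points.items():
    print(f"{n:2d} nodes: efficiency {efficiency(s, n):.2f}")
```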

Fig. 13. Impact of dependency predictor.

Fig. 14. Impact of value predictor.

4.4 Value Prediction
Fig. 14 shows the impact of value prediction on overall performance. In this experiment, the processors are given perfect control prediction in order to isolate the influence of value prediction. Three different value predictors are then simulated: the AMA predictor, the Sazeides/Smith predictor [32], and a simple Last Value Stride predictor [4], [14].
   Prediction accuracy is critical to the success of this architectural approach. A substantial improvement in overall performance is achieved as the predictor becomes more intelligent. Note that this result is not universally true, but may be specific to this architecture, which was designed from the ground up with the philosophy that value prediction works and is a useful concept. No static or dynamic data synchronization is performed; rather, the architecture relies on the accuracy of data value prediction and good recovery techniques. Other architectures, which retrofit value prediction onto an established architecture that supports explicit synchronization, will likely not see such a strong connection between predictor performance and overall performance.

4.5 Dynamic Partitioning
Fig. 15 compares dynamic partitioning algorithms. The fixed-interval policy represents drawing thread boundaries at fixed intervals in the dynamic instruction stream (every 16 instructions). This policy is similar to that used by the trace processor [29], as threads are cut on trace cache line boundaries. The second policy considered is to partition at procedure boundaries and attempt to execute the code after procedures in parallel with the procedure call. This is the primary source of parallelism exploited by the DMT processor [3]. Finally, experiments were conducted by forming threads around loop iterations, as is done in [19]. The MEM-slice algorithm improves on fixed-interval partitioning by selectively slicing thread boundaries at memory operations. This has the effect of increasing control prediction accuracy and localizing communication within threads [7].
   A fundamental trade-off exists between overhead and predictability. As threads are made smaller, they become more predictable, but at the cost of increased overhead. Large threads reduce overhead costs, but increase the penalties associated with data and control mispredicts. An optimal size exists somewhere between large and small threads; for the simulated Atlas chip-multiprocessor, threads 16-32 instructions long provide the best performance.
   The MEM-slice algorithm was implemented in the Atlas chip-multiprocessor simulator, and the minimum instruction skip distance was varied from 4 to 64. The characteristic

Fig. 15. Dynamic partitioning algorithms.

Fig. 16. Distribution of execution bandwidth versus thread size.

shape is that performance climbs as threads grow from very small to around 16 instructions, then drops off with larger threads.
   Fig. 16 examines the cause of this behavior. Very small threads are dominated by load balance and overhead problems. As threads become bigger, load balance and overhead issues become less important. Control flow is just the opposite: with small threads, it is easier to make control predictions than with large threads. As shown in the figure, these two forces oppose each other, and optimal performance comes in the middle (the apex of the "achieved" triangle).

4.6 Recovery Mechanism
Fig. 17 shows the results of simulating two different data value misprediction recovery mechanisms. The fine-grained approach is that used by the Atlas chip-multiprocessor. With this technique, only instructions dependent on the misprediction are reexecuted. The coarse-grained approach is to squash the entire thread and reexecute it on any misspeculation. Fine-grained recovery is very important to the successful application of value prediction in this architecture. Hardware to support fast recovery will be challenging to design; however, without this feature, the value of speculative multithreading is greatly diminished.

Fig. 17. Impact of recovery mechanism.

4.7 Startup Latency
Fig. 18 shows the results of varying the pipeline depth of the processors. A significant degradation in performance results as the pipelines get deeper. This is a direct function of the size of the threads: small threads do not have enough work to amortize this overhead. As threads are made larger, overhead is reduced. This comes at the cost of predictability, as larger threads are less control and data predictable. The optimal thread size should be determined after overhead costs are known.

Fig. 18. Pipeline startup latency.

4.8 L1 Miss Latency
Fig. 19 shows the impact of increasing the miss penalty to the L1 data cache. As expected, performance degrades with increasing latency. The degradation, however, is not severe. Thread-speculative architectures provide natural latency tolerance, as future threads request data before they become nonspeculative, essentially performing data prefetching.

5 SUMMARY AND CONCLUSIONS
A single-chip multiprocessor is an excellent architecture for future technologies from a hardware point of view, as the demand for long interconnect and design complexity is reduced. The traditional drawbacks of this approach are poor performance on existing binaries and on codes that cannot be automatically parallelized. This paper demonstrates an architecture that adds a moderate amount of hardware to a baseline multiprocessor and, in so doing, eliminates the negatives of a single-chip multiprocessor. It is possible to expect good parallel performance from unmodified sequential binaries using aggressive speculation techniques. This paper shows that thread speculation,

Fig. 19. L1 miss latency.

data value prediction, and dynamic partitioning can be                   [5]    L. Codrescu and S. Wills, ªProfiling for Input Predictable
                                                                                Threads,º Proc. Int'l Conf. Computer Design (ICCD-98), pp. 558-
combined to support parallel execution of dependent                             565, Oct. 1998.
threads. This paper demonstrates and evaluates mechan-                   [6]    L. Codrescu and S. Wills, ªExploring Microprocessor Architec-
isms to perform these activities in the context of a chip-                      tures for Gigascale Integration,º Proc. 20th Advanced Research in
                                                                                VLSI, Mar. 1999.
multiprocessor.                                                          [7]    L. Codrescu and S. Wills, ªOn Dynamic Speculative Thread
   The most important contributors to performance are fast                      Partitioning and the MEM-Slicing Algorithm,º Proc. Int'l conf.
                                                                                Parallel Architectures and Compilation Techniques (PACT99), Oct.
value misprediction recovery, accurate control/value pre-
diction, and proper thread partitioning. Other factors, such             [8]    L. Codrescu and S. Wills, ªArchitecture of the Atlas Chip-
as dependency prediction, updates/repredictions, and L1                         Multiprocessor: Dynamically Parallelizing Irregular Applica-
                                                                                tions,º Proc. Int'l Conf. Computer Design (ICCD99), Oct. 1999.
miss penalty, are of secondary importance. Thread size                   [9]    L. Codrescu and S. Wills, ªThe AMA Correlated Value Predictor,º
should be carefully adjusted to balance startup latency and                     Pica Group Technical Report 10-98, 1998, available from http://
                                                                         [10]   M. Dorojevets and V. Oklobdzija, ªMultithreaded Decoupled
   An important point is that the Atlas chip-multiprocessor                     Architecture,º Int'l J. High Speed Computing, vol. 7, no. 3, pp. 465-
is an enhanced shared memory multiprocessor. It retains                         480, 1995.
the advantages of that design. Given explicit parallelism
from the user, compiler, or operating system, this archi-
tecture provides a high-performance processor with fast
clocks and reduced design complexity. Aggressive thread
and value speculation can be added to automatically
parallelize irregular, sequential binary applications. Current
results suggest speedups averaging 3.4 on 8 processors.
These results will improve as partitioning and speculation
techniques mature.

REFERENCES
[1]   21164 Alpha Microprocessor Hardware Reference Manual, Compaq
      Computer Corp., Dec. 1998.
[2]   D. Burger and T.M. Austin, "The SimpleScalar Tool Set,
      Version 2.0," Technical Report #1342, Computer Science Dept.,
      Univ. of Wisconsin-Madison, June 1997.
[3]   H. Akkary, "Dynamic Multithreaded Processor," PhD thesis,
      Portland State Univ., June 1998.
[4]   T. Chen and J.-L. Baer, "Effective Hardware-Based Data Prefetch-
      ing for High-Performance Processors," IEEE Trans. Computers,
      vol. 44, no. 5, pp. 609-623, May 1995.
[11]  P.K. Dubey, K. O'Brien, and C. Barton, "Single-Program Spec-
      ulative Multithreading (SPSM) Architecture: Compiler-Assisted
      Fine-Grained Multithreading," Proc. Int'l Conf. Parallel Architecture
      and Compilation Techniques, pp. 109-121, June 1995.
[12]  M. Franklin, "The Multiscalar Architecture," PhD thesis, Univ. of
      Wisconsin-Madison, 1993.
[13]  C. Fu, M.D. Jennings, S.Y. Larin, and T.M. Conte, "Value
      Speculation Scheduling for High Performance Processors," Proc.
      Eighth Int'l Conf. Architectural Support for Programming Languages
      and Operating Systems (ASPLOS-VIII), Oct. 1998.
[14]  J. Gonzalez and A. Gonzalez, "Speculative Execution via Address
      Prediction and Data Prefetching," Proc. 11th Int'l Conf. Super-
      computing, pp. 196-203, July 1997.
[15]  S. Gopal, T.N. Vijaykumar, J. Smith, and G. Sohi, "Speculative
      Versioning Cache," Proc. Fourth Int'l Symp. High-Performance
      Computer Architecture (HPCA-4), Feb. 1998.
[16]  Q. Jacobson, E. Rotenberg, and J.E. Smith, "Path-Based Next Trace
      Prediction," Proc. Micro-30, Dec. 1997.
[17]  V. Krishnan and J. Torrellas, "Executing Sequential Binaries on a
      Multithreaded Architecture with Speculation Support," Proc.
      Fourth Int'l Symp. High-Performance Computer Architecture
      (HPCA-4), Feb. 1998.
[18]  M.H. Lipasti, C.B. Wilkerson, and J.P. Shen, "Value Locality and
      Data Speculation," Proc. Seventh Int'l Conf. Architectural Support for
      Programming Languages and Operating Systems (ASPLOS-VII), pp. 138-
      147, Oct. 1996.
[19] P. Marcuello and A. Gonzalez, "Control and Data Dependence
     Speculation in Multithreaded Processors," Proc. Fourth Int'l Symp.
     High-Performance Computer Architecture (HPCA-4), Feb. 1998.
[20] S. McFarling, "Combining Branch Predictors," Technical Report
     DEC WRL TN-36, Digital Western Research Lab, June 1993.
[21] J. Meindl, "Gigascale Integration: Is the Sky the Limit?" IEEE
     Circuits and Devices, Nov. 1996.
[22] A.I. Moshovos, S.E. Breach, T.N. Vijaykumar, and G.S. Sohi,
     "Dynamic Speculation and Synchronization of Data Depen-
     dences," Proc. 24th Int'l Symp. Computer Architecture (ISCA-24),
     June 1997.
[23] R. Nair, "Dynamic Path-Based Branch Correlation," Proc.
     Micro-28, pp. 15-23, Dec. 1995.
[24] "The National Technology Roadmap for Semiconductors, Tech-
     nology Needs," Semiconductor Industry Assoc., 1997.
[25] K. Olukotun, B.A. Nayfeh, L. Hammond, K. Wilson, and K.
     Chung, "The Case for a Single-Chip Multiprocessor," Proc. Seventh
     Int'l Conf. Architectural Support for Programming Languages and
     Operating Systems (ASPLOS-7), Oct. 1996.
[26] J. Oplinger, D. Heine, S.-W. Liao, B.A. Nayfeh, M.S. Lam, and K.
     Olukotun, "Software and Hardware for Exploiting Speculative
     Parallelism with a Multiprocessor," Technical Report CSL-TR-97-
     715, Computer Systems Laboratory, Stanford Univ., May 1997.
[27] S. Palacharla, N.P. Jouppi, and J.E. Smith, "Quantifying the
     Complexity of Superscalar Processors," Proc. Int'l Symp. Computer
     Architecture (ISCA '97), 1997.
[28] S. Pan, K. So, and J. Rahmeh, "Improving the Accuracy of
     Dynamic Branch Prediction Using Branch Correlation," Proc. Fifth
     Int'l Conf. Architectural Support for Programming Languages and
     Operating Systems (ASPLOS-5), pp. 76-84, Oct. 1992.
[29] E. Rotenberg et al., "Trace Processors," Proc. Micro-30, pp. 68-74,
     Dec. 1997.
[30] V. Sarkar and J. Hennessy, "Partitioning Parallel Programs for
     Macro-Dataflow," Conf. Proc. 1986 ACM Conf. Lisp and Functional
     Programming, pp. 192-201, 1986.
[31] Y. Sazeides and J.E. Smith, "The Predictability of Data Values,"
     Proc. Micro-30, Dec. 1997.
[32] Y. Sazeides and J. Smith, "Implementations of Context Predic-
     tors," technical report, Computer Science Dept., Univ. of
     Wisconsin-Madison, Dec. 1997.
[33] G.S. Sohi, S.E. Breach, and T.N. Vijaykumar, "Multiscalar
     Processors," Proc. 22nd Int'l Symp. Computer Architecture (ISCA-22),
     pp. 414-425, June 1995.
[34] J.G. Steffan and T.C. Mowry, "The Potential for Using Thread-
     Level Data Speculation to Facilitate Automatic Parallelization,"
     Proc. Fourth Int'l Symp. High-Performance Computer Architecture
     (HPCA-4), Feb. 1998.
[35] J.-Y. Tsai and P.-C. Yew, "The Superthreaded Architecture:
     Thread Pipelining with Run-Time Data Dependence Checking
     and Control Speculation," Proc. Int'l Conf. Parallel Architectures and
     Compilation Techniques (PACT), Oct. 1996.
[36] J.E. Thornton, Design of a Computer: The Control Data 6600. Scott,
     Foresman, 1970.
[37] R. Tomasulo, "An Efficient Algorithm for Exploiting Multiple
     Arithmetic Units," IBM J. Research and Development, vol. 11, pp. 25-
     33, Jan. 1967.
[38] D.M. Tullsen, S.J. Eggers, and H.M. Levy, "Simultaneous Multi-
     threading: Maximizing On-Chip Parallelism," Proc. 22nd Int'l Symp.
     Computer Architecture (ISCA-22), pp. 392-403, June 1995.
[39] T.N. Vijaykumar, "Compiling for the Multiscalar Architecture,"
     PhD thesis, Univ. of Wisconsin-Madison, Jan. 1998.
[40] E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J.
     Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, and A.
     Agarwal, "Baring It All to Software: Raw Machines," Computer,
     pp. 86-93, Sept. 1997.
[41] A. Wolfe and J.P. Shen, "A Variable Instruction Stream Extension
     to the VLIW Architecture," Proc. Fourth Int'l Conf. Architectural
     Support for Programming Languages and Operating Systems
     (ASPLOS-4), pp. 2-14, Apr. 1991.
[42] T. Yeh and Y. Patt, "Alternative Implementations of Two-Level
     Adaptive Branch Prediction," Proc. 19th Int'l Symp. Computer
     Architecture (ISCA-19), pp. 124-134, 1992.

Lucian Codrescu received his BS degree in computer engineering from
Virginia Tech in 1993. He received his PhD degree in electrical and
computer engineering (2000) from the Georgia Institute of Technology,
Atlanta. He is a computer architect at the Motorola/Lucent Starcore
Design Center in Atlanta, Georgia. From 1993-1995, he worked as a
software engineer with General Electric in the area of embedded
operating systems. His doctoral thesis addresses techniques for parallel
execution of sequential programs on a multithreaded architecture,
including mem-slicing partitioning and data partitioning. His research
interests include high performance architectures, architectures for
gigascale technologies, and VLSI. He is a member of the IEEE and the
IEEE Computer Society.

D. Scott Wills received his BS degree in physics from Georgia Tech in
1983 and his SM, EE, and ScD degrees in electrical engineering and
computer science from the Massachusetts Institute of Technology in
1985, 1987, and 1990, respectively. He is an associate professor of
electrical and computer engineering at the Georgia Institute of
Technology. His research interests include short-wire VLSI architec-
tures, high-throughput portable processing systems, architectural
modeling for gigascale (GSI) technology, and high-efficiency image
processors. He is a senior member of the IEEE, a member of the IEEE
Computer Society, and an associate editor of the IEEE Transactions on
Computers.

James Meindl received his bachelor's, master's, and doctor's degrees
in electrical engineering from the Carnegie Institute of Technology
(Carnegie Mellon University). He is the director of the Joseph M. Pettit
Microelectronics Research Center and has been the Joseph M. Pettit
Chair Professor of Microelectronics at the Georgia Institute of
Technology since 1993. He was senior vice president for academic
affairs and provost of Rensselaer Polytechnic Institute from 1986 to
1993. He was with Stanford University from 1967 to 1986 as the John M.
Fluke Professor of Electrical Engineering, associate dean for research in
the School of Engineering, director of the Center for Integrated
Systems, director of the Electronics Laboratories, and founding director
of the Integrated Circuits Laboratory. Dr. Meindl is a life fellow of the
IEEE and the American Association for the Advancement of Science
and a member of the American Academy of Arts and Sciences and the
National Academy of Engineering and its academic advisory board. He
received a Benjamin Garver Lamme Medal from ASEE, an IEEE
Education Medal, an IEEE Solid-State Circuits Medal, and an IEEE
Beatrice K. Winner Award. He has also been awarded the IEEE Electron
Devices Society's J.J. Ebers Award, the Hamerschlag Distinguished
Alumnus Award from Carnegie Mellon University, the 1999 SIA
University Research Award, and the IEEE Third Millennium Medal.