   Alternatives to Eager Hardware Cache
   Coherence on Large-Scale CMPs

              Marcelo Cintra

            University of Edinburgh
        http://www.homepages.inf.ed.ac.uk/mc
 Acknowledgments
 Research Team
  –   Dr. Tom Ashby (now at IMEC)
  –   Christian Fensch
  –   Pedro Jimenez
  –   Constantino Ribeiro
  –   Marius Kuhn
 Funding
  – UK – EPSRC
  – EC – HiPEAC (NE)
  – EC – SARC (IP)

                        ORNL - March 2007   2
 Why Cache Coherence?
 When is it necessary?
   – Shared-memory parallel applications (including OS)
   – Thread migration under multiprogramming
   – Implementing synchronization (e.g., spin-locks and Dekker's
     algorithm)
 When is it not so important?
   – Sequential applications with no migration
   – Applications with little locality or data reuse
   – Message-passing parallel applications


                          ORNL - March 2007                        3
 Hardware Cache Coherence?
 What is good about it?
   – Portability and transparency: no need to compile applications
     for different hierarchies
   – High performance: high selectivity of invalidations, low
     traffic, low latency operations
 What is not so good about it?
   – Extra hardware: for coherence state, for controllers, and for
     suitable interconnects
   – Overkill:
       When coherence is not necessary (see previous slide)
       When software can do a reasonable, or even better, job

                             ORNL - March 2007                       4
 Hardware Cache Coherence?
 Extra hardware:
   – Shared bus fabrics and crossbars will take up a large amount
     of chip real estate (easily >50% overhead for 32 processors
     or more by some accounts)
   – Protocols and controllers for distributed directories and
     multi-segment buses are complex and difficult to verify
 Overkill:
   – Communication in applications does not occur at all times,
     but hardware assumes so
   – Hardware is always there even when not needed



                          ORNL - March 2007                         5
Current CMPs
 Example: IBM Power4
  – 2 8-issue cores with
    private 32KB L1
  – Crossbar to 3 L2 banks
    1.5MB total
  – 174M transistors in
    0.22micron on 412mm2
 and Power5
  – 2 8-issue 2-way SMT
    cores with private 32KB
    L1
  – Crossbar to 3 L2 banks
    1.9MB total
  – 276M transistors in
    0.13micron on 389mm2
                      ORNL - March 2007   6
Current CMPs
 Example: Sun T1
  – 8 1-issue 4-way MT
    cores with private 8KB
    L1
  – Crossbar to 4 L2 banks
    3MB total
  – ?M transistors in
    0.09micron on ?mm2
 and T2
  – 8 1-issue 8-way MT
    cores
  – Crossbar to 8 L2 banks
    4MB total
  – ?M transistors in
    0.065micron on ?mm2
                      ORNL - March 2007   7
 Where are we heading?
 Moore’s Law applied to number of cores? Not yet.
   – Multithreading delays the need for more cores, but unlikely
     to go beyond 8-way
   – An ever larger shared on-chip cache (i.e., with a directory) is the
     easy solution; but how large can/should we go?
 But what if we want it to be so?
   – Must find scalable and area-efficient solutions to cache
     coherence problem
   – Must find ways to make the most out of the off-chip data
     bandwidth
   – Must (ideally) make architecture that is good for various
     workload types
  Is it time to re-think traditional hardware cache coherence?
                           ORNL - March 2007                         8
 Our Tiled Architecture Approach
 Simple single- or dual-issue, possibly multithreaded,
  cores
 Point-to-point interconnect
 Cache coherence mostly handled by OS and
  compiler/programmer with minimal hardware support
 Cache coherence can be turned off with little hardware
  waste for certain applications
 Reconfigurable NUCA caches maximize the usefulness
  of the on-chip storage


                       ORNL - March 2007                   9
 Outline
 Motivation
 Page level placement and remote cache access
   – Design
   – Evaluation
   – Related Work and Conclusions
 Software self invalidations and writebacks
   – Design
   – Evaluation
   – Related Work and Conclusions
 Current and Future Directions
                        ORNL - March 2007        10
 Baseline Tiled Architecture
 Single issue RISC processors
 Point-to-point interconnects
 Caches are only for local data
  storage and there is no
  hardware cache coherence
 Shown to be good for ILP-
  and DLP-based applications
  (e.g., MIT’s Raw)
 But, no support for shared-
  memory parallel applications



                        ORNL - March 2007   11
 Key Ideas
 There is only one copy of each piece of data across all L1 caches
  (thus, no coherence problem)
   – Later we will relax this constraint a little bit
 Data is allocated to L1 caches on a page level
   – OS controls placement
   – New hardware tables ensure that location of data can be
     easily determined
   – Placement can change at quiescent points (e.g., barriers)
 Hardware supports remote cache accesses

 Should work well if most data is placed locally and the
      distances/latency on chip are not too high
                             ORNL - March 2007                   12
 Mechanism Overview
 Data placement
  – There is a spectrum of options
       Static per page        Our proposal        Dynamic per line
       -                  Hardware Complexity                    +
       -                  Mapping Flexibility                    +

  – Dynamic per line is what previous proposals for NUCA-like
    and adaptive shared/private cache schemes do
  – Static per page is very restrictive and compiler/OS has to get
    the mapping absolutely right
  – Our proposal: dynamic OS controlled mapping per page

                          ORNL - March 2007                          13
 Data Placement
 New level of indirection allows OS to map any given
  page to any L1 cache
 OS extensions
   – Page table is extended to keep owner tile ID alongside (per
     page) virtual-to-physical address translation (note that this
     exposes the chip structure to the OS)
   – Default mapping policy is first-touch
 Hardware extensions
   – TLB-like table that caches the most recent translations and is
     used to locate the line upon loads and stores (MAP)
   – Instead of a directory coherence controller, a device to
     handle remote cache access requests (RAC)
                           ORNL - March 2007                          14
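
A minimal sketch of the OS-side extension described above, under the assumption of a hypothetical page-table-entry layout; the names (struct pte, NO_OWNER, place_page) are illustrative, not taken from the actual design:

    /* OS-side sketch: a page-table entry that keeps the owner tile ID
     * alongside the virtual-to-physical translation, plus a first-touch
     * placement policy applied on the first fault to a page.
     * All names here are illustrative, not the real implementation.   */
    #include <stdint.h>

    #define NO_OWNER 0xFF

    struct pte {
        uint64_t pfn;      /* physical frame number                    */
        uint8_t  owner;    /* ID of the tile that homes this page      */
        uint8_t  flags;    /* valid, dirty, read-only, ...             */
    };

    /* Called from the page-fault handler: the first tile to touch the
     * page becomes its home; later faults reuse the stored owner.
     * The returned tile ID is what gets cached in the MAP.            */
    uint8_t place_page(struct pte *e, uint8_t faulting_tile)
    {
        if (e->owner == NO_OWNER)
            e->owner = faulting_tile;          /* first-touch mapping  */
        return e->owner;
    }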
 Data Placement
 Why not simply divide the physical address space
  across tiles and use the virtual-to-physical mapping to
  place pages?
   – More restrictive
   – Fragmentation in main memory
   – More difficult to change mapping
   [Diagram: two placement alternatives, each showing four L1 caches
    alongside the physical (P addr.) and virtual (V addr.) address spaces]
                           ORNL - March 2007                            15
 Data Placement
 Difficulty:
   – L1 caches are often virtually-indexed to speed up local
     processor access
   – Remote cache requests with physical addresses would require
     some inverse translation at the remote tile
 Solution:
   – Ship virtual addresses over the on-chip interconnect
    – TLBs only keep virtual-to-physical translations for the data
      allocated locally




                           ORNL - March 2007                         16
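
The choice to ship virtual addresses can be pictured as a small request format carried over the network to the home tile's RAC; the fields below are assumptions for illustration, not the actual packet layout:

    /* Illustrative layout of a remote cache request.  Because it carries
     * the virtual address, the home tile can index its virtually-indexed
     * L1 and its own TLB directly, with no inverse translation step.
     * Field names and sizes are assumptions, not the real format.       */
    #include <stdint.h>

    enum req_type { REQ_LOAD, REQ_STORE };

    struct rac_request {
        uint8_t  src_tile;    /* requesting tile, used for the reply     */
        uint8_t  type;        /* REQ_LOAD or REQ_STORE                   */
        uint64_t vaddr;       /* virtual address, never a physical one   */
        uint64_t store_data;  /* payload, meaningful only for stores     */
    };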
     Load/Store Mechanism
1.    Virtual address (v-addr) is used
      to index L1, TLB, and MAP
2.    If MAP entry points to local
      L1 then proceed with cache
      and TLB lookups as usual
3.    If MAP entry points to remote
      L1 then abort L1 lookup and
      ignore TLB result (including
      miss)
4.    v-addr is sent over network to
      home tile’s RAC
5.    v-addr is used to index both
      L1 and TLB
6.    Remote cache and TLB
      miss/hits are handled as usual

                                   ORNL - March 2007   17
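
The six steps above can be rendered, purely for illustration, as sequential C code; in hardware the L1, TLB, and MAP lookups of steps 1-3 happen in parallel, and every helper name below (map_lookup, l1_access, rac_remote_load, ...) is an invented stand-in for the corresponding hardware action:

    /* Software rendering of the load path in steps 1-6 (hypothetical helpers). */
    #include <stdbool.h>
    #include <stdint.h>

    extern int      my_tile;                                    /* this tile's ID          */
    extern int      map_lookup(uint64_t vaddr);                 /* MAP: home tile of page  */
    extern bool     l1_access(uint64_t vaddr, uint64_t *data);  /* local L1 hit?           */
    extern uint64_t local_miss_path(uint64_t vaddr);            /* local TLB + miss path   */
    extern uint64_t rac_remote_load(int home, uint64_t vaddr);  /* network round trip      */

    uint64_t do_load(uint64_t vaddr)
    {
        int home = map_lookup(vaddr);         /* steps 1-2: where does this page live?   */
        uint64_t data;

        if (home == my_tile) {                /* local page: proceed as usual            */
            if (l1_access(vaddr, &data))
                return data;                  /* local L1 hit                            */
            return local_miss_path(vaddr);    /* local TLB lookup and miss handling      */
        }

        /* Steps 3-6: remote page.  The local L1 lookup is abandoned and the
         * virtual address is shipped to the home tile's RAC, which repeats
         * the L1 and TLB lookups there and replies with the data.          */
        return rac_remote_load(home, vaddr);
    }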
 Allowing Migration of Data
 To migrate pages:
   –   Invalidate MAP entries
   –   Write-back dirty cache lines
   –   Invalidate cache contents
    –   The next access will generate a (possibly new) mapping
 Must be done at a quiescent state: e.g., barrier
 Instructions for cache writeback and invalidation already
  exist in most ISAs; all we need to add is invalidation of
  the MAP (a sketch follows this slide)
 Note that it is not necessary to invalidate the TLBs

                            ORNL - March 2007                        18
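
One way to picture the quiescent-point procedure above, assuming placeholder names for the writeback/invalidate instructions and the MAP flush (the real ISA mnemonics differ):

    /* Sketch of the per-thread work done at a barrier so that pages can be
     * re-placed afterwards.  All three primitives are placeholders for the
     * existing writeback/invalidate instructions and the new MAP flush.   */
    extern void cache_writeback_all(void);   /* write back dirty lines          */
    extern void cache_invalidate_all(void);  /* drop the (now clean) contents   */
    extern void map_invalidate_all(void);    /* next access re-maps the page    */
    extern void barrier_wait(void);

    void coherent_barrier(void)
    {
        cache_writeback_all();
        cache_invalidate_all();
        map_invalidate_all();                /* note: TLB entries may stay      */
        barrier_wait();                      /* the first touch after the barrier
                                                may establish a new home tile   */
    }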
Allowing Migration of Data
   [Diagram: tiles P0, P1, P2, each with an L1, a TLB, and a MAP;
    both P0's and P1's MAPs hold the mapping pa→P0]
    1. P0 ld 0xA0 (pa→P0)
    2. P1 ld 0xA1 (remote)
    3. P0 st 0xA2
    4. Barrier (write-back + inv. L1, inv. MAP)

                    ORNL - March 2007                                  19
Allowing Migration of Data
   [Diagram: after the barrier, both P0's and P1's MAPs now hold pa→P1]
    1. P0 ld 0xA0 (pa→P0)
    2. P1 ld 0xA1 (remote)
    3. P0 st 0xA2
    4. Barrier (write-back + inv. L1, inv. MAP)
    5. P1 ld 0xA1 (pa→P1)
    6. P0 ld 0xA0 (remote)

                    ORNL - March 2007                                  20
 Allowing Replication of Read-Only Data
 Controlled replication: multiple readers but a single, fixed
  writer
 Caveat: assume release consistent (RC) programs
 Operation
   – Processors reading data in a page get a local mapping in MAP
   – First write traps to OS and tile becomes the home/owner
   – Subsequent reads with old local mappings can continue
   – Subsequent writes, and reads without mapping, are directed to
     home tile
    – At lock acquires, MAP entries must also be invalidated
   – At barriers same procedure as for migration
 Key features
   – OS does not have to keep track of sharers
   – Ownership of modified data is not transferred
                          ORNL - March 2007                          21
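
A sketch of the controlled-replication policy from the OS point of view, using an invented page_info record and fault handlers; this is one possible reading of the rules above, not the evaluated implementation:

    /* Controlled replication: any reader may map the page locally, the
     * first writer traps to the OS and becomes the fixed home/owner, and
     * the OS never records who the sharers are.  Names are illustrative. */
    #include <stdbool.h>
    #include <stdint.h>

    struct page_info {
        uint8_t owner;       /* fixed once the first writer claims the page */
        bool    has_owner;
    };

    /* Read fault: readers without a mapping go to the owner if one exists,
     * otherwise they may simply map the page locally (replication).       */
    uint8_t handle_read_fault(struct page_info *p, uint8_t tile)
    {
        return p->has_owner ? p->owner : tile;
    }

    /* Write fault: the first writer becomes the owner; later writers are
     * pointed at the existing home tile, so ownership never moves.        */
    uint8_t handle_write_fault(struct page_info *p, uint8_t tile)
    {
        if (!p->has_owner) {
            p->owner     = tile;
            p->has_owner = true;
        }
        return p->owner;                     /* stores always go to the home */
    }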
Allowing Replication of Data
   [Diagram: P0's MAP holds pa→P0 and P1's MAP holds pa→P1
    (both readers map the page locally)]
    1. P0 ld 0xA0 (pa→P0)
    2. P1 ld 0xA1 (pa→P1)
    3. P0 st 0xA2 (owner)
    4. Barrier (write-back + inv. L1, inv. MAP)

                    ORNL - March 2007                                  22
Allowing Replication of Data
   [Diagram: P0's MAP holds pa→P0 and P2's MAP holds pa→P2]
    1. P0 ld 0xA0 (pa→P0)
    2. P1 ld 0xA1 (pa→P1)
    3. P0 st 0xA2 (owner)
    4. Barrier (write-back + inv. L1, inv. MAP)
    5. P0 ld 0xA2 (pa→P0)
    6. P2 ld 0xA1 (pa→P2)
    7. P2 st 0xA1 (owner)
    8. P0 acq. lock (invalidate MAP)

                    ORNL - March 2007                                  23
Allowing Replication of Data
   [Diagram: after the lock acquire, both P0's and P2's MAPs hold pa→P2]
    1. P0 ld 0xA0 (pa→P0)
    2. P1 ld 0xA1 (pa→P1)
    3. P0 st 0xA2 (owner)
    4. Barrier (write-back + inv. L1, inv. MAP)
    5. P0 ld 0xA2 (pa→P0)
    6. P2 ld 0xA1 (pa→P2)
    7. P2 st 0xA1 (owner)
    8. P0 acq. lock (invalidate MAP)
    9. P0 ld 0xA1 (pa→P2)

                    ORNL - March 2007                                  24
 Evaluation Environment
 Simulator:
    –   PowerPC ISA
   –   Pipelined single-issue cores and network modeled with Liberty
   –   Caches and memory borrowed from SimpleScalar
   –   All latencies modeled in detail; contention modeled in detail,
       except at internal nodes in the network
 Compiler: gcc 3.4.4
 Applications: Splash-2
 Systems simulated:
   – Proposed scheme with and without migration+sharing
     (NUCA-Dist and NUCA-Dist+MS)
   – Idealized distributed directory coherent system (Dir-Coh)

                            ORNL - March 2007                           25
       Overall Performance
       [Bar chart: speedup of each Splash-2 benchmark under Dir-Coh and
        NUCA-Dist+MS; per-benchmark annotations range from "No gap" and 3%
        up to a 42% gap]

                        NUCA-Dist+MS performs close (within 42%,
                        and 19% on avg.) to idealized directory system
                                                    ORNL - March 2007                                                   26
                      Memory Access Breakdown
       [Stacked bars for cholesky, FFT, and LU: share of memory accesses that
        are local hits, remote hits, local misses, and remote misses, for the
        Seq, NUCA-Dist, NUCA-Dist+MS, and Dir-Coh configurations (1, 16, and
        32 processors)]

                                            Off-chip accesses very rare for all systems
                                                           ORNL - March 2007                                        27
                      Memory Access Breakdown
       [Stacked bars for radix, barnes, and fmm: share of memory accesses that
        are local hits, remote hits, local misses, and remote misses, for the
        Seq, NUCA-Dist, NUCA-Dist+MS, and Dir-Coh configurations (1, 16, and
        32 processors)]

                                  Migration and replication significantly
                                  reduce the amount of remote cache accesses
                                                    ORNL - March 2007                                                28
                      Memory Access Breakdown
       [Stacked bars for ocean, radiosity, and raytrace: share of memory
        accesses that are local hits, remote hits, local misses, and remote
        misses, for the Seq, NUCA-Dist, NUCA-Dist+MS, and Dir-Coh
        configurations (1, 16, and 32 processors)]

                                                      ORNL - March 2007                                              29
                      Memory Access Breakdown
       [Stacked bars for volrend, water-nsq, and water-spt: share of memory
        accesses that are local hits, remote hits, local misses, and remote
        misses, for the Seq, NUCA-Dist, NUCA-Dist+MS, and Dir-Coh
        configurations (1, 16, and 32 processors)]

                                   NUCA-Dist+MS has on avg. 90% of local
                                   accesses
                                                      ORNL - March 2007                                              30
  Ovhd. of Writebacks and Invalidations
 Recall:
   – At barriers: write back dirty cache lines, invalidate cache and MAP
   – At lock acquires: write back dirty cache lines, invalidate MAP

                                     Overhead (%)
      Benchmark        Barriers      Lock
      cholesky          <0.01         2.93
      FFT                0.1         <0.01
      LU                 4.03        <0.01
      radix              2.75         4.23
      barnes             0.65         1.8
      fmm                0.7          0.9
      ocean             22.38         0.43
      radiosity         <0.01        17.4
      raytrace           0.38         8.07
      volrend           <0.01        <0.01
      water-nsq          0.1          1.08
      water-spt          3.83         0.5

 Writebacks and invalidations don't seem to be a major bottleneck
                            ORNL - March 2007                                   31
 Related Work
 Tiled architectures:
   – Raw (MIT); TRIPS (U. Texas); Cyclops (IBM); Wavescalar (U.
     Washington); Vector-Thread (MIT)
 Shared/Private on-chip caches
   – NUCA (U. Texas); CMP-NUCA (U. Wisconsin); NuRAPID
     (Purdue); Victim Replication (MIT); Cooperative Caching (U.
     Wisconsin)
 Alternatives to Snooping and Directory Coherence:
   – Token Coherence and Ring Coherence (U. Wisconsin)
 Software DSM


                         ORNL - March 2007                         32
 Conclusions
 Alternative to full hardware distributed directory cache
  coherence that divides work between OS and hardware
 Mostly acceptable performance degradation:
   – Within 50% for all applications and within 15% for 8 out of 12
     applications w.r.t. an ideal hardware coherent system
   – Small number of remote cache accesses (10% on average and
     38% maximum)
 Likely much less complex than distributed directory
  coherence:
   – Only request-response transactions; no forwarding or multiple
     invalidations necessary
   – Less storage required compared to directory state
                          ORNL - March 2007                           33
 Outline
 Motivation
 Page level placement and remote cache access
   – Design
   – Evaluation
   – Related Work and Conclusions
 Software self invalidations and writebacks
   – Design
   – Evaluation
   – Related Work and Conclusions
 Current and Future Directions
                        ORNL - March 2007        34
 Hardware vs. Software Cache Coherence
 Hardware
  + Fast communication and possibly more accurate invalidations
  – Eager write propagation at the time of store
   – Unnecessary invalidations on false sharing
  – No exploitation of knowledge of memory consistency model,
    synchronization points, and communication patterns
  + Has been done commercially for many years and proven to
    work well




                        ORNL - March 2007                         35
 Hardware vs. Software Cache Coherence
 Software
  + Compiler/programmer may know exactly what needs to be
    communicated and when
  – Exact sharing information not usually easy to identify statically
  – Hard to deal with hardware artifacts such as false sharing,
    speculative execution, and prefetching
   – Never really left the niche of research prototypes and DSPs
  – Only worked for loop-based data-parallel scientific
    applications
  + Less hardware required


                          ORNL - March 2007                             36
 Software Coh. for Release Consistency
 Our approach: look at the communication requirements
  of the memory consistency model (RC in our case)
   – Synchronization points explicitly defined
   – No communication between threads happens in-between
     synchronization points
    – Lock acquires:
        Data modified by the previous lock owner must be visible (transitively)
    – Lock releases:
        Data modified must be made visible to future lock owners (transitively)
    – Barriers:
        Data modified by all threads before the barrier must be visible to all
         threads after the barrier
                              ORNL - March 2007                                   37
 Software Coh. for Release Consistency
 To enforce cache coherence on RC
  – Lock acquires:
      Write back dirty lines
      Invalidate cache contents
  – Lock release:
      Write back dirty lines
  – Barrier:
      Write back dirty lines
      Invalidate cache contents
  – Use per-word dirty bits to perform selective writeback in
    memory to handle false sharing

                                ORNL - March 2007               38
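
A minimal sketch of these rules as acquire/release/barrier wrappers; the cache intrinsics and the lock/barrier primitives are placeholders for whatever the target ISA and threading library actually provide:

    /* Software self-invalidation and writeback tied to the RC
     * synchronization operations listed above.  All primitives below
     * are hypothetical names, not a real library or ISA interface.    */
    extern void cache_writeback_dirty(void); /* write back only the words marked
                                                dirty (handles false sharing)   */
    extern void cache_invalidate_all(void);  /* force re-fetch of shared data   */
    extern void hw_lock(volatile int *l);
    extern void hw_unlock(volatile int *l);
    extern void hw_barrier(void);

    void acquire(volatile int *l)
    {
        hw_lock(l);
        cache_writeback_dirty();
        cache_invalidate_all();   /* see data modified by the previous owner    */
    }

    void release(volatile int *l)
    {
        cache_writeback_dirty();  /* publish updates made in the critical section */
        hw_unlock(l);
    }

    void barrier(void)
    {
        cache_writeback_dirty();
        cache_invalidate_all();
        hw_barrier();
    }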
 Evaluation Environment
 Simulator:
    – SPARC ISA
   – Non-pipelined single-issue cores, caches, and network
     modeled with Simics
   – All latencies modeled in detail; contention modeled in detail at
     bus
 Compiler: Sun C compiler
 Applications: Splash-2 (scientific&engineering) and
  ALPBench (media)
 Systems simulated:
   – Proposed scheme with self-invalidations and writebacks
     (flush)
   – Hardware coherent system (MSI)
                           ORNL - March 2007                            39
Overall Performance




   For a surprising number of applications software
   coherence performs close (within 10%) to MSI
                     ORNL - March 2007                40
Overall Performance




   Somewhat worse results for larger caches (256KB vs.
   64KB), but still fairly close to MSI
                        ORNL - March 2007          41
Overall Performance




   Even better results with a shared L2 cache
                     ORNL - March 2007          42
 Related Work
 Software cache coherence:
    – Pfister et al., ICPP'85; Cheong and Veidenbaum, ISCA'88;
      Cytron et al., ICPP'88
    – Compiler-directed, only successful with regular data-parallel
      applications
 Most recent/relevant work: shared-regions, Sandhu et al.,
  PPoPP'93
   – Also based on explicit synchronization
   – Required user help to identify critical sections and the
     associated data
 Software DSM
                           ORNL - March 2007                         43
 Conclusions
 Revisited software-only cache coherence mechanisms
 Pure software-based cache coherence for release-
  consistent applications performs surprisingly well
   – Within 50% for all applications and within 10% for 9 out of 13
     applications w.r.t. hardware coherent system
 Some applications still suffer due to unnecessary cache
  invalidations and writebacks
 Hardware needed to perform more selective cache
  invalidation



                          ORNL - March 2007                           44
 Outline
 Motivation
 Page level placement and remote cache access
   – Design
   – Evaluation
   – Related Work and Conclusions
 Software self invalidations and writebacks
   – Design
   – Evaluation
   – Related Work and Conclusions
 Current and Future Directions
                        ORNL - March 2007        45
 Current and Future Directions
 Tiled architectures
   – Hybrid hardware-software cache coherence schemes
   – Mechanisms to support streaming applications
   – Reconfigurable on-chip cache hierarchies
 Speculative parallelization
   – Tuning of common microarchitectural features
   – Automatic compilation and optimization with gcc
   – Software-only speculative parallelization
 Performance prediction and design space exploration
   – With machine learning techniques
 Database Systems
   – JIT code generation and optimization for relational queries
                          ORNL - March 2007                        46
Alternatives to Eager Hardware Cache
   Coherence on Large-Scale CMPs

              Marcelo Cintra

            University of Edinburgh
        http://www.homepages.inf.ed.ac.uk/mc
 Implementing Locks
 Straightforward to implement Compare&Swap style primitives
  (there is only one copy of the lock variable; see the sketch
  after this slide)
 Load-link/Store-conditional primitives are not as easy to
  implement without cache coherence

   [Figure: LL/SC with and without cache coherence]
     – Coherence guarantees that all caches will be notified of any
       conflicting store
     – Absence of coherence makes it impossible to detect the conflict
                         ORNL - March 2007                            48
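
For illustration, a spin lock built on a Compare&Swap-style primitive; remote_cas() is a hypothetical wrapper around an atomic operation performed at the lock's single copy in its home tile, not an interface from the actual design:

    /* Spin lock on top of a CAS-style primitive.  Because every lock
     * variable has exactly one copy, in the L1 of its home tile, the CAS
     * can simply be performed there via the RAC.                        */
    #include <stdint.h>

    /* Atomically, at the home tile: if *addr == expected, write desired;
     * return the value observed before the operation.                   */
    extern uint32_t remote_cas(volatile uint32_t *addr,
                               uint32_t expected, uint32_t desired);

    void spin_lock(volatile uint32_t *lock)
    {
        while (remote_cas(lock, 0, 1) != 0)
            ;                               /* spin at the home tile      */
    }

    void spin_unlock(volatile uint32_t *lock)
    {
        remote_cas(lock, 1, 0);             /* single copy: a plain remote
                                               store would also do        */
    }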
 Implementing Locks
 Solution:
   – Place RESERVE register in the tile that is home to the lock
   – RESERVE register is shared by all locks in pages assigned to
     the tile
   – Must not allow new LL requests to overwrite the RESERVE
     register (to avoid livelock)
       Must also add a timeout mechanism to prevent deadlock when there
        is no matching SC
   – Must add an ID tag to the RESERVE register to identify the
     processor with the pending LL (to avoid simultaneous
     acquisition of a lock)


                            ORNL - March 2007                              49
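
A sketch of how the home tile's RAC might manage the RESERVE register; the structure fields, the timeout value, and the handling below are assumptions about one possible design, not the evaluated hardware:

    /* Per-tile RESERVE register supporting load-linked/store-conditional
     * at the lock's home tile.  Conflicting-store handling is omitted;
     * this only shows the LL/SC bookkeeping described on the slide.     */
    #include <stdbool.h>
    #include <stdint.h>

    struct reserve_reg {
        bool     valid;
        uint8_t  cpu;      /* ID tag: which processor holds the pending LL */
        uint64_t vaddr;    /* reserved address                             */
        uint32_t ttl;      /* timeout so an unmatched LL cannot hold the
                              register forever (prevents deadlock)         */
    };

    /* LL: grant a reservation only if the register is free, so a new LL
     * cannot overwrite an existing one (avoids livelock).                */
    bool rac_handle_ll(struct reserve_reg *r, uint8_t cpu, uint64_t vaddr)
    {
        if (r->valid)
            return false;              /* requester must retry later       */
        r->valid = true;
        r->cpu   = cpu;
        r->vaddr = vaddr;
        r->ttl   = 1024;               /* arbitrary illustrative timeout   */
        return true;
    }

    /* SC: succeeds only for the processor and address that set the
     * reservation; either way the reservation is consumed.               */
    bool rac_handle_sc(struct reserve_reg *r, uint8_t cpu, uint64_t vaddr)
    {
        bool ok = r->valid && r->cpu == cpu && r->vaddr == vaddr;
        r->valid = false;
        return ok;                     /* on success the store completes in
                                          the home tile's L1               */
    }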
 Evaluation Environment
 Configurations simulated:




                      ORNL - March 2007   50
 Evaluation Environment
 Applications: Splash-2 benchmarks with reference input
  sets (except for radiosity)




                      ORNL - March 2007                    51
Memory Latencies
       [Bar chart: fraction of loops (%) falling into the speedup bins
        0.5<S≤1, 1<S≤2, 2<S≤3, and 3<S≤4 for mesa, art, equake, ammp, vpr,
        mcf, crafty, vortex, bzip2, and the average]

                              NUCA-Dist+MS performs very close (within
                              x%) to the ideal coherent system
                                                ORNL - March 2007                                        52
 Evaluation Environment
 Configurations simulated: 2 to 16 processors




                       ORNL - March 2007         53
 Evaluation Environment
 Applications: Splash-2 applications and ALPBench
  benchmarks, both with reference input sets (except
  MPGDEC/ENC)




                       ORNL - March 2007               54
Sources of largest errors (top 10%)

            Source of error                     Number    Error (%)
    Incorrect IR workload estimation               4       54~116
    Unknown iteration count (i < P)                3       54~61
    Unknown inner-loop iteration count             2       98~161
    Biased conditional                             1        136

                          ORNL - March 2007                   55
Application Characteristics

    Application    Loops               % of Seq. Time    Spec data Size (KB)
    TREE           accel_10                  94                  <1
    WUPWISE        muldeo_200',              41               12,000
                   muldoe_200'
    MDG            interf_1000               86                  <1
    LUCAS          mers_mod_square           20                4,000
                   (line 444)
    AP3M           Shgravll_700              78                3,000

                         ORNL - March 2007                     56
 Breaking News!!

 Intel has just announced an 80-core chip at
  ISSCC
  – Simple cores
  – Point-to-point interconnect
  – No cache coherence (?)




                       ORNL - March 2007         57