									Workload-Driven Evaluation

      CS 258, Spring 99
       David E. Culler
   Computer Science Division
        U.C. Berkeley
Workload-Driven Evaluation
• Evaluating real machines
• Evaluating an architectural idea or trade-offs

=> need good metrics of performance
=> need to pick good workloads
=> need to pay attention to scaling
      – many factors involved




2/12/99                         CS258 S99          2
Working Set Perspective
• At a given level of the hierarchy (to the next further one)


 [Figure: data traffic versus replication capacity (cache size). The curve
  drops at the first and second working sets. Traffic components: capacity-
  generated traffic (including conflicts), other capacity-independent
  communication, inherent communication, cold-start (compulsory) traffic.]




 – Hierarchy of working sets
 – At first level cache (fully assoc, one-word block), inherent to algorithm
     » working set curve for program
 – Traffic from any type of miss can be local or nonlocal (communication)
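
A working set curve of this kind can be reproduced with a small simulation. The sketch below is illustrative only: it assumes a fully associative LRU cache with one-word blocks, and the reference stream from `synthetic_trace` is a hypothetical one (an 8-word hot set touched after every word of a repeatedly swept 64-word set), not a trace of any application in these slides.

```python
from collections import OrderedDict


def miss_rate(trace, capacity):
    """Miss rate of a fully associative LRU cache of `capacity` one-word blocks."""
    cache = OrderedDict()
    misses = 0
    for addr in trace:
        if addr in cache:
            cache.move_to_end(addr)          # hit: mark as most recently used
        else:
            misses += 1
            cache[addr] = True
            if len(cache) > capacity:
                cache.popitem(last=False)    # evict the least recently used word
    return misses / len(trace)


def synthetic_trace(hot_words=8, big_words=64, sweeps=10):
    """Hypothetical stream: the whole hot set is touched after each word of a larger swept set."""
    trace = []
    for _ in range(sweeps):
        for big in range(big_words):
            trace.append(1000 + big)         # one word of the larger set
            trace.extend(range(hot_words))   # then every word of the hot set
    return trace


trace = synthetic_trace()
for size in (1, 2, 4, 8, 9, 16, 32, 64, 72, 128):
    print(f"{size:4d} words: miss rate {miss_rate(trace, size):.4f}")
```

For this stream the miss rate stays flat between knees and drops sharply at 9 words (hot set plus one streaming word) and again at 72 words (both sets), leaving only cold-start misses beyond that.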

Example Application Set
Table 4.1 General Statistics about Application Programs

                                          Total         Total    Total       Total    Total    Shared   Shared
                                          Instructions  FLOPS    References  Reads    Writes   Reads    Writes
Application        Input Data Set         (M)           (M)      (M)         (M)      (M)      (M)      (M)      Barriers  Locks
LU                 512 × 512 matrix,      489.52        92.20    151.07      103.09   47.99    92.79    44.74    66        0
                   16 × 16 blocks
Ocean              258 × 258 grids,       376.51        101.54   99.70       81.16    18.54    76.95    16.97    364       1,296
                   tolerance = 10^-7,
                   4 time-steps
Barnes-Hut         16-K particles,        2,002.74      239.24   720.13      406.84   313.29   225.04   93.23    7         34,516
                   θ = 1.0,
                   3 time-steps
Radix              256-K points,          84.62         -        14.19       7.81     6.38     3.61     2.18     11        16
                   radix = 1,024
Raytrace           Car scene              833.35        -        290.35      210.03   80.31    161.10   22.35    0         94,456
Radiosity          Room scene             2,297.19      -        769.56      486.84   282.72   249.67   21.88    10        210,485
Multiprog: User    SGI IRIX 5.2,          1,296.43      -        500.22      350.42   149.80   -        -        -         -
                   two pmakes +
                   two compress jobs
Multiprog: Kernel                         668.10        -        212.58      178.14   34.44    -        -        -         621,505

For the parallel programs, shared reads and writes simply refer to all
nonstack references issued by the application processes. All such references
do not necessarily point to data that is truly shared by multiple processes.
The Multiprog workload is not a parallel application, so it does not access
shared data. A dash in a table entry means that this measurement is not
applicable to or is not measured for that application (e.g., Radix has no
floating-point operations). (M) denotes that measurement in that column is in
millions.
Working Sets (P=16, assoc, 8 byte)
 [Figure: miss rate (%) versus cache size (K), from 1 to 1,024, in six panels:
  (a) LU, (b) Ocean, (c) Barnes-Hut, (d) Radiosity, (e) Ray trace, (f) Radix.
  The knees of each curve are labeled L1 WS and L2 WS; one panel also marks
  an L0 WS.]
Working Sets Change with P (NPB)




 [Figure annotation: 8-fold reduction in miss rate from 4 to 8 processors]
Where the Time Goes: NPB LU-a

 [Figure: stacked bar chart of total time (axis 0 to 3000) on 4, 8, 16, and
  32 processors, broken down into Compute, Send, Receive, and Wait.]
False Sharing Misses: Artifactual Comm.
• Different processors update different words in same block
• Hardware treats it as sharing
      – cache block is unit of coherence
• Ping-pongs between caches

 [Figure: contiguity in memory layout. A grid is partitioned among
  processors P0 through P8; a cache block straddles a partition boundary.]
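
The ping-pong effect can be illustrated with a toy model. The sketch below is a simplification under stated assumptions: a write-invalidate protocol in which a block has a single exclusive owner, writes from the processors interleaved round-robin, and reads ignored. The function name and its parameters are hypothetical and chosen for illustration.

```python
def count_ownership_transfers(num_procs, words_per_block, stride_words, iterations):
    """Count write misses when processor i repeatedly writes word i * stride_words.

    A write misses whenever another processor (or nobody) currently owns the
    block, because the block is the unit of coherence.
    """
    owner = {}        # block number -> processor holding it exclusively
    transfers = 0
    for _ in range(iterations):
        for proc in range(num_procs):              # round-robin interleaving of writes
            block = (proc * stride_words) // words_per_block
            if owner.get(block) != proc:
                transfers += 1                     # block moves to the writing processor
                owner[block] = proc
    return transfers


# Four processors, 8-word blocks, 100 writes each.
packed = count_ownership_transfers(4, 8, stride_words=1, iterations=100)
padded = count_ownership_transfers(4, 8, stride_words=8, iterations=100)
print("adjacent words in one block:", packed)      # every write misses
print("one block per processor:   ", padded)       # only the first write of each misses
```

No word is actually shared in either layout; the difference comes entirely from whether the processors' words fall in the same block.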
Questions in Scaling
• Scaling a machine: Can scale power in many
  ways
      – Assume adding identical nodes, each bringing memory
• Problem size: Vector of input parameters, e.g. N
  = (n, θ, Δt)
      – Determines work done
      – Distinct from data set size and memory usage
• Under what constraints to scale the application?
      – What are the appropriate metrics for performance
        improvement?
          » work is not fixed any more, so time not enough
• How should the application be scaled?


  Under What Constraints to Scale?
• Two types of constraints:
   – User-oriented, e.g. particles, rows, transactions, I/Os per processor
   – Resource-oriented, e.g. memory, time
• Which is more appropriate depends on application
  domain
   – User-oriented easier for user to think about and change
   – Resource-oriented more general, and often more real
• Resource-oriented scaling models:
   – Problem constrained (PC)
   – Memory constrained (MC)
   – Time constrained (TC)
• (TPC: transactions, users, terminals scale with
  “computing power”)
• Growth under MC and TC may be hard to predict
Problem Constrained Scaling
• User wants to solve same problem, only faster
      – Video compression
      – Computer graphics
      – VLSI routing

• But limited when evaluating larger machines

      SpeedupPC(p) = Time(1) / Time(p)




Time Constrained Scaling
• Execution time is kept fixed as system scales
    – User has fixed time to use machine or wait for result
• Performance = Work/Time as usual, and time is
  fixed, so
      SpeedupTC(p) = Work(p) / Work(1)

• How to measure work?
      – Execution time on a single processor? (thrashing problems)
      – Should be easy to measure, ideally analytical and intuitive
      – Should scale linearly with sequential complexity
          » Or ideal speedup will not be linear in p (e.g. no. of rows in
            matrix program)
      – If cannot find intuitive application measure, as often true,
        measure execution time with ideal memory system on a
        uniprocessor (e.g. pixie)
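
The point about the work measure can be checked numerically. The sketch below assumes a matrix program whose sequential work grows as n^3 and an ideal machine with no overheads; `tc_scaled_rows` and `speedup_tc` are hypothetical helper names, not part of any tool mentioned here.

```python
def tc_scaled_rows(n1, p, work_exponent=3):
    """Matrix dimension n(p) that keeps ideal parallel time fixed when work ~ n**work_exponent."""
    return n1 * p ** (1.0 / work_exponent)


def speedup_tc(work_p, work_1):
    """Time-constrained speedup: ratio of work completed in the same time."""
    return work_p / work_1


n1 = 1000
for p in (1, 8, 64, 512):
    n_p = tc_scaled_rows(n1, p)
    by_rows = speedup_tc(n_p, n1)             # work measured as number of rows
    by_ops = speedup_tc(n_p ** 3, n1 ** 3)    # work measured as n^3 operations
    print(f"p={p:4d}  n={n_p:7.0f}  speedup by rows={by_rows:5.2f}  by operations={by_ops:7.2f}")
```

Measured in rows, the ideal speedup grows only as the cube root of p; measured in operations, which scale linearly with sequential complexity, it is linear in p.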
Memory Constrained Scaling
• Scale so memory usage per processor stays fixed
• Scaled Speedup: Time(1) / Time(p) for scaled up
  problem
    – Hard to measure Time(1), and inappropriate

    SpeedupMC(p) = (Work(p) / Time(p)) x (Time(1) / Work(1))
                 = Increase in Work / Increase in Time

• Can lead to large increases in execution time
    – If work grows faster than linearly in memory usage
    – e.g. matrix factorization
        » 10,000-by 10,000 matrix takes 800MB and 1 hour on
          uniprocessor. With 1,000 processors, can run 320K-by-320K
          matrix, but ideal parallel time grows to 32 hours!
        » With 10,000 processors, 100 hours ...
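
The arithmetic behind the matrix factorization example can be sketched as follows, assuming memory grows as n^2, work grows as n^3, and parallel time is ideal (work divided evenly over p processors). The function name is hypothetical.

```python
import math


def mc_scaled_factorization(n1, hours1, p):
    """Return (matrix dimension, ideal parallel hours) under memory-constrained scaling."""
    n_p = n1 * math.sqrt(p)               # total memory ~ n^2 grows by a factor of p
    work_growth = (n_p / n1) ** 3         # work ~ n^3
    return n_p, hours1 * work_growth / p  # ideal time = scaled work spread over p processors


for p in (1, 1_000, 10_000):
    n_p, hours = mc_scaled_factorization(10_000, 1.0, p)
    print(f"p={p:6d}  n={n_p:12,.0f}  ideal time={hours:6.1f} hours")
```

Under these assumptions the 1,000-processor case comes out at roughly 316K-by-316K and 31.6 hours, which the slide rounds to 320K and 32 hours. In general the ideal time grows as the square root of p.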

Scaling Summary
• Under any scaling rule, relative structure of the
  problem changes with P
      – PC scaling: per-processor portion gets smaller
      – MC & TC scaling: total problem gets larger
• Need to understand hardware/software
  interactions with scale

• For given problem, there is often a natural
  scaling rule
      – example: equal error scaling




   Types of Workloads
   – Kernels: matrix factorization, FFT, depth-first tree search
   – Complete Applications: ocean simulation, crew scheduling, database
   – Multiprogrammed Workloads


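• Spectrum: Multiprog.    Appls    Kernels    Microbench.

   Toward multiprogrammed workloads and applications:
      – Realistic
      – Complex
      – Higher level interactions
      – Are what really matters
   Toward kernels and microbenchmarks:
      – Easier to understand
      – Controlled
      – Repeatable
      – Basic machine characteristics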

Each has its place:
   Use kernels and microbenchmarks to gain understanding, but
   applications to evaluate effectiveness and performance
NOW Ultra 170 vs Enterprise 5000
• Workstation UPA
      – cross bar
• SMP
      – switch between Ultrasparc coherence protocol (MOESI) and
        bus protocol (MSI)

 [Figure: NOW Ultra 170 node. UltraSparc with L2 cache and 64MB memory on
  the UPA; B/A to S-bus (25 MHz); Myricom Lanai NIC (37.5 MHz proc,
  256 MB sram, 3 dma units); 160 MB/s bidirectional links to 8-port
  wormhole switches.]

 [Figure: Enterprise 5000. Gigaplane(TM) bus (256 data, 41 address, 83 MHz)
  connecting:
      – 4 Processing Cards, attached through a bus interface / switch
          » 2 x Ultra1 CPUs with 512 KB L2
          » 2 x 64 MB DRAM banks
          » mem ctrl
      – 3 I/O Cards, attached through a bus interface
          » 2 64B x 25 MHz SBus each
          » multiple Myricom Lanai network interface cards
          » 2 FiberChannel, 100bT, SCSI]
Microbenchmarks




• Memory access latency (512KB L2, 64B blocks)
      – Enterprise 5000:   51 cycles       Ultra 170: 44 cycles
      – other L2:          84 cycles
• Memory copy bandwidth
      – Enterprise 5000:   184 MB/s        Ultra 170: 168 MB/s
• Arithmetic, floating point, graphics, ...
Coverage: Stressing Features
• Easy to mislead with workloads
      – Choose those with features for which machine is good, avoid
        others
• Some features of interest:
      – Compute v. memory v. communication v. I/O bound
      – Working set size and spatial locality
      – Local memory and communication bandwidth needs
      – Importance of communication latency
      – Fine-grained or coarse-grained
          » Data access, communication, task size
      – Synchronization patterns and granularity
      – Contention
      – Communication patterns
• Choose workloads that cover a range of properties
  Coverage: Levels of Optimization
• Many ways in which an application can be suboptimal
   – Algorithmic, e.g. assignment, blocking
     [Figure labels: 2n/p, 4n/p]

   – Data structuring, e.g. 2-d or 4-d arrays for SAS grid problem
   – Data layout, distribution and alignment, even if properly structured
   – Orchestration
       » contention
       » long versus short messages
       » synchronization frequency and cost, ...
   – Also, random problems with “unimportant” data structures
• Optimizing applications takes work
   – Many practical applications may not be very well optimized

• May examine selected different levels to test robustness
  of system
    Concurrency
• Should have enough to utilize the processors
   – If load imbalance dominates, may not be much machine can do
   – (Still, useful to know what kinds of workloads/configurations don’t
     have enough concurrency)

• Algorithmic speedup: useful measure of
  concurrency/imbalance
   – Speedup (under scaling model) assuming all memory/communication
     operations take zero time
   – Ignores memory system, measures imbalance and extra work
   – Uses PRAM machine model (Parallel Random Access Machine)
            » Unrealistic, but widely used for theoretical algorithm development


• At least, should isolate performance limitations due to
  program characteristics that a machine cannot do
  much about (concurrency) from those that it can.
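
Algorithmic speedup can be computed directly from operation counts. The sketch below assumes a PRAM-style model in which every operation costs one unit and memory and communication are free; the per-processor counts are made-up numbers for illustration.

```python
def algorithmic_speedup(sequential_work, per_processor_work):
    """Speedup when memory and communication take zero time.

    Parallel time is set by the most heavily loaded processor, so both load
    imbalance and extra work show up as lost speedup.
    """
    return sequential_work / max(per_processor_work)


balanced = algorithmic_speedup(1000, [250, 250, 250, 250])
imbalanced = algorithmic_speedup(1000, [400, 200, 200, 200])
extra_work = algorithmic_speedup(1000, [300, 300, 300, 300])
print(balanced, imbalanced, round(extra_work, 2))
```

Any gap between this number and the processor count is a property of the program; a gap between measured speedup and this number is where the machine has room to help.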
Workload/Benchmark Suites
• Numerical Aerodynamic Simulation (NAS)
      – Originally pencil and paper benchmarks
• SPLASH/SPLASH-2
      – Shared address space parallel programs
• ParkBench
      – Message-passing parallel programs
• ScaLapack
      – Message-passing kernels
• TPC
      – Transaction processing
• SPEC-HPC
• ...
  Evaluating a Fixed-size Machine
• Many critical characteristics depend on problem
  size
   – Inherent application characteristics
       » concurrency and load balance (generally improve with
          problem size)
       » communication to computation ratio (generally improve)
       » working sets and spatial locality (generally worsen and
          improve, resp.)
   – Interactions with machine organizational parameters
   – Nature of the major bottleneck: comm., imbalance, local
     access...
• Insufficient to use a single problem size
• Need to choose problem sizes appropriately
   – Understanding of workloads will help

Our problem today
• Evaluate architectural alternatives
      – protocols, block size
• Fix machine size and characteristics
• Pick problems and problem sizes




Steps in Choosing Problem Sizes
1. Appeal to higher powers
      May know that users care only about a few problem sizes
      But not generally applicable
2. Determine range of useful sizes
      Below which bad perf. or unrealistic time distribution in phases
      Above which execution time or memory usage too large
3. Use understanding of inherent characteristics
      Communication-to-computation ratio, load balance...

      For grid solver, perhaps at least 32-by-32 points per processor
      40MB/s c-to-c ratio with 200MHz processor
      No need to go below 5MB/s (larger than 256-by-256 subgrid per
        processor) from this perspective, or 2K-by-2K grid overall
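
The grid solver figures follow from a perimeter-to-area argument: a processor owning an n-by-n subgrid communicates on the order of n boundary points per sweep while computing on n^2 points, so the communication rate it needs falls as 1/n. The sketch below takes the slide's 40 MB/s at a 32-by-32 subgrid as the calibration point; the square block decomposition and the helper names are assumptions.

```python
def comm_bandwidth_mb_s(subgrid_side, ref_side=32, ref_mb_s=40.0):
    """Per-processor communication rate, scaled as 1/subgrid_side from a reference point."""
    return ref_mb_s * ref_side / subgrid_side


def overall_grid_side(subgrid_side, procs_per_dim):
    """Side of the whole grid for a square block decomposition (assumed)."""
    return subgrid_side * procs_per_dim


for side in (32, 64, 128, 256):
    print(f"{side:3d}-by-{side:<3d} subgrid: {comm_bandwidth_mb_s(side):5.1f} MB/s")

# A 256-by-256 subgrid on an 8-by-8 processor arrangement gives a 2K-by-2K grid.
print(overall_grid_side(256, 8))
```

An 8-fold increase in subgrid side, from 32 to 256, is what takes the requirement from 40 MB/s down to 5 MB/s.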


  Steps in Choosing Problem Sizes
• Variation of characteristics with problem size usually
  smooth
   – So, for inherent comm. and load balance, pick some sizes along range



• Interactions of locality with architecture often have
  thresholds (knees)
   – Greatly affect characteristics like local traffic, artifactual comm.
   – May require problem sizes to be added
       » to ensure both sides of a knee are captured
   – But also help prune the design space




  Choosing Problem Sizes (contd.)
4. Use temporal locality and working sets
    Fitting or not dramatically changes local traffic and artifactual comm.
    E.g. Raytrace working sets are nonlocal, Ocean are local
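    [Figure: (a) miss ratio vs. cache size, with knees at working sets WS1, WS2, WS3 and a cache size C marked; (b) % of working set that fits in cache of size C (up to 100) vs. problem size, with curves for WS1, WS2, WS3 and problem sizes Problem1 to Problem4 marked]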
– Choose problem sizes on both sides of a knee if realistic
    » Critical to understand growth rate of working sets
– Also try to pick one very large size (exercises TLB misses etc.)
– Solver: first (2 subrows) usually fits, second (full partition) may or may not
    » Doesn’t for largest (2K) so add 4K-by-4K grid
    » Add 16K as large size, so grid sizes now 256, 1K, 2K, 4K, 16K (in each dimension)
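
A rough way to see which side of a knee each grid size lands on is to compare the two solver working sets against the cache size. This is only a sketch: it assumes a square n-by-n grid split into square partitions across p processors, 8-byte grid points, and a single grid per partition (a real solver may keep several grids, which enlarges both working sets). The function names are hypothetical; the 16 processors and 1MB cache are the defaults used later in the lecture.

```python
import math


def solver_working_sets(n, p, bytes_per_point=8):
    """Return (first, second) working-set sizes in bytes for one processor."""
    side = n // math.isqrt(p)                 # points per side of one partition
    first = 2 * side * bytes_per_point        # two subrows
    second = side * side * bytes_per_point    # full partition
    return first, second


def fits(n, p, cache_bytes):
    """Report whether each working set fits in a cache of cache_bytes."""
    first, second = solver_working_sets(n, p)
    return first <= cache_bytes, second <= cache_bytes


for n in (256, 1024, 2048, 4096, 16384):
    print(n, fits(n, p=16, cache_bytes=1 << 20))
```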

Multiprocessor Simulation
• Simulation runs on a uniprocessor (can be parallelized too)
      – Simulated processes are interleaved on the processor
• Two parts to a simulator:
      – Reference generator: plays role of simulated processors
          » And schedules simulated processes based on simulated time
      – Simulator of extended memory hierarchy
          » Simulates operations (references, commands) issued by
            reference generator
• Coupling or information flow between the two parts varies
      – Trace-driven simulation: from generator to simulator
      – Execution-driven simulation: in both directions (more accurate)
• Simulator keeps track of simulated time and detailed
  statistics



  Execution-driven Simulation
• Memory hierarchy simulator returns simulated time
  information to reference generator, which is used to
  schedule simulated processes
  [Figure: the reference generator (simulated processors P1 ... Pp) coupled to the memory and interconnect simulator (caches $1 ... $p, memories Mem 1 ... Mem p, and the network)]
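
The two-way coupling can be sketched as a small event loop. This is a toy under stated assumptions, not the simulator used in the course: the memory model is one infinite cache per processor with hypothetical hit and miss latencies, and all names (`ToyMemorySimulator`, `run`) are made up for illustration.

```python
import heapq

HIT_CYCLES, MISS_CYCLES = 1, 100  # hypothetical latencies


class ToyMemorySimulator:
    """Stands in for the memory simulator: one infinite cache per processor."""

    def __init__(self, num_procs):
        self.cached = [set() for _ in range(num_procs)]
        self.misses = 0

    def access(self, proc, block):
        """Return the simulated latency of one reference."""
        if block in self.cached[proc]:
            return HIT_CYCLES
        self.cached[proc].add(block)
        self.misses += 1
        return MISS_CYCLES


def run(streams):
    """streams[i] lists the block addresses issued by simulated process i."""
    memory = ToyMemorySimulator(len(streams))
    # Reference generator state: (simulated time, process id, next index).
    ready = [(0, pid, 0) for pid in range(len(streams))]
    heapq.heapify(ready)
    finish = [0] * len(streams)
    while ready:
        # The process with the smallest simulated time is scheduled next.
        now, pid, idx = heapq.heappop(ready)
        if idx == len(streams[pid]):
            finish[pid] = now
            continue
        # Generator -> simulator: issue the reference.
        latency = memory.access(pid, streams[pid][idx])
        # Simulator -> generator: the returned time decides the next schedule.
        heapq.heappush(ready, (now + latency, pid, idx + 1))
    return finish, memory.misses


print(run([[0, 0, 1], [2]]))
```

In a trace-driven setup the interleaving of references would be fixed in advance, so the latencies returned by the memory model could not change the schedule.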



Difficulties in Simulation-based Evaluation
• Cost of simulation (in time and memory)
   – cannot simulate the problem/machine sizes we care about
   – have to use scaled down problem and machine sizes
       » how to scale down and stay representative?
• Huge design space
   – application parameters (as before)
   – machine parameters (depending on generality of evaluation context)
       » number of processors
       » cache/replication size
       » associativity
       » granularities of allocation, transfer, coherence
       » communication parameters (latency, bandwidth, occupancies)
   – cost of simulation makes it all the more critical to prune the space
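
To get a feel for how quickly the space grows, the sketch below counts the combinations of a few machine parameters. All parameter values are hypothetical and chosen only for illustration; the pruned cache sizes (64KB and 1MB) are the two used later in the lecture.

```python
from math import prod

# Hypothetical parameter values, chosen only to illustrate the growth.
full_space = {
    "processors": [4, 16, 64],
    "cache_kb": [2 ** k for k in range(11)],   # 1 KB up to 1,024 KB
    "associativity": [1, 2, 4],
    "block_bytes": [8, 16, 32, 64, 128, 256],
    "latency": ["low", "medium", "high"],
    "bandwidth": ["low", "medium", "high"],
}


def num_configs(space):
    """Number of simulation runs needed to cover every combination once."""
    return prod(len(values) for values in space.values())


# Pruning: keep only two cache sizes instead of the full sweep.
pruned_space = dict(full_space, cache_kb=[64, 1024])

print(num_configs(full_space), "->", num_configs(pruned_space))
```

Each remaining configuration still has to be run for every application and problem size, so the saving multiplies.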



  Choosing Parameters
• Problem size and number of processors
   – Use inherent characteristics considerations as discussed earlier
   – For example, low c-to-c ratio will not allow block transfer to help much
• Cache/Replication Size
   – Choose based on knowledge of working set curve
   – Choosing cache sizes for a given problem and machine size is analogous
     to choosing problem sizes for a given cache and machine size,
     discussed earlier
   – Whether or not the working set fits affects block transfer benefits greatly
       » If the data is local, not fitting makes communication relatively less
         important
       » If nonlocal, not fitting can increase artifactual comm., so block
         transfer (BT) has more opportunity
   – Sharp knees in working set curve can help prune space
       » Knees can be determined by analysis or by very simple
         simulation
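
A "very simple simulation" for locating knees can be as small as a fully associative LRU cache swept over a range of sizes. The sketch below is an illustration under assumptions: the reference stream is synthetic (a small, heavily reused set nested inside a larger sweep), and the function names and sizes are hypothetical.

```python
from collections import OrderedDict


def lru_miss_rate(trace, capacity_blocks):
    """Miss rate of a fully associative LRU cache of capacity_blocks blocks."""
    cache = OrderedDict()
    misses = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)          # mark as most recently used
        else:
            misses += 1
            cache[block] = True
            if len(cache) > capacity_blocks:
                cache.popitem(last=False)     # evict the least recently used
    return misses / len(trace)


def synthetic_trace(small_ws=4, large_ws=32, sweeps=10):
    """Hypothetical stream: each large-set reference is followed by a pass
    over the small set, giving two nested working sets."""
    trace = []
    for _ in range(sweeps):
        for big in range(small_ws, small_ws + large_ws):
            trace.append(big)
            trace.extend(range(small_ws))
    return trace


trace = synthetic_trace()
for size in (1, 2, 4, 8, 16, 32, 64):
    print(f"{size:>2} blocks: miss rate {lru_miss_rate(trace, size):.4f}")
```

With this stream the miss rate drops sharply between 4 and 8 blocks and again between 32 and 64 blocks, so only sizes around those two knees need to be simulated in detail.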
Our Cache Sizes (16x1MB, 16x64KB)
  [Figure: miss rate (%) vs. cache size (K), from 1 to 1,024, in six panels: (a) LU, (b) Ocean, (c) Barnes–Hut, (d) Radiosity, (e) Ray trace, (f) Radix. Working-set knees are labeled on the curves (L0 WS, L1 WS, L2 WS where present).]
Focus on protocol tradeoffs
• Methodology:
      – Use Splash II and multiprogrammed workload (à la Ch. 4)
      – Choose $ parameters per earlier methodology
          » default 1MB, 4-way cache, 64-byte block, 16 processors;
             64K cache for some
      – Focus on frequencies, not end performance for now
          » transcends architectural details, but not what we’re really
             after
      – Use idealized memory performance model to avoid changes
        of reference interleaving across processors with machine
        parameters
          » Cheap simulation: no need to model contention
      – Run program on parallel machine simulator
          » collect trace of cache state transitions
          » analyze properties of the transitions
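
The last step can be sketched as a tally over the collected trace, normalized to transitions per 1000 references. The trace format (a list of `(from_state, to_state)` pairs, one per reference) and the function name are assumptions made for illustration.

```python
from collections import Counter


def transitions_per_1000(trace, num_references):
    """Frequency of each (from_state, to_state) pair per 1000 references."""
    counts = Counter(trace)
    return {pair: 1000.0 * n / num_references for pair, n in counts.items()}


# Hypothetical four-reference trace, one state transition per reference.
example = [("I", "S"), ("S", "S"), ("S", "M"), ("S", "S")]
print(transitions_per_1000(example, num_references=len(example)))
```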


Bandwidth per transition
 Bus Transaction     Address / Cmd       Data
 BusRd               6                   64
 BusRdX              6                   64
 BusWB               6                   64
 BusUpgd             6                   --

 [Figure: MESI state transition diagram with states M, E, S, I, processor events PrRd and PrWr, bus events BusRd and BusRdX, and Flush actions]

 Ocean Data Cache Frequency Matrix (per 1000)

            NP       I        E        S         M
  NP        0        0        1.25     0.96      0.001
  I         0.64     0        0        1.87      0.001
  E         0.20     0        14.00    0.0       2.24
  S         0.42     2.50     0        134.72    2.24
  M         2.63     0.00     0        2.30      843.57
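
One way to turn the frequency matrix into bus traffic is to map each transition to the bus transaction it issues and weight it by that transaction's size. The sketch below rests on several assumptions the slide does not state: rows are the "from" state and columns the "to" state, the table entries are bytes, E -> M issues nothing, M -> NP is a write-back, and flushes caused by another cache's request are already paid for by that request. The mapping function is hypothetical.

```python
STATES = ["NP", "I", "E", "S", "M"]

# Ocean data-cache transitions per 1000 references, copied from the slide.
# Assumption: rows are the "from" state, columns the "to" state.
FREQ = [
    [0,    0,    1.25,  0.96,   0.001],   # from NP
    [0.64, 0,    0,     1.87,   0.001],   # from I
    [0.20, 0,    14.00, 0.0,    2.24],    # from E
    [0.42, 2.50, 0,     134.72, 2.24],    # from S
    [2.63, 0.00, 0,     2.30,   843.57],  # from M
]

# Address/command plus data per transaction (assumed to be bytes).
BYTES = {"BusRd": 6 + 64, "BusRdX": 6 + 64, "BusWB": 6 + 64, "BusUpgd": 6}


def bus_transaction(src, dst):
    """Assumed mapping from a state transition to the transaction it issues."""
    if src in ("NP", "I") and dst in ("E", "S"):
        return "BusRd"
    if src in ("NP", "I") and dst == "M":
        return "BusRdX"
    if src == "S" and dst == "M":
        return "BusUpgd"
    if src == "M" and dst == "NP":
        return "BusWB"
    return None  # hits, E -> M, and snoop-induced transitions issue nothing


def bytes_per_1000_refs():
    """Total bus bytes generated per 1000 references under the mapping above."""
    total = 0.0
    for i, src in enumerate(STATES):
        for j, dst in enumerate(STATES):
            txn = bus_transaction(src, dst)
            if txn is not None:
                total += FREQ[i][j] * BYTES[txn]
    return total


print(f"{bytes_per_1000_refs():.2f} bytes per 1000 references")
```

Converting this to MB/s additionally requires the rate at which the processor issues data references, which depends on the processor model.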



Bandwidth Trade-off
• E -> M are infrequent
• BusUpgrade is cheap
• 1 MB Cache, 200 MIPS / 200 MFLOPS Processor
  [Figure: two bar charts of traffic (MB/s) with legend entries Bus and Cmd, comparing the Ill, 3St, and 3St-RdEx protocols. First chart (0 to 200 MB/s): barnes, lu, ocean, radiosity, radix, raytrace. Second chart (0 to 45 MB/s): Appl-Code, Appl-Data, OS-Code, OS-Data.]
Smaller (64KB) Caches
  [Figure: bar chart of traffic (MB/s), 0 to 400, with legend entries Bus and Cmd, for ocean, radix, and raytrace under the Ill, 3St, and 3St-RdEx protocols]
Cache Block Size
• Trade-offs in uniprocessors with increasing block size
      – reduced cold misses (due to spatial locality)
      – increased transfer time
      – increased conflict misses (fewer sets)
• Additional concerns in multiprocessors
      –   parallel programs have less spatial locality
      –   parallel programs have sharing
      –   false sharing
      –   bus contention
• Need to classify misses to understand impact
      – cold misses
      – capacity / conflict misses
      – true sharing misses
           » one processor writes words in a block, invalidating the copy of the block in another
             processor’s cache, which that processor later reads
      – false sharing misses

Miss Classification
• First reference to memory block by PE
      – first access system wide? yes: 1. pure-cold
      – no: written before?
           » no: 2. cold
           » yes: modified word(s) accessed during lifetime? no: 3. cold-false-sharing; yes: 4. cold-true-sharing
• Other reason for miss: reason for elimination of last copy
      – invalidation: old copy with state = invalid still there?
           » no: modified word(s) accessed during lifetime? no: 5. inval-cap-false-sharing; yes: 6. inval-cap-true-sharing
           » yes: modified word(s) accessed during lifetime? no: 7. pure-false-sharing; yes: 8. pure-true-sharing
      – replacement: has block been modified since replacement?
           » yes: modified word(s) accessed during lifetime? no: 9. cap-inval-false-sharing; yes: 10. cap-inval-true-sharing
           » no: modified word(s) accessed during lifetime? no: 11. pure-capacity; yes: 12. capacity-true-sharing
• "Modified word accessed during lifetime" means access to word(s) within a block that have been modified since the last “essential” (4,6,8,10,12) miss to this block by this processor
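
The decision tree on this slide can be written as a small function. This is a sketch of one reading of the flowchart, in particular of which branch is "yes" and which is "no" at each test; the function name and argument names are hypothetical.

```python
def classify_miss(first_ref_by_pe, modified_word_accessed, *,
                  first_access_systemwide=False, written_before=False,
                  lost_by="replacement", invalid_copy_present=False,
                  modified_since_replacement=False):
    """Return (category number, name) for one miss, following the flowchart."""
    true_sharing = modified_word_accessed
    if first_ref_by_pe:
        if first_access_systemwide:
            return 1, "pure-cold"
        if not written_before:
            return 2, "cold"
        return (4, "cold-true-sharing") if true_sharing \
            else (3, "cold-false-sharing")
    if lost_by == "invalidation":
        if invalid_copy_present:
            return (8, "pure-true-sharing") if true_sharing \
                else (7, "pure-false-sharing")
        # The invalidated copy was later replaced as well.
        return (6, "inval-cap-true-sharing") if true_sharing \
            else (5, "inval-cap-false-sharing")
    if lost_by == "replacement":
        if modified_since_replacement:
            return (10, "cap-inval-true-sharing") if true_sharing \
                else (9, "cap-inval-false-sharing")
        return (12, "capacity-true-sharing") if true_sharing \
            else (11, "pure-capacity")
    raise ValueError("lost_by must be 'invalidation' or 'replacement'")


print(classify_miss(False, False, lost_by="invalidation",
                    invalid_copy_present=True))
```

In this reading, the even-numbered categories from 4 upward are the "essential" misses named in the note, since they are exactly the ones where a modified word is actually accessed.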




Breakdown of Miss Rates with Block Size
  – Figure: miss rate broken down into cold (COLDMR), capacity (CAPMR), true sharing (TSMR), false sharing (FSMR), and upgrade (UPGMR) components
  – Bars labeled application/block size, with block sizes 8, 16, 32, 64, 128, 256
        » first panel (miss rate axis 0 to 0.006): Barnes, LU, Radiosity
        » second panel (miss rate axis 0 to 0.12): Ocean, Radix, Raytrace
Breakdown (cont)
  – 1 MB Cache
  – Figure: miss rate (axis 0 to 0.008) broken down into COLDMR, CAPMR, TSMR, FSMR, UPGMR
  – Bars: Appl-Code, Appl-Data, OS-Code, OS-Data, each at block sizes 64, 128, 256
Breakdown with 64KB Caches
  – Figure: miss rate (axis 0 to 0.3) broken down into COLDMR, CAPMR, TSMR, FSMR, UPGMR
  – Bars: Ocean, Radix, Raytrace, each at block sizes 8, 16, 32, 64, 128, 256
Traffic
  – Figure: traffic per application/block size, split into Bus and Cmd components
        » Barnes, Radiosity, Raytrace: bytes/instr, axis 0 to 0.18
        » Radix: bytes/instr, axis 0 to 4.5
        » LU, Ocean: bytes/FLOP, axis 0 to 1.8
Traffic with 64 KB caches
  – Figure: traffic per application/block size (8 to 256), split into Bus and Cmd components
        » Radix, Raytrace: bytes/instr, axis 0 to 14
        » Ocean: bytes/FLOP, axis 0 to 6
Traffic SimOS 1 MB
  – Figure: traffic (bytes/instr, axis 0 to 0.6) split into Bus and Cmd components
  – Bars: Appl-Code, Appl-Data, OS-Code, OS-Data, each at block sizes 64, 128, 256
Making Large Blocks More Effective
• Software
    – Improve spatial locality by better data structuring (more later)
    – Compiler techniques
• Hardware
       – Retain granularity of transfer but reduce granularity of
         coherence
            » use subblocks: same tag but different state bits
            » one subblock may be valid but another invalid or dirty
       – Reduce both granularities, but prefetch more blocks on a
         miss
       – Proposals for adjustable cache size
       – More subtle: delay propagation of invalidations and perform
         all at once
            » But can change consistency model: discuss later in
              course
       – Use update instead of invalidate protocols to reduce false
         sharing effect
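
The subblock idea above can be sketched in a few lines of Python: the line keeps one tag and is still fetched as a whole block, but each subblock carries its own state, so a remote write invalidates only the subblock it touches. The class name `SubblockLine`, the three state names, and the 64-byte/16-byte sizes are illustrative assumptions, not taken from the lecture.

```python
from enum import Enum


class State(Enum):
    INVALID = "I"
    SHARED = "S"
    MODIFIED = "M"


class SubblockLine:
    """One cache line: a single tag, but separate coherence state per subblock."""

    def __init__(self, tag, block_bytes=64, subblock_bytes=16):
        self.tag = tag
        self.subblock_bytes = subblock_bytes
        # The whole block arrives on a miss (transfer granularity = block),
        # so every subblock starts out valid.
        self.state = [State.SHARED] * (block_bytes // subblock_bytes)

    def _index(self, offset):
        return offset // self.subblock_bytes

    def remote_write(self, offset):
        """Another processor wrote at this offset: invalidate that subblock only."""
        self.state[self._index(offset)] = State.INVALID

    def local_read_hits(self, offset):
        """A local read hits as long as its own subblock is still valid."""
        return self.state[self._index(offset)] is not State.INVALID


line = SubblockLine(tag=0x40)
line.remote_write(offset=0)          # remote processor writes the word at offset 0
hit_far = line.local_read_hits(48)   # unrelated word elsewhere in the same block
hit_near = line.local_read_hits(8)   # word that shares the written subblock
```

With block-granularity coherence the read at offset 48 would have been a false-sharing miss; here only reads that fall in the written subblock miss.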
Update versus Invalidate
• Much debate over the years: tradeoff depends on
  sharing patterns
• Intuition:
      – If those that used continue to use, and writes between use are
        few, update should do better
           » e.g. producer-consumer pattern
      – If those that use unlikely to use again, or many writes between
        reads, updates not good
           » “pack rat” phenomenon particularly bad under process
             migration
           » useless updates where only last one will be used
• Can construct scenarios where one or other is
  much better
• Can combine them in hybrid schemes (see text)
      – E.g. competitive: observe patterns at runtime and change
        protocol
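
A competitive hybrid can be pictured as an update protocol in which each cached copy gives up after too many updates arrive with no local use in between. The Python sketch below is one simple way to model that idea; the class `CompetitiveCopy`, the threshold of 4, and the `"w"`/`"r"` trace encoding are assumptions made for illustration, not the scheme from the text.

```python
class CompetitiveCopy:
    """A cached copy that accepts updates until they look useless."""

    def __init__(self, threshold=4):
        self.threshold = threshold
        self.remaining = threshold
        self.valid = True

    def local_access(self):
        """Local read. Returns True on a hit; any access resets the counter."""
        hit = self.valid
        self.valid = True                 # on a miss the block is refetched
        self.remaining = self.threshold
        return hit

    def incoming_update(self):
        """Remote write. Returns True if an update was delivered to this copy."""
        if not self.valid:
            return False                  # copy already dropped: no update needed
        self.remaining -= 1
        if self.remaining == 0:
            self.valid = False            # too many unused updates: self-invalidate
        return True


def run(trace, threshold=4):
    """Replay 'w' (remote write) and 'r' (local read) events on one copy."""
    copy = CompetitiveCopy(threshold)
    updates = misses = 0
    for event in trace:
        if event == "w":
            if copy.incoming_update():
                updates += 1
        elif not copy.local_access():
            misses += 1
    return updates, misses


producer_consumer = run("wrwrwrwr")      # every update is used before the next write
pack_rat = run("w" * 10 + "r")           # ten writes, then a single read
```

In this model the producer-consumer trace behaves like pure update (4 updates, 0 misses), while the write-heavy trace stops receiving updates after the threshold and pays one miss, which is closer to invalidate behavior.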
Update vs Invalidate: Miss Rates
  – Figure: miss rate (%) broken down into cold, capacity, true sharing, and false sharing
        » first panel (axis 0.00 to 0.60): LU/inv, LU/upd, Ocean/inv, Ocean/mix, Ocean/upd, Raytrace/inv, Raytrace/upd
        » second panel (axis 0.00 to 2.50): Radix/inv, Radix/mix, Radix/upd
  – Lots of coherence misses: updates help
  – Lots of capacity misses: updates hurt (keep data in cache uselessly)
  – Updates seem to help, but this ignores upgrade and update traffic
Upgrade and Update Rates (Traffic)
  – Figure: upgrade/update rate (%)
        » first panel (axis 0.00 to 2.50): LU/inv, LU/upd, Ocean/inv, Ocean/mix, Ocean/upd, Raytrace/inv, Raytrace/upd
        » second panel (axis 0.00 to 8.00): Radix/inv, Radix/mix, Radix/upd
  – Update traffic is substantial
  – Main cause is multiple writes by a processor before a read by other
        » many bus transactions versus one in invalidation case
        » could delay updates or use merging
  – Overall, trend is away from update based protocols as default
        » bandwidth, complexity, large blocks trend, pack rat for process migration
  – Will see later that updates have greater problems for scalable systems
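
The contrast between many update transactions and a single invalidation can be made concrete with a deliberately simplified count. The function name `coherence_cost`, the protocol labels, and the model itself (one producer, one consumer that starts with a valid copy, no other interference) are assumptions for illustration only.

```python
def coherence_cost(writes_before_read, protocol):
    """Return (bus transactions caused by the writes, consumer read misses).

    Simplified model: one producer writes a shared block
    `writes_before_read` times, then one consumer reads it once.
    """
    if writes_before_read < 1:
        raise ValueError("need at least one write")
    if protocol == "update":
        # Every write is broadcast; the consumer's copy stays valid.
        return writes_before_read, 0
    if protocol == "update-merged":
        # Writes are delayed and merged into a single update before the read.
        return 1, 0
    if protocol == "invalidate":
        # The first write invalidates the consumer's copy; later writes are
        # local, and the consumer misses when it finally reads.
        return 1, 1
    raise ValueError(f"unknown protocol: {protocol}")


table = {
    w: {p: coherence_cost(w, p) for p in ("update", "update-merged", "invalidate")}
    for w in (1, 4, 16)
}
```

Under these assumptions the update column grows linearly with the number of writes between reads, while the invalidate and merged-update columns stay constant.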
Summary
• FSM describes Cache Coherence Algorithm
      – many underlying design choices
      – prove coherence, consistency
• Evaluation must be based on sound
  understanding of workloads
      – drive the factors you want to study
      – representative
      – scaling factors
• Use of workload driven evaluation to resolve
  architectural questions
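
As a reminder of what "an FSM describes the algorithm" means in practice, here is a minimal Python sketch of a basic three-state MSI invalidation protocol written as a transition table, with a check of the single-writer invariant after every access. It is a generic textbook-style illustration under simplifying assumptions (one block, atomic bus), not a specific protocol from these slides; the names `MSI`, `step`, and `simulate` are hypothetical. Checking an invariant on a trace is far weaker than proving coherence.

```python
# (state, event) -> (next_state, bus_action); None means no bus action.
MSI = {
    ("I", "PrRd"):   ("S", "BusRd"),
    ("I", "PrWr"):   ("M", "BusRdX"),
    ("I", "BusRd"):  ("I", None),
    ("I", "BusRdX"): ("I", None),
    ("S", "PrRd"):   ("S", None),
    ("S", "PrWr"):   ("M", "BusRdX"),
    ("S", "BusRd"):  ("S", None),
    ("S", "BusRdX"): ("I", None),
    ("M", "PrRd"):   ("M", None),
    ("M", "PrWr"):   ("M", None),
    ("M", "BusRd"):  ("S", "Flush"),
    ("M", "BusRdX"): ("I", "Flush"),
}


def step(state, event):
    """Look up one transition of the per-cache state machine."""
    return MSI[(state, event)]


def single_writer(states):
    """At most one M copy, and an M copy excludes any S copy."""
    modified = states.count("M")
    return modified == 0 or (modified == 1 and states.count("S") == 0)


def simulate(n_procs, accesses):
    """Apply (proc, 'PrRd' | 'PrWr') accesses to one block across all caches."""
    states = ["I"] * n_procs
    for proc, event in accesses:
        states[proc], bus = step(states[proc], event)
        if bus is not None:
            # Every other cache snoops the bus transaction.
            for other in range(n_procs):
                if other != proc:
                    states[other], _ = step(states[other], bus)
        assert single_writer(states), states
    return states


final = simulate(3, [(0, "PrRd"), (1, "PrRd"), (2, "PrWr"), (0, "PrRd")])
```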



