CSCE 930 Advanced Computer Architecture

Introduction

Adapted from
Professor David Patterson & David Culler
Electrical Engineering and Computer Sciences
University of California, Berkeley
Outline
• Computer Science at a Crossroads: Parallelism
      – Architecture: multi-core and many-cores
      – Program: multi-threading
• Parallel Architecture
      –    What is Parallel Architecture?
      –    Why Parallel Architecture?
      –    Evolution and Convergence of Parallel Architectures
      –    Fundamental Design Issues
• Parallel Programs
      – Why bother with programs?
      – Important for whom?
• Memory & Storage Subsystem Architectures


 Crossroads: Conventional Wisdom in Comp. Arch
• Old Conventional Wisdom: Power is free, Transistors expensive
• New Conventional Wisdom: “Power wall” Power expensive, Xtors free
  (Can put more on chip than can afford to turn on)
• Old CW: Sufficiently increasing Instruction Level Parallelism via
  compilers and hardware innovation (out-of-order, speculation, VLIW, …)
  keeps performance growing
• New CW: “ILP wall”: law of diminishing returns on more HW for ILP
• Old CW: Multiplies are slow, Memory access is fast
• New CW: “Memory wall” Memory slow, multiplies fast
  (200 clock cycles to DRAM memory, 4 clocks for multiply)
• Old CW: Uniprocessor performance 2X / 1.5 yrs
• New CW: Power Wall + ILP Wall + Memory Wall = Brick Wall
   – Uniprocessor performance now 2X / 5(?) yrs
 ⇒ Sea change in chip design: multiple “cores”
      (2X processors per chip / ~ 2 years)
       » More, simpler processors are more power efficient
Crossroads: Uniprocessor Performance

[Figure: uniprocessor performance relative to the VAX-11/780, 1978–2006, log scale.
 From Hennessy and Patterson, Computer Architecture: A Quantitative Approach,
 4th edition, October 2006.]
  • VAX       : 25%/year, 1978 to 1986
  • RISC + x86: 52%/year, 1986 to 2002
  • RISC + x86: ??%/year, 2002 to present

       Sea Change in Chip Design
• Intel 4004 (1971): 4-bit processor,
  2312 transistors, 0.4 MHz,
  10 micron PMOS, 11 mm2 chip

• RISC II (1983): 32-bit, 5 stage
  pipeline, 40,760 transistors, 3 MHz,
  3 micron NMOS, 60 mm2 chip

• 125 mm2 chip, 0.065 micron CMOS
  = 2312 RISC II+FPU+Icache+Dcache
    – RISC II shrinks to ~ 0.02 mm2 at 65 nm
    – Caches via DRAM or 1 transistor SRAM (www.t-ram.com) ?
    – Proximity Communication via capacitive coupling at > 1 TB/s ?
      (Ivan Sutherland @ Sun / Berkeley)



   • Processor is the new transistor?
Déjà vu all over again?

• Multiprocessors imminent in 1970s, ’80s, ’90s, …
• “… today’s processors … are nearing an impasse as
  technologies approach the speed of light …”
          David Mitchell, The Transputer: The Time Is Now (1989)
• Transputer was premature
   ⇒ Custom multiprocessors strove to lead uniprocessors
   ⇒ Procrastination rewarded: 2X seq. perf. / 1.5 years
• “We are dedicating all of our future product development to
  multicore designs. … This is a sea change in computing”
                              Paul Otellini, President, Intel (2004)
• Difference is all microprocessor companies switch to
  multiprocessors (AMD, Intel, IBM, Sun; all new Apples 2 CPUs)
   ⇒ Procrastination penalized: 2X sequential perf. / 5 yrs
   ⇒ Biggest programming challenge: 1 to 2 CPUs

    Problems with Sea Change

•        Algorithms, Programming Languages, Compilers,
         Operating Systems, Architectures, Libraries, … not
         ready to supply Thread Level Parallelism or Data
         Level Parallelism for 1000 CPUs / chip,
•        Architectures not ready for 1000 CPUs / chip
     •         Unlike Instruction Level Parallelism, this cannot be solved by
               computer architects and compiler writers alone, but it also
               cannot be solved without the participation of computer architects
•        This course explores ILP (Instruction Level
         Parallelism) and its shift to Thread Level Parallelism
         / Data Level Parallelism

           What is Parallel Architecture?

• A parallel computer is a collection of processing
  elements that cooperate to solve large problems
  fast
• Some broad issues:
     – Resource Allocation:
         » how large a collection?
         » how powerful are the elements?
         » how much memory?
     – Data access, Communication and Synchronization
         » how do the elements cooperate and communicate?
         » how are data transmitted between processors?
         » what are the abstractions and primitives for cooperation?
     – Performance and Scalability
         » how does it all translate into performance?
         » how does it scale?
     Why Study Parallel Architecture?

Role of a computer architect:
           To design and engineer the various levels of a computer
           system to maximize performance and programmability
           within limits of technology and cost.



Parallelism:
    • Provides alternative to faster clock for performance
    • Applies at all levels of system design
    • Is a fascinating perspective from which to view
      architecture
    • Is increasingly central in information processing

                 Why Study it Today?

• History: diverse and innovative organizational
  structures, often tied to novel programming models

• Rapidly maturing under strong technological
  constraints
  – The “killer micro” is ubiquitous
  – Laptops and supercomputers are fundamentally similar!
  – Technological trends cause diverse approaches to converge
• Technological trends make parallel computing
  inevitable
  – In the mainstream with the reality of multi-cores and many-cores
• Need to understand fundamental principles and
  design tradeoffs, not just taxonomies
  – Naming, Ordering, Replication, Communication performance
     Inevitability of Parallel Computing
• Application demands: Our insatiable need for computing
  cycles
  – Scientific computing: VR simulations in Biology, Chemistry, Physics, ...
  – General-purpose computing: Video, Graphics, CAD, Databases, AR, VI,
    TP...
• Technology Trends
  – Number of cores on chip growing rapidly (the new Moore's Law)
  – Clock rates expected to go up only slowly (tech. wall)
• Architecture Trends
  – Instruction-level parallelism valuable but limited
  – Coarser-level parallelism, or thread-level parallelism, the most viable
    approach
• Economics
• Current trends:
  – Today’s microprocessors are multiprocessors

                    Application Trends

• Demand for cycles fuels advances in hardware, and vice-
  versa
    – Cycle drives exponential increase in microprocessor performance
    – Drives parallel architecture harder: most demanding applications
• Range of performance demands
    – Need range of system performance with progressively increasing cost
    – Platform pyramid
• Goal of applications in using parallel machines: Speedup
• Speedup(p processors) = Performance(p processors) / Performance(1 processor)

• For a fixed problem size (input data set), performance = 1/time

• Speedup_fixed problem(p processors) = Time(1 processor) / Time(p processors)
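
A minimal Python sketch of the fixed-problem speedup definition above (the run
times are hypothetical, for illustration only):

```python
def speedup_fixed_problem(time_1proc: float, time_pprocs: float) -> float:
    """Fixed-problem speedup: Time(1 processor) / Time(p processors)."""
    return time_1proc / time_pprocs

# Hypothetical measurements: 120 s on 1 processor, 20 s on 8 processors.
print(speedup_fixed_problem(120.0, 20.0))   # 6.0x, i.e. 75% parallel efficiency on 8
```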
           Scientific Computing Demand

[Figure not reproduced.]

         Engineering Computing Demand
• Large parallel machines a mainstay in many industries
   – Petroleum (reservoir analysis)
   – Automotive (crash simulation, drag analysis, combustion efficiency),
   – Aeronautics (airflow analysis, engine efficiency, structural mechanics,
     electromagnetism),
   – Computer-aided design
   – Pharmaceuticals (molecular modeling)
   – Visualization
         » In all of the above
         » Entertainment (3D films like Avatar & 3D games )
         » Architecture (walk-throughs and rendering)
       » Virtual Reality/Immersion (museums, teleporting, etc)
   – Financial modeling (yield and derivative analysis)
   – Etc.


 Learning Curve for Parallel Applications




• AMBER molecular dynamics simulation program
• Starting point was vector code for Cray-1
• 145 MFLOPS on Cray C90, 406 for final version on 128-processor
  Paragon, 891 on 128-processor Cray T3D

              Commercial Computing

• Also relies on parallelism for high end
   – Scale not so large, but use much more wide-spread
   – Computational power determines scale of business that can be handled
• Databases, online-transaction processing, decision
  support, data mining, data warehousing ...
• TPC benchmarks (TPC-C order entry, TPC-D decision
  support)
   – Explicit scaling criteria provided
   – Size of enterprise scales with size of system
   – Problem size no longer fixed as p increases, so throughput is used as a
     performance measure (transactions per minute or tpm)




               Similar Story for Storage
• Divergence between memory capacity and speed more
  pronounced
    – Capacity increased by 1000x from 1980-95, speed only 2x
    – Gigabit DRAM by c. 2000, but gap with processor speed much greater
• Larger memories are slower, while processors get
  faster
    – Need to transfer more data in parallel
    – Need deeper cache hierarchies
    – How to organize caches?
• Parallelism increases effective size of each level of
  hierarchy, without increasing access time
• Parallelism and locality within memory systems too
    – New designs fetch many bits within memory chip; follow with fast
      pipelined transfer across narrower interface
    – Buffer caches most recently accessed data
• Disks too: Parallel disks plus caching
Real-world applications demand high-performing and reliable storage

[Figure: representative data-set sizes]
  • High-performance computing: >100 TB
  • Medical imaging: >100 TB
  • Digital body: 1 TB/body
  • NASA: >1 PB
  • H.B Glass: 5 GB/day
  • GIS: >1 PB
  • Ocean resource data: >1 PB
  • Google, Yahoo, …: >1 PB/year
  • Oil prospecting: >1 PB

1 PB = 1000 TB = 10^15 bytes, equal to the capacity of 10,000 100-GB disks.

    Technology Trends: Moore’s Law: 2X transistors /
    “year” ⇒ 2X cores / “year”




•   “Cramming More Components onto Integrated Circuits”
      – Gordon Moore, Electronics, 1965
•   # of transistors per cost-effective integrated circuit doubles every N months (12 ≤ N ≤ 24)
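
A rough sketch of the doubling rule stated above (the starting transistor count
and the 24-month doubling period are assumptions for illustration, chosen from
the slide's 12 ≤ N ≤ 24 range):

```python
def transistors(t_months: float, n0: float, doubling_period_months: float) -> float:
    """Transistor count after t_months if it doubles every doubling_period_months."""
    return n0 * 2 ** (t_months / doubling_period_months)

# With a 24-month doubling period, a 1M-transistor design grows to ~32M
# transistors over 10 years (5 doublings).
print(transistors(120, 1e6, 24) / 1e6)   # ~32.0 (millions of transistors)
```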
    Tracking Technology Performance Trends
• Drill down into 4 technologies:
   –   Disks,
   –   Memory,
   –   Network,
   –   Processors
• Compare ~1980 Archaic (Nostalgic) vs.
  ~2000 Modern (Newfangled)
   – Performance Milestones in each technology
• Compare for Bandwidth vs. Latency improvements
  in performance over time
• Bandwidth: number of events per unit time
   – E.g., M bits / second over network, M bytes / second from disk
• Latency: elapsed time for a single event
   – E.g., one-way network delay in microseconds,
     average disk access time in milliseconds

Disks: Archaic(Nostalgic) v. Modern(Newfangled)

  •    CDC Wren I, 1983                  • Seagate 373453, 2003
  •    3600 RPM                          • 15000 RPM                (4X)
  •    0.03 GBytes capacity              • 73.4 GBytes           (2500X)
  •    Tracks/Inch: 800                  • Tracks/Inch: 64000      (80X)
  •    Bits/Inch: 9550                   • Bits/Inch: 533,000      (60X)
  •    Three 5.25” platters              • Four 2.5” platters
                                           (in 3.5” form factor)
  • Bandwidth:                           • Bandwidth:
    0.6 MBytes/sec                         86 MBytes/sec          (140X)
  • Latency: 48.3 ms                     • Latency: 5.7 ms          (8X)
  • Cache: none                          • Cache: 8 MBytes
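
To see how the latency and bandwidth figures above combine, a minimal sketch of
the time for one sequential read (the 1 MB request size is an assumption for
illustration; the latency and bandwidth numbers come from this slide):

```python
def read_time_ms(size_mb: float, latency_ms: float, bw_mb_per_s: float) -> float:
    """Access latency plus transfer time for one sequential read."""
    return latency_ms + size_mb / bw_mb_per_s * 1000.0

wren_1983 = read_time_ms(1.0, 48.3, 0.6)      # ~1715 ms on the CDC Wren I
seagate_2003 = read_time_ms(1.0, 5.7, 86.0)   # ~17 ms on the Seagate 373453
print(round(wren_1983), round(seagate_2003), round(wren_1983 / seagate_2003))  # ~99x overall
```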


    Latency Lags Bandwidth (for last ~20 years)

[Log-log plot: relative bandwidth improvement vs. relative latency improvement,
 with the reference line "latency improvement = bandwidth improvement".]

• Performance Milestones
• Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
  (latency = simple operation w/o contention; BW = best-case)

 Memory: Archaic (Nostalgic) v. Modern (Newfangled)
• 1980 DRAM                       • 2000 Double Data Rate Synchr.
  (asynchronous)                    (clocked) DRAM
• 0.06 Mbits/chip                 • 256.00 Mbits/chip       (4000X)
• 64,000 xtors, 35 mm2            • 256,000,000 xtors, 204 mm2
• 16-bit data bus per             • 64-bit data bus per
  module, 16 pins/chip              DIMM, 66 pins/chip         (4X)
• 13 Mbytes/sec                   • 1600 Mbytes/sec          (120X)
• Latency: 225 ns                 • Latency: 52 ns             (4X)
• (no block transfer)             • Block transfers (page mode)




   Latency Lags Bandwidth (last ~20 years)

[Log-log plot: relative bandwidth improvement vs. relative latency improvement
 for memory and disk, with the reference line "latency improvement = bandwidth
 improvement".]

• Performance Milestones
• Memory Module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM,
  DDR SDRAM (4x, 120x)
• Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
  (latency = simple operation w/o contention; BW = best-case)

LANs: Archaic (Nostalgic)v. Modern (Newfangled)


    • Ethernet 802.3                       • Ethernet 802.3ae
    • Year of Standard: 1978               • Year of Standard: 2003
    • 10 Mbits/s                           • 10,000 Mbits/s     (1000X)
      link speed                             link speed
    • Latency: 3000 μsec                       • Latency: 190 μsec    (15X)
    • Shared media                         • Switched media
    • Coaxial cable                        • Category 5 copper wire
                                               "Cat 5" is 4 twisted pairs in bundle
Coaxial cable: plastic covering, braided outer conductor, insulator, copper core.
Twisted pair: copper, 1 mm thick, twisted to avoid antenna effect.

   Latency Lags Bandwidth (last ~20 years)

[Log-log plot: relative bandwidth improvement vs. relative latency improvement
 for network, memory, and disk.]

• Performance Milestones
• Ethernet: 10Mb, 100Mb, 1000Mb, 10000 Mb/s (16x, 1000x)
• Memory Module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM,
  DDR SDRAM (4x, 120x)
• Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
  (latency = simple operation w/o contention; BW = best-case)

    CPUs: Archaic (Nostalgic) v. Modern (Newfangled)

• 1982 Intel 80286                       • 2001 Intel Pentium 4
• 12.5 MHz                               • 1500 MHz                (120X)
• 2 MIPS (peak)                          • 4500 MIPS (peak)       (2250X)
• Latency 320 ns                         • Latency 15 ns            (20X)
• 134,000 xtors, 47 mm2                  • 42,000,000 xtors, 217 mm2
• 16-bit data bus, 68 pins               • 64-bit data bus, 423 pins
• Microcode interpreter,                 • 3-way superscalar,
  separate FPU chip                        Dynamic translate to RISC,
• (no caches)                              Superpipelined (22 stage),
                                           Out-of-Order execution
                                         • On-chip 8KB Data caches,
                                           96KB Instr. Trace cache,
                                           256KB L2 cache
Latency Lags Bandwidth (last ~20 years)

[Log-log plot: relative bandwidth improvement vs. relative latency improvement
 for processor, network, memory, and disk. CPU is high, memory is low (the
 “Memory Wall”).]

• Performance Milestones
• Processor: ’286, ’386, ’486, Pentium, Pentium Pro, Pentium 4 (21x, 2250x)
• Ethernet: 10Mb, 100Mb, 1000Mb, 10000 Mb/s (16x, 1000x)
• Memory Module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM,
  DDR SDRAM (4x, 120x)
• Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
  (latency = simple operation w/o contention; BW = best-case)

Rule of Thumb for Latency Lagging BW

  • In the time that bandwidth doubles, latency
    improves by no more than a factor of 1.2 to 1.4
     (and capacity improves faster than bandwidth)
  • Stated alternatively:
    Bandwidth improves by more than the square
    of the improvement in Latency
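
A quick check of this rule of thumb against the milestone ratios quoted on the
surrounding slides (latency and bandwidth improvements per technology):

```python
# (latency improvement, bandwidth improvement) over the ~20-year period, per the slides
milestones = {
    "disk":      (8,  143),
    "memory":    (4,  120),
    "ethernet":  (16, 1000),
    "processor": (21, 2250),
}
for name, (lat, bw) in milestones.items():
    # Rule of thumb: bandwidth improves by more than the square of latency improvement.
    print(f"{name:9s}  latency^2 = {lat**2:4d}   bandwidth = {bw:4d}   holds: {bw > lat**2}")
```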




        6 Reasons Latency Lags Bandwidth

1. Moore’s Law helps BW more than latency
  •      Faster transistors, more transistors,
         more pins help Bandwidth
        » MPU Transistors:           0.130 vs. 42 M xtors             (300X)
        » DRAM Transistors:          0.064 vs. 256 M xtors           (4000X)
        » MPU Pins:                  68 vs. 423 pins                    (6X)
        » DRAM Pins:                 16 vs. 66 pins                     (4X)
  •      Smaller, faster transistors but communicate
         over (relatively) longer lines: limits latency
        » Feature size:              1.5 to 3 vs. 0.18 micron       (8X,17X)
        » MPU Die Size:              47 vs. 217 mm2        (ratio sqrt ⇒ 2X)
        » DRAM Die Size:             35 vs. 204 mm2        (ratio sqrt ⇒ 2X)



 6 Reasons Latency Lags Bandwidth (cont’d)

2. Distance limits latency
   •    Size of DRAM block ⇒ long bit and word lines
        ⇒ most of DRAM access time
   •    Speed of light and computers on network
   •    1. & 2. explain linear latency vs. square BW?
3. Bandwidth easier to sell (“bigger=better”)
   •    E.g., 10 Gbits/s Ethernet (“10 Gig”) vs.
                10 msec latency Ethernet
   •    4400 MB/s DIMM (“PC4400”) vs. 50 ns latency
   •    Even if just marketing, customers now trained
   •    Since bandwidth sells, more resources thrown at bandwidth,
        which further tips the balance




6 Reasons Latency Lags Bandwidth (cont’d)

   4. Latency helps BW, but not vice versa
        •      Spinning disk faster improves both bandwidth and
               rotational latency
              » 3600 RPM ⇒ 15000 RPM = 4.2X
              » Average rotational latency: 8.3 ms ⇒ 2.0 ms (see the sketch after this list)
              » Other things being equal, also helps BW by 4.2X
        •      Lower DRAM latency ⇒
               more accesses/second (higher bandwidth)
        •      Higher linear density helps disk BW
                (and capacity), but not disk latency
              » 9,550 BPI ⇒ 533,000 BPI ⇒ 60X in BW
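
The rotational-latency numbers above follow from waiting half a revolution on
average; a minimal check in Python:

```python
def avg_rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency = time for half a revolution, in ms."""
    return 0.5 * 60_000.0 / rpm   # 60,000 ms per minute

print(avg_rotational_latency_ms(3600))    # 8.33 ms
print(avg_rotational_latency_ms(15000))   # 2.0 ms
print(15000 / 3600)                       # ~4.2x: helps latency and, other things equal, BW
```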




6 Reasons Latency Lags Bandwidth (cont’d)

 5. Bandwidth hurts latency
       •     Queues help Bandwidth, hurt Latency (Queuing Theory; see the sketch after this list)
       •     Adding chips to widen a memory module increases
             Bandwidth but higher fan-out on address lines may
             increase Latency
 6. Operating System overhead hurts
     Latency more than Bandwidth
       •     Long messages amortize overhead;
             overhead bigger part of short messages
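
A minimal sketch of point 5 above, that queues help bandwidth but hurt latency.
An M/M/1 queue is an assumption here for illustration (the slide does not name
a model): driving utilization up keeps the server busier, but the average time
each request spends waiting plus being served grows sharply.

```python
def mm1_time_in_system_ms(arrival_per_s: float, service_per_s: float) -> float:
    """M/M/1 average time in system: W = 1 / (mu - lambda), for lambda < mu."""
    assert arrival_per_s < service_per_s
    return 1000.0 / (service_per_s - arrival_per_s)

# Server handles 1000 requests/s; raising offered load pushes throughput
# (bandwidth) toward 1000/s but inflates latency from 2 ms to 20 ms.
for utilization in (0.50, 0.80, 0.95):
    w = mm1_time_in_system_ms(utilization * 1000.0, 1000.0)
    print(f"utilization {utilization:.0%}: avg time in system {w:.1f} ms")
```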




Summary of Technology Trends

 • For disk, LAN, memory, and microprocessor,
   bandwidth improves by square of latency
   improvement
      – In the time that bandwidth doubles, latency improves by no more
        than 1.2X to 1.4X
 • Lag probably even larger in real systems, as
   bandwidth gains multiplied by replicated components
      –     Multiple processors in a cluster or even in a chip
      –     Multiple disks in a disk array
      –     Multiple memory modules in a large memory
      –     Simultaneous communication in switched LAN
 • HW and SW developers should innovate assuming
   Latency Lags Bandwidth
      – If everything improves at the same rate, then nothing really changes
      – When rates vary, require real innovation
                Architectural Trends

• Architecture translates technology’s gifts to
  performance and capability
• Resolves the tradeoff between parallelism and
  locality
     – Current microprocessor: 1/4 compute, 1/2 cache, 1/4 off-chip
       connect
     – Tradeoffs may change with scale and technology advances


• Four generations of architectural history: tube,
  transistor, IC, VLSI
     – Here focus only on VLSI generation
• Greatest delineation in VLSI has been in type of
  parallelism exploited

                   Architectural Trends

• Greatest trend in VLSI generation is increase in
  parallelism
   – Up to 1985: bit level parallelism: 4-bit -> 8 bit -> 16-bit
         » slows after 32 bit
         » adoption of 64-bit now under way, 128-bit far (not performance
           issue)
         » great inflection point when 32-bit micro and cache fit on a chip
   – Mid 80s to mid 90s: instruction level parallelism
         » pipelining and simple instruction sets, + compiler advances (RISC)
         » on-chip caches and functional units => superscalar execution
         » greater sophistication: out of order execution, speculation,
           prediction
             • to deal with control transfer and latency problems

   – Next step: thread level parallelism
             Architectural Trends: ILP
• Reported speedups for superscalar processors
           • Horst, Harris, and Jardine [1990] ......................                 1.37
           • Wang and Wu [1988] ..........................................            1.70
           • Smith, Johnson, and Horowitz [1989] ..............                       2.30
           • Murakami et al. [1989] ........................................          2.55
           • Chang et al. [1991] .............................................        2.90
           • Jouppi and Wall [1989] ......................................            3.20
           • Lee, Kwok, and Briggs [1991] ...........................                 3.50
           • Wall [1991] ..........................................................   5
           • Melvin and Patt [1991] .......................................           8
           • Butler et al. [1991] .............................................       17+

• Large variance due to difference in
    – application domain investigated (numerical versus non-
      numerical)
    – capabilities of processor modeled
                    ILP Ideal Potential

[Figure: left panel shows the fraction of total cycles (%) vs. number of
 instructions issued (0 to 6+); right panel shows speedup vs. instructions
 issued per cycle (0 to 15).]

• Infinite resources and fetch bandwidth, perfect branch prediction and renaming
   – real caches and non-zero miss latencies

                 Results of ILP Studies
• Concentrate on parallelism for 4-issue machines




 • Realistic studies show only 2-fold speedup
     • Recent studies show that more ILP needs to look
        across threads
  Architectural Trends: Bus-based MPs
•Micro on a chip makes it natural to connect many to shared
memory
   – dominates server and enterprise market, moving down to desktop
•Faster processors began to saturate bus, then bus technology advanced
   – today, range of sizes for bus-based systems, desktop to large servers

[Figure: number of processors in fully configured commercial shared-memory
 (bus-based) systems, 1984–1998; y-axis 0 to 70 processors. Systems plotted
 include Sequent B8000 and B2100, SGI PowerSeries, Symmetry21 and Symmetry81,
 Power, SS690MP 120/140, SS10, SS20, SGI Challenge, Sun SC2000/SC2000E,
 SS1000/SS1000E, AS2100, HP K400, P-Pro, SGI PowerChallenge/XL, AS8400,
 SE10/SE30/SE60/SE70, Sun E6000, CRAY CS6400, and Sun E10000.]

                              Economics

• Commodity microprocessors not only fast but CHEAP
   • Development cost is tens of millions of dollars (5-100 typical)
   • BUT, many more are sold compared to supercomputers
   – Crucial to take advantage of the investment, and use the commodity
      building block
   – Exotic parallel architectures no more than special-purpose

• Multiprocessors being pushed by software vendors
  (e.g. database) as well as hardware vendors
• Standardization by Intel makes small, bus-based
  SMPs commodity
• Desktop: few smaller processors versus one larger
  one?
   – Multiprocessor on a chip

    Consider Scientific Supercomputing

• Proving ground and driver for innovative architecture
  and techniques
   – Market smaller relative to commercial as MPs become mainstream
   – Dominated by vector machines starting in 70s
   – Microprocessors have made huge gains in floating-point performance
       » high clock rates
       » pipelined floating point units (e.g., multiply-add every cycle)
       » instruction-level parallelism
       » effective use of caches (e.g., automatic blocking)
   – Plus economics


• Large-scale multiprocessors replace vector
  supercomputers
   – Most top-performing machines on Top-500 list are multiprocessors
 Summary: Why Parallel Architecture?

• Increasingly attractive
   – Economics, technology, architecture, application demand

• Increasingly central and mainstream
• Parallelism exploited at many levels
   – Instruction-level parallelism
   – Thread-level parallelism (multi-cores and many-cores)
   – Application/node-level parallelism (“MPPs”, clusters, grids, clouds)

• Focus of this class: thread-level parallelism
• Same story from memory/storage system perspective but
  with a “twist” (data-intensive computing, etc)
• Wide range of parallel architectures make sense
   – Different cost, performance and scalability

 Convergence of Parallel Architectures




                                    History

Historically, parallel architectures tied to programming models
  • Divergent architectures, with no predictable pattern of growth.

[Figure: application software and system software sitting on a common
 "Architecture" box, with divergent approaches radiating from it: Systolic
 Arrays, SIMD, Message Passing, Dataflow, and Shared Memory.]

   • Uncertainty of direction paralyzed parallel software development!

                                 Today

• Extension of “computer architecture” to support
  communication and cooperation
   – OLD: Instruction Set Architecture
   – NEW: Communication Architecture

• Defines
   – Critical abstractions, boundaries, and primitives (interfaces)
   – Organizational structures that implement interfaces (hw or sw)


• Compilers, libraries and OS are important bridges
  today



Modern Layered Framework

[Figure: layered view of a parallel system, top to bottom:]
  • Parallel applications: CAD, database, scientific modeling
  • Programming models: multiprogramming, shared address, message passing, data parallel
  • Compilation or library   (communication abstraction; user/system boundary)
  • Operating systems support   (hardware/software boundary)
  • Communication hardware
  • Physical communication medium

Programming Model
• What programmer uses in coding applications
• Specifies communication and synchronization
• Examples:
      – Multiprogramming: no communication or synch. at program
        level
      – Shared address space: like bulletin board
      – Message passing: like letters or phone calls, explicit point to
        point
      – Data parallel: more regimented, global actions on data
          » Implemented with shared address space or message
            passing




Communication Abstraction
• User level communication primitives provided
      – Realizes the programming model
      – Mapping exists between language primitives of programming
        model and these primitives
• Supported directly by hw, or via OS, or via user
  sw
• Lot of debate about what to support in sw and
  gap between layers
• Today:
      – Hw/sw interface tends to be flat, i.e. complexity roughly
        uniform
      – Compilers and software play important roles as bridges today
      – Technology trends exert strong influence
• Result is convergence in organizational structure
      – Relatively simple, general purpose communication primitives
            Communication Architecture
            • = User/System Interface + Implementation
• User/System Interface:
  – Comm. primitives exposed to user-level by hw and system-level sw

• Implementation:
  – Organizational structures that implement the primitives: hw or OS
  – How optimized are they? How integrated into processing node?
  – Structure of network

• Goals:
  – Performance
  – Broad applicability
  – Programmability
  – Scalability
  – Low Cost


        Evolution of Architectural Models

• Historically machines tailored to programming models
     – Prog. model, comm. abstraction, and machine organization lumped
       together as the “architecture”
• Evolution helps understand convergence
     – Identify core concepts

•   Shared Address Space
•   Message Passing
•   Data Parallel
•   Others:
     – Dataflow
     – Systolic Arrays

• Examine programming model, motivation, intended
  applications, and contributions to convergence
    Shared Address Space Architectures
• Any processor can directly reference any memory location
  – Communication occurs implicitly as result of loads and stores
• Convenient:
  – Location transparency
  – Similar programming model to time-sharing on uniprocessors
     » Except processes run on different processors
     » Good throughput on multiprogrammed workloads
• Naturally provided on wide range of platforms
  – History dates at least to precursors of mainframes in early 60s
  – Wide range of scale: few to hundreds of processors
• Popularly known as shared memory machines or model
  – Ambiguous: memory may be physically distributed among processors




  Shared Address Space Model
• Process: virtual address space plus one or more threads of
  control
• Portions of address spaces of processes are shared
[Figure: virtual address spaces for a collection of processes communicating via
 shared addresses. A store by P0 and a load by Pn in the shared portion map to
 the same common physical addresses; each process also keeps a private portion
 of its address space (P0 private, P1 private, P2 private, ..., Pn private).]

•Writes to shared address visible to other threads (in other processes too)
•Natural extension of uniprocessors model: conventional memory
operations for comm.; special atomic operations for synchronization
•OS uses shared memory to coordinate processes
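
A minimal sketch of this model using Python threads (shared variables in one
address space for communication, a lock standing in for the special atomic
operations; the counter example is illustrative, not from the slides):

```python
import threading

counter = 0                # lives in the shared portion of the address space
lock = threading.Lock()    # stands in for the "special atomic operations"

def worker(iterations: int) -> None:
    global counter
    for _ in range(iterations):
        with lock:         # synchronization; ordinary loads/stores do the communication
            counter += 1

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)             # 400000: every thread saw and updated the same location
```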
 Communication Hardware
  • Also natural extension of uniprocessor
  • Already have processor, one or more memory
    modules and I/O controllers connected by hardware
    interconnect of some sort
[Figure: one or more processors, several memory modules, and I/O controllers
 connected by an interconnect; I/O devices hang off the I/O controllers.]

Memory capacity increased by adding modules, I/O by controllers
   •Add processors for processing!
   •For higher-throughput multiprogramming, or parallel programs

                                History
• “Mainframe” approach
   – Motivated by multiprogramming
   – Extends crossbar used for mem bw and I/O
   – Originally processor cost limited to small systems
       » later, cost of crossbar
   – Bandwidth scales with p
   – High incremental cost; use multistage instead
   [Figure: processors and I/O ports connected to memory modules (M) through a crossbar.]

• “Minicomputer” approach
   – Almost all microprocessor systems have bus
   – Motivated by multiprogramming, TP
   – Used heavily for parallel computing
   – Called symmetric multiprocessor (SMP)
   – Latency larger than for uniprocessor
   – Bus is bandwidth bottleneck
       » caching is key: coherence problem
   – Low incremental cost
   [Figure: processors with caches ($), memory modules (M), and I/O controllers sharing a single bus.]

      Example: Intel Pentium Pro Quad

[Figure: four P-Pro modules, each a CPU with interrupt controller, 256-KB L2 $,
 and bus interface, on the P-Pro bus (64-bit data, 36-bit address, 66 MHz);
 PCI bridges connect PCI I/O cards, and a memory controller with MIU drives
 1-, 2-, or 4-way interleaved DRAM.]

   – All coherence and multiprocessing glue in processor module
   – Highly integrated, targeted at high volume
   – Low latency and bandwidth

           Example: SUN Enterprise

[Figure: CPU/mem cards (two processors, each with $ and $2 caches, plus a
 memory controller) and I/O cards (bus interface, SBUS slots, 100bT, SCSI,
 2 FiberChannel) plug into the Gigaplane bus (256-bit data, 41-bit address,
 83 MHz) through a bus interface/switch.]

– 16 cards of either type: processors + memory, or I/O
– All memory accessed over bus, so symmetric
– Higher bandwidth, higher latency bus

                                   Scaling Up

[Figure: “Dance hall” organization (processors with caches on one side of a
 network, memory modules on the other) vs. distributed memory (a memory module
 attached to each processor/cache node on the network).]

– Problem is interconnect: cost (crossbar) or bandwidth (bus)
– Dance-hall: bandwidth still scalable, but lower cost than crossbar
    » latencies to memory uniform, but uniformly large
– Distributed memory or non-uniform memory access (NUMA)
    » Construct shared address space out of simple message
      transactions across a general-purpose network (e.g. read-
      request, read-response); see the sketch below
– Caching shared (particularly nonlocal) data?
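
A minimal sketch of the read-request/read-response idea mentioned above, with a
toy in-process "node" (the message format and class are illustrative
assumptions, not any real machine's protocol):

```python
class Node:
    """Toy NUMA home node: local memory plus a request/response handler."""
    def __init__(self, base: int, size: int):
        self.base, self.mem = base, [0] * size

    def handle(self, msg: dict) -> dict:
        # A remote load arrives as a read-request and is answered with a
        # read-response; the requester never touches this memory directly.
        assert msg["type"] == "read-request"
        return {"type": "read-response", "addr": msg["addr"],
                "data": self.mem[msg["addr"] - self.base]}

def shared_read(addr: int, home: Node) -> int:
    reply = home.handle({"type": "read-request", "addr": addr})
    return reply["data"]

home_node = Node(base=0x1000, size=16)
home_node.mem[3] = 42
print(shared_read(0x1003, home_node))   # 42: a load became a message transaction
```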
                Example: Cray T3E

[Figure: each node has a processor with cache ($), local memory, and a combined
 memory controller / network interface (NI), connected to a switch with X, Y,
 and Z links (3-D torus); external I/O attaches through the network.]

   – Scale up to 1024 processors, 480MB/s links
   – Memory controller generates comm. request for nonlocal references
   – No hardware mechanism for coherence (SGI Origin etc. provide this)

             Message Passing Architectures

• Complete computer as building block, including I/O
   – Communication via explicit I/O operations

• Programming model: directly access only private
  address space (local memory), comm. via explicit
  messages (send/receive)

• High-level block diagram similar to distributed-
  memory SAS
   – But comm. integrated at I/O level, needn’t be into memory system
   – Like networks of workstations (clusters), but tighter integration
   – Easier to build than scalable SAS

• Programming model more removed from basic
  hardware operations
   – Library or OS intervention
                Message-Passing Abstraction

[Figure: process P executes Send(X, Q, t), naming local address X, destination
 process Q, and tag t; process Q executes Receive(Y, P, t), naming local
 address Y, source process P, and a matching tag t. Each process has its own
 local address space.]

– Send specifies buffer to be transmitted and receiving process
– Recv specifies sending process and application storage to receive into
– Memory to memory copy, but need to name processes
– Optional tag on send and matching rule on receive
– User process names local data and entities in process/tag space too
– In simplest form, the send/recv match achieves pairwise synch event
    » Other variants too
– Many overheads: copying, buffer management, protection
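
A minimal sketch of this send/receive abstraction using mpi4py (assuming an MPI
installation is available; the payload and tag values are illustrative):

```python
# Run with: mpiexec -n 2 python sendrecv.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    x = {"payload": [1, 2, 3]}       # local data named by the sender
    comm.send(x, dest=1, tag=7)      # Send(X, Q=1, t=7)
elif rank == 1:
    y = comm.recv(source=0, tag=7)   # Receive(Y, P=0, t=7): matched on source and tag
    print("received:", y)
```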
 Evolution of Message-Passing Machines

[Figure: a 3-D hypercube network with nodes labeled 000 through 111, a typical
 topology of early message-passing machines.]

• Early machines: FIFO on each link
   – Hw close to prog. model; synchronous ops
   – Replaced by DMA, enabling non-blocking ops
       » Buffered by system at destination until recv

• Diminishing role of topology
   –   Store&forward routing: topology important
   –   Introduction of pipelined routing made it less so
   –   Cost is in node-network interface
   –   Simplifies programming
Example: IBM SP-2

[Figure: an IBM SP-2 node is a Power 2 CPU with L2 $ on a memory bus, a memory
 controller with 4-way interleaved DRAM, and a MicroChannel I/O bus holding the
 NIC (i860 processor, DMA, NI, DRAM). Nodes are joined by a general
 interconnection network formed from 8-port switches.]

 – Made out of essentially complete RS6000 workstations
 – Network interface integrated in I/O bus (bw limited by I/O bus)

Example: Intel Paragon

[Figure: an Intel Paragon node has two i860 processors with L1 $ on a 64-bit,
 50 MHz memory bus, a memory controller with 4-way interleaved DRAM, and a
 DMA/driver/NI block. Nodes attach to a 2D grid network with a processing node
 at every switch; links are 8 bits wide at 175 MHz, bidirectional. Shown:
 Sandia's Intel Paragon XP/S-based supercomputer.]

   Toward Architectural Convergence
• Evolution and role of software have blurred boundary
      – Send/recv supported on SAS machines via buffers
      – Can construct global address space on MP using hashing
      – Page-based (or finer-grained) shared virtual memory
• Hardware organization converging too
      – Tighter NI integration even for MP (low-latency, high-
        bandwidth)
      – At lower level, even hardware SAS passes hardware
        messages
• Even clusters of workstations/SMPs are parallel systems
      – Emergence of fast system area networks (SAN)
• Programming models distinct, but organizations
  converging
      – Nodes connected by general network and communication
        assists
      – Implementations also converging, at least in high-end
        machines
7/5/2011      CSCE930-Advanced Computer Architecture, Introduction   67
                   Data Parallel Systems
• Programming model
   – Operations performed in parallel on each element of data structure
   – Logically single thread of control, performs sequential or parallel steps
   – Conceptually, a processor associated with each data element

• Architectural model
   – Array of many simple, cheap processors with little memory each
        » Processors don't sequence through instructions
   – Attached to a control processor that issues instructions
   – Specialized and general communication, cheap global synchronization

 Original motivations
    • Matches simple differential equation solvers
    • Centralize high cost of instruction fetch/sequencing

    [Figure: a control processor issuing instructions to a 2D array of PEs.]
  7/5/2011       CSCE930-Advanced Computer Architecture, Introduction                           68
            Application of Data Parallelism
    – Each PE contains an employee record with his/her salary (a C sketch of this
      update appears at the end of this slide)
    If salary > 100K then
       salary = salary * 1.05
    else
       salary = salary * 1.10
    – Logically, the whole operation is a single step
    – Some processors enabled for arithmetic operation, others disabled
• Other examples:
   – Finite differences, linear algebra, ...
   – Document searching, graphics, image processing, ...
• Some early machines:
   – Thinking Machines CM-1, CM-2 (and CM-5)
   – Maspar MP-1 and MP-2,
• Today's on-chip examples:
    – IBM's Cell processor,
    – GPU, GPGPU
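
 As a rough illustration (not from the original slides), the salary update above can be
 written as a single data-parallel loop; the sketch below uses C with OpenMP only as a
 stand-in for a PE array, and the array name, threshold, and raise factors follow the
 slide — everything else is assumed.

    /* Data-parallel sketch of the salary example: conceptually one step
     * applied to every element.  The if/else acts like predication --
     * some "processors" are enabled for each branch. */
    void update_salaries(float *salary, long n)
    {
        #pragma omp parallel for
        for (long i = 0; i < n; i++) {
            if (salary[i] > 100000.0f)
                salary[i] *= 1.05f;      /* ">100K" branch enabled */
            else
                salary[i] *= 1.10f;      /* "else" branch enabled  */
        }
    }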
 7/5/2011       CSCE930-Advanced Computer Architecture, Introduction   69
                Dataflow Architectures
• Represent computation as a graph of essential
  dependences
      – Logical processor at each node, activated by availability of operands
      – Message (tokens) carrying tag of next instruction sent to next processor
      – Tag compared with others in matching store; match fires execution
                a = (b + 1) × (b − c)
                d = c × e
                f = a × d

    [Figure: dataflow graph for the expressions above (+, −, × nodes feeding one
     another), and the token-driven processing pipeline: token store and program store
     feed a waiting/matching stage, then instruction fetch, execute, and form-token
     stages, with a token queue and network connections closing the loop.]
7/5/2011      CSCE930-Advanced Computer Architecture, Introduction                                     70
  Evolution and Convergence
• Key characteristics
   – Ability to name operations, synchronization, dynamic scheduling
• Problems
   –   Operations have locality across them, useful to group together
   –   Handling complex data structures like arrays
   –   Complexity of matching store and memory units
   –   Expose too much parallelism (?)
• Converged to use conventional processors and memory
   – Support for large, dynamic set of threads to map to processors
   – Typically shared address space as well
   – But separation of progr. model from hardware (like data-parallel)
• Lasting contributions:
   – Integration of communication with thread (handler) generation
   – Tightly integrated communication and fine-grained synchronization
   – Remained useful concept for software (compilers etc.)

  7/5/2011      CSCE930-Advanced Computer Architecture, Introduction     71
 Systolic Architectures
   – Replace single processor with array of regular processing elements
   – Orchestrate data flow for high throughput with less memory access


    [Figure: a conventional organization (memory M feeding a single PE) replaced by a
     systolic organization in which memory feeds a linear chain of PEs: PE → PE → … → PE.]
Different from pipelining
  •   Nonlinear array structure, multidirection data flow, each PE may have
      (small) local instruction and data memory

Different from SIMD: each PE may do something different
Initial motivation: VLSI enables inexpensive special-purpose chips
Represent algorithms directly by chips connected in regular pattern

 7/5/2011         CSCE930-Advanced Computer Architecture, Introduction        72
              Systolic Arrays (contd.)
 Example: Systolic array for 1-D convolution
                    y(i) = w1·x(i) + w2·x(i+1) + w3·x(i+2) + w4·x(i+3)

    [Figure: x values stream through a chain of four cells holding weights w4, w3, w2,
     w1; partial results y1, y2, y3 stream out the other way. Each cell computes:
     xout = x;  x = xin;  yout = yin + w·xin.  A plain C reference loop for the same
     computation appears at the end of this slide.]

  – Practical realizations (e.g. iWARP) use quite general processors
      » Enable variety of algorithms on same hardware
  – But dedicated interconnect channels
      » Data transfer directly from register to register across channel
  – Specialized, and same problems as SIMD
      » General purpose systems work well for same algorithms (locality
        etc.)
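
 For reference, a plain (non-systolic) C loop for the same 1-D convolution is sketched
 below; a systolic array computes the same result but streams x and the partial y values
 through the PE chain instead of re-reading memory. Function and array names are
 illustrative, and x is assumed to hold n+3 valid samples.

    /* Reference computation of y(i) = w1*x(i) + w2*x(i+1) + w3*x(i+2) + w4*x(i+3).
     * A systolic array gets the same answer by streaming x and y through a
     * chain of cells, each doing yout = yin + w * xin. */
    void conv1d_4tap(const float *x, const float w[4], float *y, int n)
    {
        for (int i = 0; i < n; i++) {       /* one output per position        */
            float acc = 0.0f;
            for (int k = 0; k < 4; k++)     /* four taps, as on the slide     */
                acc += w[k] * x[i + k];     /* requires x[0..n+2] to be valid */
            y[i] = acc;
        }
    }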
 7/5/2011    CSCE930-Advanced Computer Architecture, Introduction                              73
Convergence: Generic Parallel Architecture
• A generic modern multiprocessor

    [Figure: a generic multiprocessor — many nodes connected by a scalable network; each
     node contains one or more processors (P) with caches ($), memory (Mem), and a
     communication assist (CA) that interfaces the node to the network.]
Node: processor(s), memory system, plus communication assist
   • Network interface and communication controller
• Scalable network
• Convergence allows lots of innovation, now within framework
   • Integration of assist with node, what operations, how efficiently...

 7/5/2011     CSCE930-Advanced Computer Architecture, Introduction     74
            Fundamental Design Issues




7/5/2011    CSCE930-Advanced Computer Architecture, Introduction   75
    Understanding Parallel Architecture

• Traditional taxonomies not very useful
• Programming models not enough, nor hardware
  structures
   – Same one can be supported by radically different architectures
• Architectural distinctions that affect software
   – Compilers, libraries, programs
• Design of user/system and hardware/software interface
   – Constrained from above by progr. models and below by technology
• Guiding principles provided by layers
   – What primitives are provided at communication abstraction
   – How programming models map to these
   – How they are mapped to hardware


  7/5/2011    CSCE930-Advanced Computer Architecture, Introduction    76
               Fundamental Design Issues
• At any layer, interface (contract) aspect and performance
  aspects
   – Naming: How are logically shared data and/or processes referenced?
    – Operations: What operations are provided on these data?
   – Ordering: How are accesses to data ordered and coordinated?
   – Replication: How are data replicated to reduce communication?
   – Communication Cost: Latency, bandwidth, overhead, occupancy

• Understand at programming model first, since that sets
  requirements

 • Other issues
    – Node granularity: How to split between processors and memory?
    – ...
    7/5/2011    CSCE930-Advanced Computer Architecture, Introduction   77
            Sequential Programming Model

• Contract
   – Naming: Can name any variable in virtual address space
       » Hardware (and perhaps compilers) does translation to physical
         addresses
   – Operations: Loads and Stores
   – Ordering: Sequential program order


• Performance
   –   Rely on dependences on single location (mostly): dependence order
   –   Compilers and hardware violate other orders without getting caught
   –   Compiler: reordering and register allocation
   –   Hardware: out of order, pipeline bypassing, write buffers
   –   Transparent replication in caches


 7/5/2011      CSCE930-Advanced Computer Architecture, Introduction   78
Shared Address Space (SAS) Programming Model

• Naming: Any process can name any variable in shared
  space
• Operations: loads and stores, plus those needed for
  ordering
• Simplest Ordering Model:
    –   Within a process/thread: sequential program order
    –   Across threads: some interleaving (as in time-sharing)
    –   Additional orders through synchronization
    –   Again, compilers/hardware can violate orders without getting caught
             » Different, more subtle ordering models also possible (discussed later)




  7/5/2011          CSCE930-Advanced Computer Architecture, Introduction          79
                      Synchronization

• Mutual exclusion (locks)
   – Ensure certain operations on certain data can be performed by
     only one process at a time
   – Room that only one person can enter at a time
   – No ordering guarantees

• Event synchronization
   – Ordering of events to preserve dependences
       » e.g. producer —> consumer of data
   – 3 main types:
       » point-to-point
       » global
       » group
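
 A minimal sketch of lock-based mutual exclusion (the first bullet above) using POSIX
 threads — one possible realization, assumed here rather than prescribed by the slides;
 the shared variable and function names are illustrative.

    #include <pthread.h>

    /* Only one thread at a time can be inside the lock/unlock pair (the
     * "room" in the analogy above); no ordering among threads is implied. */
    static float shared_sum = 0.0f;
    static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

    void add_contribution(float my_part)
    {
        pthread_mutex_lock(&sum_lock);      /* enter critical section */
        shared_sum += my_part;
        pthread_mutex_unlock(&sum_lock);    /* leave critical section */
    }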



 7/5/2011    CSCE930-Advanced Computer Architecture, Introduction    80
 Message Passing Programming Model

• Naming: Processes can name private data directly.
   – No shared address space
• Operations: Explicit communication through send and
  receive
   – Send transfers data from private address space to another process
   – Receive copies data from process to private address space
   – Must be able to name processes
• Ordering:
   – Program order within a process
   – Send and receive can provide pt to pt synch between processes
   – Mutual exclusion inherent
• Can construct global address space:
   – Process number + address within process address space
   – But no direct operations on these names
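
 A minimal message-passing sketch using MPI as one concrete realization of this model
 (the slides do not assume MPI specifically): processes are named by rank, and a matched
 send/receive pair both moves data and provides point-to-point synchronization.

    #include <mpi.h>
    #include <stdio.h>

    /* Rank 0 sends one float to rank 1; the receive names the sender and
     * the tag explicitly.  Run with at least two processes. */
    int main(int argc, char **argv)
    {
        int rank;
        float value = 0.0f;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            value = 3.14f;
            MPI_Send(&value, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %f\n", value);
        }
        MPI_Finalize();
        return 0;
    }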
 7/5/2011    CSCE930-Advanced Computer Architecture, Introduction   81
        Design Issues Apply at All Layers
 • Prog. model's position provides constraints/goals for
  system
• In fact, each interface between layers supports or takes
  a position on:
   – Naming model
   – Set of operations on names
   – Ordering model
   – Replication
   – Communication performance

• Any set of positions can be mapped to any other by
  software
 • Let's see issues across layers
   – How lower layers can support contracts of programming models
   – Performance issues
  7/5/2011    CSCE930-Advanced Computer Architecture, Introduction   82
                Naming and Operations
• Naming and operations in programming model can be directly
  supported by lower levels, or translated by compiler, libraries or OS

• Example: Shared virtual address space in programming model
• Hardware interface supports shared physical address space
   – Direct support by hardware through v-to-p mappings, no software layers
• Hardware supports independent physical address spaces
   – Can provide SAS through OS, so in system/user interface
       » v-to-p mappings only for data that are local
       » remote data accesses incur page faults; brought in via page fault
         handlers
       » same programming model, different hardware requirements and cost
         model
   – Or through compilers or runtime, so above sys/user interface
       » shared objects, instrumentation of shared accesses, compiler
         support

    7/5/2011    CSCE930-Advanced Computer Architecture, Introduction   83
Naming and Operations (contd)
  • Example: Implementing Message Passing
  • Direct support at hardware interface
       – But match and buffering benefit from more flexibility
  • Support at sys/user interface or above in software (almost
    always)
       – Hardware interface provides basic data transport (well suited)
       – Send/receive built in sw for flexibility (protection, buffering)
       – Choices at user/system interface:
           » OS each time: expensive
           » OS sets up once/infrequently, then little sw involvement each
             time
       – Or lower interfaces provide SAS, and send/receive built on top with
         buffers and loads/stores

  • Need to examine the issues and tradeoffs at every layer
       – Frequencies and types of operations, costs
  7/5/2011     CSCE930-Advanced Computer Architecture, Introduction   84
                                 Ordering

• Message passing: no assumptions on orders across
  processes except those imposed by send/receive pairs

 • SAS: How processes see the order of other processes'
  references defines semantics of SAS
   –   Ordering very important and subtle
   –   Uniprocessors play tricks with orders to gain parallelism or locality
   –   These are more important in multiprocessors
   –   Need to understand which old tricks are valid, and learn new ones
   –   How programs behave, what they rely on, and hardware implications




  7/5/2011       CSCE930-Advanced Computer Architecture, Introduction   85
                             Replication
• Very important for reducing data transfer/communication
• Again, depends on naming model
• Uniprocessor: caches do it automatically
   – Reduce communication with memory
• Message Passing naming model at an interface
   – A receive replicates, giving a new name; subsequently use new name
   – Replication is explicit in software above that interface
• SAS naming model at an interface
   –   A load brings in data transparently, so can replicate transparently
   –   Hardware caches do this, e.g. in shared physical address space
   –   OS can do it at page level in shared virtual address space, or objects
   –   No explicit renaming, many copies for same name: coherence problem
         » in uniprocessors, “coherence” of copies is natural in memory
           hierarchy

  7/5/2011      CSCE930-Advanced Computer Architecture, Introduction   86
              Communication Performance

• Performance characteristics determine usage of
  operations at a layer
   – Programmer, compilers etc make choices based on this
• Fundamentally, three characteristics:
   – Latency: time taken for an operation
   – Bandwidth: rate of performing operations
   – Cost: impact on execution time of program
 • If processor does one thing at a time: bandwidth ∝
   1/latency
   – But actually more complex in modern systems
• Characteristics apply to overall operations, as well as
  individual components of a system, however small
• We‟ll focus on communication or data transfer across
  nodes
   7/5/2011     CSCE930-Advanced Computer Architecture, Introduction   87
Outline
• Computer Science at a Crossroads: Parallelism
      – Architecture: multi-core and many-cores
      – Program: multi-threading
• Parallel Architecture
      –    What is Parallel Architecture?
      –    Why Parallel Architecture?
      –    Evolution and Convergence of Parallel Architectures
      –    Fundamental Design Issues
• Parallel Programs
      – Why bother with programs?
      – Important for whom?
• Memory & Storage Subsystem Architectures


7/5/2011         CSCE930-Advanced Computer Architecture, Introduction   88
Why Bother with Programs?
 • They're what runs on the machines we design
      – Helps make design decisions
      – Helps evaluate systems tradeoffs
• Led to the key advances in uniprocessor
  architecture
      – Caches and instruction set design
• More important in multiprocessors
      – New degrees of freedom
      – Greater penalties for mismatch between program and
        architecture




7/5/2011      CSCE930-Advanced Computer Architecture, Introduction   89
Important for Whom?
• Algorithm designers
      – Designing algorithms that will run well on real systems
• Programmers
      – Understanding key issues and obtaining best performance
• Architects
      – Understand workloads, interactions, important degrees of
        freedom
      – Valuable for design and for evaluation




7/5/2011      CSCE930-Advanced Computer Architecture, Introduction   90
Software
• 1. Parallel programs
      – Process of parallelization
      – What parallel programs look like in major programming
        models

• 2. Programming for performance
      – Key performance issues and architectural interactions

• 3. Workload-driven architectural evaluation
      – Beneficial for architects and for users in procuring machines

 • Unlike on sequential systems, can't take
  workload for granted
      – Software base not mature; evolves with architectures for
        performance
      – So need to open the box

 • Let's begin with parallel programs ...
7/5/2011      CSCE930-Advanced Computer Architecture, Introduction      91
Outline
• Motivating Problems (application case studies)

• Steps in creating a parallel program

• What a simple parallel program looks like
      – In the three major programming models
       – What primitives must a system support?
• Later: Performance issues and architectural
  interactions




7/5/2011      CSCE930-Advanced Computer Architecture, Introduction   92
Motivating Problems
• Simulating Ocean Currents
      – Regular structure, scientific computing

• Simulating the Evolution of Galaxies
      – Irregular structure, scientific computing

• Rendering Scenes by Ray Tracing
      – Irregular structure, computer graphics

• Data Mining
      – Irregular structure, information processing
      – Not discussed here (read in book)




7/5/2011      CSCE930-Advanced Computer Architecture, Introduction   93
Simulating Ocean Currents




                   [Figure: (a) cross sections; (b) spatial discretization of a cross section]



           – Model as two-dimensional grids
           – Discretize in space and time
               » finer spatial and temporal resolution => greater
                 accuracy
           – Many different computations per time step
               » set up and solve equations
           – Concurrency across and within grid computations
7/5/2011       CSCE930-Advanced Computer Architecture, Introduction                           94
Simulating Galaxy Evolution
–   Simulate the interactions of many stars evolving over time
–   Computing forces is expensive
 –   O(n²) brute force approach
 –   Hierarchical methods take advantage of force law:  F = G·m1·m2 / r²

    [Figure: a star on which forces are being computed; stars too close to approximate
     are handled directly, a small group far enough away is approximated by its center
     of mass, and a large group even farther away is approximated as a whole.]

 •Many time-steps, plenty of concurrency across stars within one time-step

7/5/2011             CSCE930-Advanced Computer Architecture, Introduction                      95
Rendering Scenes by Ray Tracing
      – Shoot rays into scene through pixels in image plane
      – Follow their paths
          » they bounce around as they strike objects
          » they generate new rays: ray tree per input ray
      – Result is color and opacity for that pixel
      – Parallelism across rays



• All case studies have abundant concurrency




7/5/2011      CSCE930-Advanced Computer Architecture, Introduction   96
        Creating a Parallel Program
• Assumption: Sequential algorithm is given
    – Sometimes need very different algorithm, but beyond scope

• Pieces of the job:
    –    Identify work that can be done in parallel
    –    Partition work and perhaps data among processes
    –    Manage data access, communication and synchronization
    –    Note: work includes computation, data access and I/O

• Main goal: Speedup (plus low prog. effort and
  resource needs)
 •                    Speedup(p) = Performance(p) / Performance(1)
 • For a fixed problem:
 •                    Speedup(p) = Time(1) / Time(p)
        7/5/2011   CSCE930-Advanced Computer Architecture, Introduction   97
Steps in Creating a Parallel Program
                                 Partitioning

    [Figure: sequential computation --(Decomposition)--> tasks --(Assignment)-->
     processes p0..p3 --(Orchestration)--> parallel program --(Mapping)--> processors
     P0..P3.  Decomposition and assignment together form the partitioning step.]
• 4 steps: Decomposition, Assignment,
  Orchestration, Mapping
    – Done by programmer or system software (compiler, runtime, ...)
    – Issues are the same, so assume programmer does it all explicitly

7/5/2011              CSCE930-Advanced Computer Architecture, Introduction                          98
  Some Important Concepts
• Task:
   – Arbitrary piece of undecomposed work in parallel computation
   – Executed sequentially; concurrency is only across tasks
   – E.g. a particle/cell in Barnes-Hut, a ray or ray group in Raytrace
   – Fine-grained versus coarse-grained tasks

• Process (thread):
    – Abstract entity that performs the tasks assigned to it
   – Processes communicate and synchronize to perform their tasks

• Processor:
   – Physical engine on which process executes
   – Processes virtualize machine to programmer
       » first write program in terms of processes, then map to
         processors

  7/5/2011    CSCE930-Advanced Computer Architecture, Introduction        99
Decomposition
• Break up computation into tasks to be divided
  among processes
      – Tasks may become available dynamically
      – No. of available tasks may vary with time

• i.e. identify concurrency and decide level at
  which to exploit it


• Goal: Enough tasks to keep processes busy, but
  not too many
      – No. of tasks available at a time is upper bound on achievable
        speedup



7/5/2011      CSCE930-Advanced Computer Architecture, Introduction      100
 Limited Concurrency: Amdahl's Law
    – Most fundamental limitation on parallel speedup
    – If fraction s of seq execution is inherently serial, speedup <= 1/s
    – Example: 2-phase calculation
       » sweep over n-by-n grid and do some independent computation
       » sweep again and add each value to global sum
    – Time for first phase = n²/p
    – Second phase serialized at global variable, so time = n²
    – Speedup <= 2n² / (n²/p + n²), or at most 2
    – Trick: divide second phase into two
       » accumulate into private sum during sweep
       » add per-process private sum into global sum
    – Parallel time is n²/p + n²/p + p, and speedup at best
      2n² / (2n²/p + p) = 2n²p / (2n² + p²), i.e., close to p for large n
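
 A small sketch (values chosen purely for illustration) that plugs numbers into the two
 speedup expressions above, showing how serializing the second phase caps speedup near 2
 while private partial sums recover nearly p.

    #include <stdio.h>

    /* Evaluate the two speedup bounds for the 2-phase grid example. */
    int main(void)
    {
        double n = 1024.0, p = 64.0;                     /* illustrative sizes   */
        double serialized  = 2.0*n*n / (n*n/p + n*n);    /* phase 2 serial: <= 2 */
        double private_sum = 2.0*n*n / (2.0*n*n/p + p);  /* ~ p when n >> p      */
        printf("p = %.0f: speedup %.2f (serialized) vs %.2f (private sums)\n",
               p, serialized, private_sum);
        return 0;
    }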



7/5/2011     CSCE930-Advanced Computer Architecture, Introduction              101
Pictorial Depiction

    [Figure: work available over time for the 2-phase example —
     (a) one processor: n² of concurrent work followed by n² of serial summing;
     (b) p processors, serialized sum: n²/p followed by a serial n²;
     (c) p processors with private sums: n²/p + n²/p followed by a short serial
         accumulation of length p.]
7/5/2011                             CSCE930-Advanced Computer Architecture, Introduction     102
Concurrency Profiles
 •Cannot usually divide into serial and parallel part

    [Figure: concurrency profile of a sample parallel program — concurrency (number of
     operations available in parallel, peaking near 1,400) plotted against clock cycle
     number (roughly cycles 150 through 733).]
   – Area under curve is total work done, or time with 1 processor
   – Horizontal extent is lower bound on time (infinite processors)
   – Speedup is the ratio
                        ∞                 ∞
         Speedup(p) =   Σ  f_k · k   /    Σ  f_k · ⌈k/p⌉ ,
                       k=1               k=1
     with base case (serial fraction s):  1 / (s + (1−s)/p)
   – Amdahl's law applies to any overhead, not just limited concurrency

7/5/2011                      CSCE930-Advanced Computer Architecture, Introduction                                                               103
Assignment
• Specifying mechanism to divide work up among
  processes
    – E.g. which process computes forces on which stars, or which
      rays
    – Together with decomposition, also called partitioning
    – Balance workload, reduce communication and management
      cost
• Structured approaches usually work well
    – Code inspection (parallel loops) or understanding of
      application
    – Well-known heuristics
    – Static versus dynamic assignment
• As programmers, we worry about partitioning
  first
    – Usually independent of architecture or prog model
    – But cost and complexity of using primitives may affect
      decisions
 • As architects, we assume program does reasonable job of it
 7/5/2011     CSCE930-Advanced Computer Architecture, Introduction                        104
Orchestration
      –    Naming data
      –    Structuring communication
      –    Synchronization
      –    Organizing data structures and scheduling tasks temporally

• Goals
      – Reduce cost of communication and synch. as seen by
        processors
       – Preserve locality of data reference (incl. data structure
         organization)
      – Schedule tasks to satisfy dependences early
      – Reduce overhead of parallelism management

• Closest to architecture (and programming model
  & language)
      – Choices depend a lot on comm. abstraction, efficiency of
        primitives
      – Architects should provide appropriate primitives efficiently
7/5/2011         CSCE930-Advanced Computer Architecture, Introduction   105
  Mapping
• After orchestration, already have parallel program
• Two aspects of mapping:
   – Which processes will run on same processor, if necessary
   – Which process runs on which particular processor
      » mapping to a network topology

• One extreme: space-sharing
   – Machine divided into subsets, only one app at a time in a subset
   – Processes can be pinned to processors, or left to OS
• Another extreme: complete resource management
  control to OS
   – OS uses the performance techniques we will discuss later
• Real world is between the two
   – User specifies desires in some aspects, system may ignore

• Usually adopt the view: process <-> processor
  7/5/2011    CSCE930-Advanced Computer Architecture, Introduction   106
Parallelizing Computation vs. Data
• Above view is centered around computation
      – Computation is decomposed and assigned (partitioned)


• Partitioning Data is often a natural view too
      – Computation follows data: owner computes
      – Grid example; data mining; High Performance Fortran (HPF)


• But not general enough
      – Distinction between comp. and data stronger in many
        applications
          » Barnes-Hut, Raytrace (later)
      – Retain computation-centric view
      – Data access and communication is part of orchestration




7/5/2011      CSCE930-Advanced Computer Architecture, Introduction   107
   High-level Goals
High performance (speedup over sequential program)

                     Table 2.1  Steps in the Parallelization Process and Their Goals

    Step            Architecture-Dependent?   Major Performance Goals
    Decomposition   Mostly no                 Expose enough concurrency but not too much
    Assignment      Mostly no                 Balance workload;
                                              reduce communication volume
    Orchestration   Yes                       Reduce noninherent communication via data
                                              locality;
                                              reduce communication and synchronization cost
                                              as seen by the processor;
                                              reduce serialization at shared resources;
                                              schedule tasks to satisfy dependences early
    Mapping         Yes                       Put related processes on the same processor
                                              if necessary;
                                              exploit locality in network topology

• But low resource usage and development effort
• Implications for algorithm designers and architects
    – Algorithm designers: high-perf., low resource needs
    – Architects: high-perf., low cost, reduced programming effort
        » e.g. gradually improving perf. with programming effort may be
          preferable to sudden threshold after large programming effort
   7/5/2011      CSCE930-Advanced Computer Architecture, Introduction                                          108
 What Parallel Programs Look Like:
 Parallelization of an Example Program
• Motivating problems all lead to large, complex
  programs

• Examine a simplified version of a piece of Ocean
  simulation
      – Iterative equation solver


• Illustrate parallel program in low-level parallel
  language
      – C-like pseudocode with simple extensions for parallelism
      – Expose basic comm. and synch. primitives that must be
        supported
      – State of most real parallel programming today




7/5/2011      CSCE930-Advanced Computer Architecture, Introduction   110
Grid Solver Example

                      Expression for updating each interior point:

                      A[i,j] = 0.2 × (A[i,j] + A[i,j-1] + A[i-1,j] +
                                      A[i,j+1] + A[i+1,j])

  – Simplified version of solver in Ocean simulation
  – Gauss-Seidel (near-neighbor) sweeps to convergence
      » interior n-by-n points of (n+2)-by-(n+2) updated in each
        sweep
      » updates done in-place in grid, and diff. from prev. value
        computed
      » accumulate partial diffs into global diff at end of every
        sweep
      » check if error has converged (to within a tolerance
        parameter)
      » if so, exit solver; if not, do another sweep
7/5/2011     CSCE930-Advanced Computer Architecture, Introduction                               111
        1.  int n;                        /*size of matrix: (n + 2)-by-(n + 2) elements*/
        2.  float **A, diff = 0;

        3.  main()
        4.  begin
        5.    read(n);                    /*read input parameter: matrix size*/
        6.    A  malloc(a 2-d array of size n + 2 by n + 2 doubles);
        7.    initialize(A);              /*initialize the matrix A somehow*/
        8.    Solve(A);                   /*call the routine to solve equation*/
        9.  end main

        10. procedure Solve(A)            /*solve the equation system*/
        11.   float **A;                  /*A is an (n + 2)-by-(n + 2) array*/
        12. begin
        13.   int i, j, done = 0;
        14.   float diff = 0, temp;
        15.   while (!done) do            /*outermost loop over sweeps*/
        16.     diff = 0;                 /*initialize maximum difference to 0*/
        17.     for i  1 to n do         /*sweep over nonborder points of grid*/
        18.       for j  1 to n do
        19.         temp = A[i,j];        /*save old value of element*/
        20.         A[i,j]  0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
        21.           A[i,j+1] + A[i+1,j]);   /*compute average*/
        22.         diff += abs(A[i,j] - temp);
        23.       end for
        24.     end for
        25.     if (diff/(n*n) < TOL) then done = 1;
        26.   end while
        27. end procedure

7/5/2011          CSCE930-Advanced Computer Architecture, Introduction                              112
  Decomposition
•Simple way to identify concurrency is to look at loop iterations
     –dependence analysis; if not enough concurrency, then look further
•Not much concurrency here at this level (all loops sequential)
•Examine fundamental dependences, ignoring loop structure




– Concurrency O(n) along anti-diagonals, serialization O(n) along diag.
– Retain loop structure, use pt-to-pt synch; Problem: too many synch ops.
– Restructure loops, use global synch; imbalance and too much synch

  7/5/2011       CSCE930-Advanced Computer Architecture, Introduction     113
Exploit Application Knowledge
•Reorder   grid traversal: red-black ordering



    [Figure: red-black ordering of the grid — alternating red and black points; a
     point's four neighbors are all of the other color.]




 –   Different ordering of updates: may converge quicker or slower
 –   Red sweep and black sweep are each fully parallel:
 –   Global synch between them (conservative but convenient)
 –   Ocean uses red-black; we use simpler, asynchronous one to illustrate
       » no red-black, simply ignore dependences within sweep
       » sequential order same as original, parallel program
          nondeterministic
7/5/2011         CSCE930-Advanced Computer Architecture, Introduction   114
  Decomposition Only
15. while (!done) do              /*a sequential loop*/
16.   diff = 0;
17.   for_all i  1 to n do       /*a parallel loop nest*/
18.     for_all j  1 to n do
19.       temp = A[i,j];
20.       A[i,j]  0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.         A[i,j+1] + A[i+1,j]);
22.       diff += abs(A[i,j] - temp);
23.     end for_all
24.   end for_all
25.   if (diff/(n*n) < TOL) then done = 1;
26. end while


 – Decomposition into elements: degree of concurrency n²
 – To decompose into rows, make line 18 loop sequential; degree n
 – for_all leaves assignment to the system
    » but implicit global synch. at end of for_all loop
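
 One plausible realization of this decomposition in C with OpenMP — a sketch, not the
 slides' own notation: the parallel loop plays the role of for_all, the reduction
 gathers per-iteration diffs, and the barrier implied at the end of the loop is the
 implicit global synch. Function and variable names are illustrative.

    #include <math.h>

    /* One sweep with the for_all loops mapped to an OpenMP parallel loop.
     * As in the slide, dependences within the sweep are ignored, so the
     * result is nondeterministic but still converges in practice. */
    void sweep(float **A, int n, float *diff)
    {
        float d = 0.0f;
        #pragma omp parallel for reduction(+:d)
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= n; j++) {
                float temp = A[i][j];
                A[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i-1][j] +
                                  A[i][j+1] + A[i+1][j]);
                d += fabsf(A[i][j] - temp);
            }
        }
        *diff = d;
    }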


  7/5/2011    CSCE930-Advanced Computer Architecture, Introduction   115
     Assignment
 •Static assignments (given decomposition into rows)
     – block assignment of rows: row i is assigned to process ⌊i/(n/p)⌋, i.e., each
       process gets a contiguous block of n/p rows
     – cyclic assignment of rows: process i is assigned rows i, i+p, and so on

     [Figure: block assignment — contiguous bands of rows assigned to processes P0, P1,
      P2, P3. An index-computation sketch for both assignments follows at the end of
      this slide.]
    – Dynamic assignment
       » get a row index, work on the row, get a new row, and so on
    – Static assignment into rows reduces concurrency (from n to p)
       » block assign. reduces communication by keeping adjacent rows
         together
     – Let's dig into orchestration under three programming models
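
 A small sketch of how a process would compute its rows under the two static assignments
 above (assuming, as the slides do later, that nprocs divides n evenly); names are
 illustrative.

    /* Row ranges for process pid; rows numbered 1..n, nprocs processes. */
    void my_rows(int pid, int n, int nprocs)
    {
        /* Block assignment: a contiguous band of n/nprocs rows. */
        int mymin = 1 + pid * (n / nprocs);
        int mymax = mymin + (n / nprocs) - 1;
        for (int i = mymin; i <= mymax; i++) {
            ;   /* work on row i */
        }

        /* Cyclic assignment: rows pid+1, pid+1+nprocs, ... */
        for (int i = pid + 1; i <= n; i += nprocs) {
            ;   /* work on row i */
        }
    }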

     7/5/2011          CSCE930-Advanced Computer Architecture, Introduction        116
 Data Parallel Solver
 1.   int n, nprocs;          /*grid size: (n + 2)-by-(n + 2) elements, and number of processes*/
 2.   float **A, diff = 0;

3.   main()
4.   begin
5.     read(n); read(nprocs);     ;     /*read input grid size and number of processes*/
6.     A  G_MALLOC (a 2-d array of size n+2 by n+2 doubles);
7.     initialize(A);                   /*initialize the matrix A somehow*/
8.     Solve (A);                       /*call the routine to solve equation*/
9.   end main

10. procedure Solve(A)                  /*solve the equation system*/
11.     float **A;                      /*A is an (n + 2-by-n + 2) array*/
12.   begin
13.   int i, j, done = 0;
14.   float mydiff = 0, temp;
14a.    DECOMP A[BLOCK,*, nprocs];
15.   while (!done) do                  /*outermost loop over sweeps*/
16.     mydiff = 0;                     /*initialize maximum difference to 0*/
17.     for_all i  1 to n do           /*sweep over non-border points of grid*/
18.       for_all j  1 to n do
19.         temp = A[i,j];              /*save old value of element*/
20.         A[i,j]  0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.           A[i,j+1] + A[i+1,j]);     /*compute average*/
22.         mydiff += abs(A[i,j] - temp);
23.       end for_all
24.     end for_all
24a.      REDUCE (mydiff, diff, ADD);
25.     if (diff/(n*n) < TOL) then done = 1;
26.   end while
27. end procedure
     7/5/2011        CSCE930-Advanced Computer Architecture, Introduction                  117
Shared Address Space Solver
 Single Program Multiple Data (SPMD)


    [Figure: SPMD structure — multiple processes each call Solve; inside Solve, each
     performs its portion of the sweep, then all processes take part in the convergence
     test.]
       – Assignment controlled by values of variables used as loop bounds
7/5/2011     CSCE930-Advanced Computer Architecture, Introduction     118
1.        int n, nprocs;                 /*matrix dimension and number of processors to be used*/
2a.       float **A, diff;               /*A is global (shared) array representing the grid*/
                                         /*diff is global (shared) maximum difference in current
                                         sweep*/
2b.       LOCKDEC(diff_lock);            /*declaration of lock to enforce mutual exclusion*/
2c.       BARDEC (bar1);                 /*barrier declaration for global synchronization between
                                         sweeps*/

3.     main()
4.     begin
5.           read(n); read(nprocs);     /*read input matrix size and number of processes*/
6.           A  G_MALLOC (a two-dimensional array of size n+2 by n+2 doubles);
7.           initialize(A);             /*initialize A in an unspecified way*/
8a.          CREATE (nprocs–1, Solve, A);
8.        Solve(A);                     /*main process becomes a worker too*/
8b.          WAIT_FOR_END (nprocs–1);   /*wait for all child processes created to terminate*/
9.     end main

10.    procedure Solve(A)
11.       float **A;                                          /*A is entire n+2-by-n+2 shared array,
                                                              as in the sequential program*/
12.    begin
13.       int i,j, pid, done = 0;
14.       float temp, mydiff = 0;                             /*private variables*/
14a.      int mymin = 1 + (pid * n/nprocs);                   /*assume that n is exactly divisible by*/
14b.      int mymax = mymin + n/nprocs - 1                    /*nprocs for simplicity here*/

 15.       while (!done) do                  /*outermost loop over sweeps*/
16.          mydiff = diff = 0;             /*set global diff to 0 (okay for all to do it)*/
16a.      BARRIER(bar1, nprocs);            /*ensure all reach here before anyone modifies diff*/
17.          for i  mymin to mymax do      /*for each of my rows*/
18.                 for j  1 to n do       /*for all nonborder elements in that row*/
19.                 temp = A[i,j];
20.                 A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                    A[i,j+1] + A[i+1,j]);
22.                 mydiff += abs(A[i,j] - temp);
23.             endfor
24.          endfor
25a.         LOCK(diff_lock);               /*update global diff if necessary*/
25b.         diff += mydiff;
25c.         UNLOCK(diff_lock);
25d.         BARRIER(bar1, nprocs);         /*ensure all reach here before checking if done*/
25e.         if (diff/(n*n) < TOL) then done = 1;               /*check convergence; all get
                                                                same answer*/
25f.         BARRIER(bar1, nprocs);
26.       endwhile
27.    end procedure
 7/5/2011               CSCE930-Advanced Computer Architecture, Introduction                              119
Notes on SAS Program
 – SPMD: not lockstep or even necessarily same instructions


 – Assignment controlled by values of variables used as loop bounds
       » unique pid per process, used to control assignment

 – Done condition evaluated redundantly by all

 – Code that does the update identical to sequential program
       » each process has private mydiff variable


 – Most interesting special operations are for synchronization
       » accumulations into shared diff have to be mutually exclusive
       » why the need for all the barriers?



7/5/2011      CSCE930-Advanced Computer Architecture, Introduction   120
Need for Mutual Exclusion
      – Code each process executes:
•                                 load the value of diff into register r1
•                                    add the register r2 to register r1

•                                    store the value of register r1 into diff


       – A possible interleaving:
 •                       P1                                           P2
 •          r1  diff                                                           {P1 gets 0 in its r1}
 •                                                              r1  diff       {P2 also gets 0}
 •          r1  r1+r2                                                          {P1 sets its r1 to 1}
 •                                                              r1  r1+r2      {P2 sets its r1 to 1}
 •          diff  r1                                                           {P1 sets diff to 1}
 •                                                              diff  r1       {P2 also sets diff to 1}


      – Need the sets of operations to be atomic (mutually exclusive)
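
 One way to make that read-modify-write atomic without a lock is a compare-and-swap
 retry loop (C11 atomics). This is only a sketch of the idea — the slides themselves use
 LOCK/UNLOCK, shown next — and it assumes sizeof(float) == sizeof(unsigned int).

    #include <stdatomic.h>
    #include <string.h>

    /* Atomic add to a shared float by retrying a compare-and-swap on its
     * bit pattern.  If another process slips in between the load and the
     * store, the exchange fails and we retry with the fresh value. */
    static _Atomic unsigned int diff_bits;   /* bits of the shared float diff */

    void atomic_add_to_diff(float mydiff)
    {
        unsigned int old_bits, new_bits;
        float val;
        do {
            old_bits = atomic_load(&diff_bits);
            memcpy(&val, &old_bits, sizeof val);       /* decode current value */
            val += mydiff;
            memcpy(&new_bits, &val, sizeof new_bits);  /* encode new value     */
        } while (!atomic_compare_exchange_weak(&diff_bits, &old_bits, new_bits));
    }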




7/5/2011          CSCE930-Advanced Computer Architecture, Introduction                                     121
Mutual Exclusion
• Provided by LOCK-UNLOCK around critical
  section
      – Set of operations we want to execute atomically
      – Implementation of LOCK/UNLOCK must guarantee mutual
        excl.


• Can lead to significant serialization if contended
      – Especially since expect non-local accesses in critical section
      – Another reason to use private mydiff for partial accumulation




7/5/2011      CSCE930-Advanced Computer Architecture, Introduction   122
    Global Event Synchronization
• BARRIER(nprocs): wait here till nprocs processes get
  here
      – Built using lower level primitives
      – Global sum example: wait for all to accumulate before using sum
      – Often used to separate phases of computation
 •   Process P_1                Process P_2           ...   Process P_nprocs
 •   set up eqn system          set up eqn system           set up eqn system
 •   Barrier (name, nprocs)     Barrier (name, nprocs)      Barrier (name, nprocs)
 •   solve eqn system           solve eqn system            solve eqn system
 •   Barrier (name, nprocs)     Barrier (name, nprocs)      Barrier (name, nprocs)
 •   apply results              apply results               apply results
 •   Barrier (name, nprocs)     Barrier (name, nprocs)      Barrier (name, nprocs)

     – Conservative form of preserving dependences, but easy to use


• WAIT_FOR_END (nprocs-1)
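
 A minimal sketch of the BARRIER(bar1, nprocs) primitive using POSIX barriers — one
 possible implementation substrate, assumed here rather than taken from the slides.

    #include <pthread.h>

    /* Every thread blocks in pthread_barrier_wait until nprocs threads
     * have arrived; the barrier separates phases of the computation. */
    static pthread_barrier_t bar1;

    void barrier_init(int nprocs)
    {
        pthread_barrier_init(&bar1, NULL, nprocs);
    }

    void phased_work(void)
    {
        /* set up equation system ... */
        pthread_barrier_wait(&bar1);
        /* solve equation system ...  */
        pthread_barrier_wait(&bar1);
        /* apply results ...          */
        pthread_barrier_wait(&bar1);
    }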

    7/5/2011      CSCE930-Advanced Computer Architecture, Introduction        123
  Pt-to-pt Event Synch (Not Used
  Here)
• One process notifies another of an event so it can
  proceed
   – Common example: producer-consumer (bounded buffer)
   – Concurrent programming on uniprocessor: semaphores
   – Shared address space parallel programs: semaphores, or use ordinary
     variables as flags

                             P1                                  P2
         a: while (flag is 0) do nothing;             A = 1;
            print A;                                  b: flag = 1;
 •Busy-waiting   or spinning
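
 A sketch of the flag-based producer/consumer above using C11 atomics; the
 release/acquire pair supplies the ordering guarantee that the plain pseudocode quietly
 assumes. Names are illustrative.

    #include <stdatomic.h>
    #include <stdio.h>

    static int A;                 /* data produced by P2     */
    static atomic_int flag;       /* event flag, initially 0 */

    void producer(void)           /* P2 in the slide */
    {
        A = 1;
        atomic_store_explicit(&flag, 1, memory_order_release);      /* b: */
    }

    void consumer(void)           /* P1 in the slide */
    {
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
            ;                     /* a: busy-wait (spinning) */
        printf("%d\n", A);        /* guaranteed to see A == 1 */
    }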



  7/5/2011        CSCE930-Advanced Computer Architecture, Introduction   124
Group Event Synchronization
• Subset of processes involved
      – Can use flags or barriers (involving only the subset)
      – Concept of producers and consumers



• Major types:
       – Single-producer, multiple-consumer
       – Multiple-producer, single-consumer
       – Multiple-producer, multiple-consumer




7/5/2011      CSCE930-Advanced Computer Architecture, Introduction   125
Message Passing Grid Solver
      – Cannot declare A to be shared array any more

      – Need to compose it logically from per-process private arrays
          » usually allocated in accordance with the assignment of
            work
          » process assigned a set of rows allocates them locally

      – Transfers of entire rows between traversals

      – Structurally similar to SAS (e.g. SPMD), but orchestration
        different
          » data structures and data access/naming
          » communication
          » synchronization




7/5/2011      CSCE930-Advanced Computer Architecture, Introduction     126
           1. int pid, n, b;                         /*process id, matrix dimension and number of
                                                     processors to be used*/
           2. float **myA;
           3. main()
           4. begin
           5.      read(n);   read(nprocs);  /*read input matrix size and number of processes*/
           8a.     CREATE (nprocs-1, Solve);
           8b.     Solve();                  /*main process becomes a worker too*/
           8c.     WAIT_FOR_END (nprocs–1); /*wait for all child processes created to terminate*/
           9. end main

           10.  procedure Solve()
           11.  begin
           13.     int i,j, pid, n’ = n/nprocs, done = 0;
           14.     float temp, tempdiff, mydiff = 0;        /*private variables*/
           6.   myA  malloc(a 2-d array of size [n/nprocs + 2] by n+2);
                                             /*my assigned rows of A*/
           7. initialize(myA);               /*initialize my rows of A, in an unspecified way*/

           15. while (!done) do
           16.   mydiff = 0;                /*set local diff to 0*/
           16a.  if (pid != 0) then SEND(&myA[1,0],n*sizeof(float),pid-1,ROW);
            16b.  if (pid != nprocs-1) then
                    SEND(&myA[n’,0],n*sizeof(float),pid+1,ROW);
           16c.  if (pid != 0) then RECEIVE(&myA[0,0],n*sizeof(float),pid-1,ROW);
           16d.  if (pid != nprocs-1) then
                    RECEIVE(&myA[n’+1,0],n*sizeof(float), pid+1,ROW);
                                            /*border rows of neighbors have now been copied
                                            into myA[0,*] and myA[n’+1,*]*/
           17.   for i  1 to n’ do         /*for each of my (nonghost) rows*/
           18.      for j  1 to n do       /*for all nonborder elements in that row*/
           19.        temp = myA[i,j];
           20.        myA[i,j] = 0.2 * (myA[i,j] + myA[i,j-1] + myA[i-1,j] +
           21.           myA[i,j+1] + myA[i+1,j]);
           22.        mydiff += abs(myA[i,j] - temp);
           23.      endfor
           24.   endfor
                                            /*communicate local diff values and determine if
                                            done; can be replaced by reduction and broadcast*/
           25a.     if (pid != 0) then                  /*process 0 holds global total diff*/
           25b.       SEND(mydiff,sizeof(float),0,DIFF);
           25c.       RECEIVE(done,sizeof(int),0,DONE);
           25d.     else                                /*pid 0 does this*/
           25e.       for i  1 to nprocs-1 do          /*for each other process*/
           25f.          RECEIVE(tempdiff,sizeof(float),*,DIFF);
           25g.       mydiff += tempdiff;               /*accumulate into total*/
           25h.     endfor
           25i      if (mydiff/(n*n) < TOL) then            done = 1;
           25j.       for i  1 to nprocs-1 do          /*for each other process*/
           25k.          SEND(done,sizeof(int),i,DONE);
           25l.       endfor
           25m.  endif
            26. endwhile
            27. end procedure
 7/5/2011     CSCE930-Advanced Computer Architecture, Introduction                       127
Notes on Message Passing Program
 – Use of ghost rows
 – Receive does not transfer data, send does
     » unlike SAS which is usually receiver-initiated (load fetches data)
 – Communication done at beginning of iteration, so no asynchrony
 – Communication in whole rows, not element at a time
 – Core similar, but indices/bounds in local rather than global space
 – Synchronization through sends and receives
     » Update of global diff and event synch for done condition
     » Could implement locks and barriers with messages
 – Can use REDUCE and BROADCAST library calls to simplify code

       /*communicate local diff values and determine if done, using reduction and broadcast*/
25b.      REDUCE(0,mydiff,sizeof(float),ADD);
25c.      if (pid == 0) then
25i.        if (mydiff/(n*n) < TOL) then done = 1;
25k.      endif
25m.        BROADCAST(0,done,sizeof(int),DONE);
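
 With MPI as the messaging layer (an assumption, not the slides' own library), the
 REDUCE plus BROADCAST pair collapses into a single collective; a sketch with
 illustrative names:

    #include <mpi.h>

    /* MPI_Allreduce sums the local diffs and returns the total to every
     * process, so each can test convergence itself -- combining the
     * REDUCE and BROADCAST steps above. */
    int converged(float mydiff, int n, float tol)
    {
        float diff;
        MPI_Allreduce(&mydiff, &diff, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
        return (diff / ((float)n * (float)n)) < tol;
    }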



7/5/2011           CSCE930-Advanced Computer Architecture, Introduction                         128
   Send and Receive Alternatives
  Can extend functionality: stride, scatter-gather, groups

  Semantic flavors: based on when control is returned
     Affect when data structures or buffers can be reused at either end


                                   Send/Receive
                                   – Synchronous
                                   – Asynchronous
                                       » Blocking asynch.
                                       » Nonblocking asynch.
   – Affect event synch (mutual excl. by fiat: only one process touches data)
   – Affect ease of programming and performance
• Synchronous messages provide built-in synch. through
  match
   – Separate event synchronization needed with asynch. messages
• With synch. messages, our code is deadlocked. Fix?
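
 One common fix is to make the ghost-row exchange nonblocking (or to use a combined
 send-receive); the MPI sketch below is an assumed realization, not the slides' code.
 Boundary processes can pass MPI_PROC_NULL as a neighbor rank to turn the corresponding
 calls into no-ops.

    #include <mpi.h>

    /* Post all receives and sends without blocking, then wait for the four
     * transfers to finish before sweeping -- no circular wait, no deadlock. */
    void exchange_ghost_rows(float *top_send, float *top_recv,
                             float *bot_send, float *bot_recv,
                             int n, int up, int down)     /* neighbor ranks */
    {
        MPI_Request req[4];
        MPI_Irecv(top_recv, n, MPI_FLOAT, up,   0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(bot_recv, n, MPI_FLOAT, down, 1, MPI_COMM_WORLD, &req[1]);
        MPI_Isend(top_send, n, MPI_FLOAT, up,   1, MPI_COMM_WORLD, &req[2]);
        MPI_Isend(bot_send, n, MPI_FLOAT, down, 0, MPI_COMM_WORLD, &req[3]);
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    }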
   7/5/2011        CSCE930-Advanced Computer Architecture, Introduction        129
Orchestration: Summary
• Shared address space
      –    Shared and private data explicitly separate
      –    Communication implicit in access patterns
      –    No correctness need for data distribution
      –    Synchronization via atomic operations on shared data
      –    Synchronization explicit and distinct from data
           communication


• Message passing
      –    Data distribution among local address spaces needed
      –    No explicit shared structures (implicit in comm. patterns)
      –    Communication is explicit
      –    Synchronization implicit in communication (at least in synch.
           case)
             » mutual exclusion by fiat


7/5/2011         CSCE930-Advanced Computer Architecture, Introduction   130
 Correctness in Grid Solver Program

• Decomposition and Assignment similar in SAS and
  message-passing
• Orchestration is different
   – Data structures, data access/naming, communication, synchronization
                                           SAS       Msg-Passing

  Explicit global data structure?                    Yes               No

  Assignment indept of data layout?                  Yes               No

  Communication                                      Implicit          Explicit

  Synchronization                                    Explicit          Implicit

  Explicit replication of border rows?               No                Yes

Requirements for performance are another story ...
 7/5/2011       CSCE930-Advanced Computer Architecture, Introduction              131

				