Why Parallel Computer Architecture - Computer Architecture

					Introduction

             Spring 2004
          Seungryoul Maeng
        Course Introduction
   Instructor:      Seungryoul Maeng maeng@cs.kaist.ac.kr
                      042-869-3519 (office)
       Web:         http://camars.kaist.ac.kr/~maeng
       Office hours: Tue 1:00-2:30, Thu 1:00-2:30
       Office:        Computer Science Building 4403
                       N2 building CAIS (2nd floor)
   Teaching Assistants
       Jonglok Yu        jlyu@camars.kaist.ac.kr


   WWW address
       http://calab.kaist.ac.kr/~maeng/cs610/index04.html



        Objective of this course
   In-depth understanding of the design and engineering of
    modern parallel computers
       technology forces
       experience of parallel programming
       fundamental architectural issues
          * naming, replication, communication, synchronization
       basic design techniques
          * cache coherence, protocols, networks, pipelining, …
       methods of evaluation
       underlying engineering trade-offs
   from moderate to very large scale


     Introduction
   What is Parallel Architecture?

   Why Parallel Architecture?
       Read chapter 1


   Evolution and Convergence of Parallel Architectures

   Fundamental Design Issues




      What is Parallel Architecture?
   Where is the parallelism in the computer system?




         What is Parallel Architecture?
   A parallel computer is a collection of processing elements that
    cooperate to solve large problems fast (Almasi and Gottlieb 1989)

   Some broad issues:
        Resource Allocation:
           how large a collection?

           how powerful are the elements?

           how much memory?

        Data access, Communication and Synchronization
           how do the elements cooperate and communicate?

           how are data transmitted between processors?

           what are the abstractions and primitives for cooperation?

        Performance and Scalability
           how does it all translate into performance?

           how does it scale?




      Role of a computer architect
   Maximum performance and programmability
    within limits of technology and cost




        Is Parallel Processing Dead?
        (by Hank Dietz, 1996)

   Thinking Machines Corporation
       Integer-only computation using Lisp – CM1
       CM2, CM5 – addition of floating-point hardware
       Rather expensive machines?


   Multiflow
       VLIW design
       Smart compiler and dumb machines
       Speed-up using fine-grain parallelism, no vector computing
       Did not use the newest hardware technology; marketing mistakes
       Portions of their compiler technology went to Intel and HP



        Is Parallel Processing Dead? – cont’d
   Myrias
       Canadian company
       Shared-memory programming model implemented by page-fault
        mechanisms on conventional message-passing hardware
          This technology is now becoming important

       Many unpleasant performance surprises


   Kendall Square Research
       Bright architectural idea – custom cache coherence hardware
       Custom processors vs. commodity microprocessors
       Little cardboard models – cute, but didn’t really inspire confidence



        Is Parallel Processing Dead? – cont’d
   Cray
       Cray Computer – died
          Vector and shared memory high-end computing

       Cray Research, Inc. – subsidiary of SGI
          Attempt to branch into lower-end machines

          Not optimized for that kind of market




   nCUBE
       Custom VLSI processors and hypercube interconnection
       Teamed up with Oracle
       Multimedia server
          Larger markets, less dependent on floating-point speed




        Is Parallel Processing Dead? – cont’d
   MasPar
       SIMD processing elements in a custom VLSI
          Didn’t give the system the peak “macho MFLOPS” needed to capture
           the interest of many people
       Canceled the MP3
       NeoVista – data mining software company


   DEC and HP – Compaq




      Several lessons to be learned from …
   Parallel processing companies may have died, but their ideas
    have largely prospered
   Research on large-scale parallel processing systems is still needed
   Too much custom “stuff” makes the product too late to market
   Parallel commercial computing is a larger, more stable market than
    scientific computing


   Parallel processing is NOT DEAD



        Is Parallel Computing Inevitable?
   Application demands
   Technology Trends
   Architecture Trends
   Economics

   Current trends:
       Today’s microprocessors have multiprocessor support
       Servers and workstations becoming MP: Sun, SGI, DEC, COMPAQ!...
       Cluster computing, GRID computing
       Tomorrow’s microprocessors are multiprocessors




               www.top500.org                                                                                                   Cray X1

Rmax and Rpeak are in GFLOPS.

Rank  Site                                             Country/Year        Manufacturer             Rmax  Rpeak  Computer / Processors
1     Earth Simulator Center                           Japan/2002          NEC                     35860  40960  Earth-Simulator / 5120
2     Los Alamos National Laboratory                   United States/2002  HP                      13880  20480  ASCI Q - AlphaServer SC45, 1.25 GHz / 8192
3     Virginia Tech                                    United States/2003  Self-made               10280  17600  X - 1100 Dual 2.0 GHz Apple G5/Mellanox Infiniband 4X/Cisco GigE / 2200
4     NCSA                                             United States/2003  Dell                     9819  15300  Tungsten - PowerEdge 1750, P4 Xeon 3.06 GHz, Myrinet / 2500
5     Pacific Northwest National Laboratory            United States/2003  HP                       8633  11616  Mpp2 - Integrity rx2600 Itanium2 1.5 GHz, Quadrics / 1936
6     Los Alamos National Laboratory                   United States/2003  Linux Networx            8051  11264  Lightning - Opteron 2 GHz, Myrinet / 2816
7     Lawrence Livermore National Laboratory           United States/2002  Linux Networx/Quadrics   7634  11060  MCR Linux Cluster Xeon 2.4 GHz - Quadrics / 2304
8     Lawrence Livermore National Laboratory           United States/2000  IBM                      7304  12288  ASCI White, SP Power3 375 MHz / 8192
9     NERSC/LBNL                                       United States/2002  IBM                      7304   9984  Seaborg - SP Power3 375 MHz 16 way / 6656
10    Lawrence Livermore National Laboratory           United States/2003  IBM/Quadrics             6586   9216  xSeries Cluster Xeon 2.4 GHz - Quadrics / 1920
11    National Aerospace Laboratory of Japan           Japan/2002          Fujitsu                  5406  11980  PRIMEPOWER HPC2500 (1.3 GHz) / 2304
12    Pittsburgh Supercomputing Center                 United States/2001  HP                       4463   6032  AlphaServer SC45, 1 GHz / 3016
13    NCAR (National Center for Atmospheric Research)  United States/2003  IBM                      4184   8320  pSeries 690 Turbo 1.3 GHz / 1600
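
One way to read the list above is to compare Rmax (measured benchmark performance) with Rpeak (theoretical peak). The sketch below is only illustrative: the figures are copied from the list, while the names `SYSTEMS` and `efficiency` are invented for this example.

```python
# Rmax and Rpeak values copied from the TOP500 excerpt above.
# SYSTEMS and efficiency are illustrative names, not part of the slides.
SYSTEMS = [
    ("Earth-Simulator (NEC)", 35860, 40960),
    ("ASCI Q (HP)", 13880, 20480),
    ("X (Virginia Tech, self-made)", 10280, 17600),
    ("Tungsten (Dell)", 9819, 15300),
    ("Mpp2 (HP)", 8633, 11616),
]


def efficiency(rmax, rpeak):
    """Fraction of the theoretical peak that was sustained (Rmax / Rpeak)."""
    return rmax / rpeak


if __name__ == "__main__":
    for name, rmax, rpeak in SYSTEMS:
        print(f"{name:32s} {100 * efficiency(rmax, rpeak):5.1f}% of peak")
```

Under this reading, the custom vector machine at rank 1 sustains a noticeably larger share of its peak than the commodity clusters below it, which is one of the trade-offs this course returns to.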
     Application Trends
   Application demand for performance fuels advances in hardware,
    which enables new applications, which...




   [Figure: cycle between New Applications and More Performance]




   Range of performance demands
       Need range of system performance with progressively
        increasing cost

Scientific Computing Demand




         Learning Curve for Parallel Applications




   AMBER molecular dynamics simulation program on Intel Paragon
        Starting point was vector code for Cray-1
        145 MFLOPS on Cray90, 406 MFLOPS for final version on 128-processor Paragon
        891 MFLOPS on 128-processor Cray T3D
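
The figures above can be put on a common footing with a little arithmetic. This is a minimal sketch using only the three rates quoted on the slide; the constant and function names are made up for illustration.

```python
# MFLOPS figures quoted on the slide; names are illustrative only.
CRAY90 = 145        # vector starting point
PARAGON_128 = 406   # final version, 128-processor Intel Paragon
T3D_128 = 891       # 128-processor Cray T3D


def relative(rate, baseline=CRAY90):
    """Rate expressed as a multiple of the baseline rate."""
    return rate / baseline


def per_processor(rate, processors):
    """Average rate contributed by each processor."""
    return rate / processors


if __name__ == "__main__":
    print(f"Paragon vs Cray90: {relative(PARAGON_128):.2f}x")
    print(f"T3D vs Cray90:     {relative(T3D_128):.2f}x")
    print(f"Paragon per node:  {per_processor(PARAGON_128, 128):.2f} MFLOPS")
    print(f"T3D per node:      {per_processor(T3D_128, 128):.2f} MFLOPS")
```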
                      Technology Trends
   [Figure: relative performance (log scale, 0.1 to 100) versus year, 1965-1995,
    with curves for supercomputers, mainframes, minicomputers, and microprocessors]
• The natural building block for multiprocessors is now also about the
  fastest!
         General Technology Trends
•   Microprocessor performance increases 50% - 100% per year
•   Transistor count doubles every 3 years
•   DRAM size quadruples every 3 years
•   Huge investment per generation is carried by huge commodity
    market
   [Figure: integer and FP performance (vertical axis 0 to 180), 1987-1992, for
    Sun 4/260, MIPS M/120, MIPS M2000, IBM RS6000/540, HP 9000/750, and DEC Alpha]

• Not that single-processor performance is plateauing, but that
  parallelism is a natural way to improve it.
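
The growth figures on this slide are stated over different periods (per year, every 3 years). A minimal sketch of converting between them follows; the function names are hypothetical, and steady compound growth is assumed.

```python
def annual_rate(multiple, years):
    """Annual growth rate equivalent to growing by `multiple` over `years`."""
    return multiple ** (1.0 / years) - 1.0


def growth_over(rate, years):
    """Total growth factor after `years` at a fixed annual `rate` (0.5 = 50%)."""
    return (1.0 + rate) ** years


if __name__ == "__main__":
    print(f"Transistors, 2x every 3 years: {annual_rate(2, 3):.1%} per year")
    print(f"DRAM size, 4x every 3 years:   {annual_rate(4, 3):.1%} per year")
    print(f"50% per year over a decade:    {growth_over(0.5, 10):.1f}x")
    print(f"100% per year over a decade:   {growth_over(1.0, 10):.0f}x")
```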
               Technology: A Closer Look
   Basic advance is decreasing feature size (λ)
        Circuits become either faster or lower in power
   Die size is growing too
        Clock rate improves roughly in proportion to the improvement in λ
        Number of transistors improves like λ² (or faster)
   Performance > 100x per decade; clock rate 10x, rest transistor count
   How to use more transistors?
        Parallelism in processing
              multiple operations per cycle reduces CPI
        Locality in data access
              avoids latency and reduces CPI
              also improves processor utilization
        Both need resources, so tradeoff
   [Figure: processor (Proc), cache ($), and interconnect]

   Fundamental issue is resource distribution, as in uniprocessors
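     NMOS Inverter

NMOS Transistor (NMOS FET)

   [Figure: cross-section on P-type silicon, built up in steps]
       SiO2 oxide layer (about 0.6 micron)
       gate oxide (about 0.05 micron)
       polysilicon (deposited by Low Pressure Chemical Vapor Deposition)
       As (arsenic) ion implantation
          forms the n+ source and drain regions

NMOS Inverter

   [Figure: remaining steps and layout]
       oxide growth
       after etching the contact areas, aluminum deposition and patterning
       Length unit: λ (micron)
       [Figure label: 2λ]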
         Clock Frequency Growth Rate
   [Figure: clock rate (MHz, log scale from 0.1 to 1,000) versus year, 1970-2005.
    Labeled points: i4004, i8008, i8080, i8086, i80286, i80386, Pentium100, R10000.
    Annotations: Intel Pentium III 500MHz; Intel P4 2.2 GHz (2002);
    Intel Xeon 3.2GHz (2004); Alpha 21264 600 MHz; 21364 1.2GHz;
    UltraSparc II 480MHz; IV 1GHZ (2001); UltraSparc IV 1.2GHz (2003)]
• 30% per year
                 Transistor Count Growth Rate
   [Figure: transistors per chip (log scale from 1,000 to 100,000,000) versus year, 1970-2005.
    Labeled points: i4004, i8008, i8080, i8086, i80286, R2000, R3000, i80386, Pentium, R10000.
    Annotations: Alpha 21264 15.2 M; 21364 100M (8M+92M); UltraSparc II 5.4M;
    Pentium 4 55M in 2002; Itanium 2 (1.5GHz) 221M in 2003; 1 billion transistors in 2010]
       • 100 million transistors on chip by early 2000s
       • Transistor count grows much faster than clock rate
              - 40% per year, order of magnitude more contribution in 2 decades
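
As a rough consistency check on the 40% figure, two annotations from the chart (Pentium 4 with 55M transistors in 2002, and the projected 1 billion transistors in 2010) give an implied annual growth rate. The helper name below is hypothetical, and constant compound growth between the two points is an assumption.

```python
def implied_annual_rate(count_start, year_start, count_end, year_end):
    """Constant annual growth rate that links two (year, count) points."""
    years = year_end - year_start
    return (count_end / count_start) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    # Chart annotations: Pentium 4, 55M in 2002; projected 1 billion in 2010.
    rate = implied_annual_rate(55e6, 2002, 1e9, 2010)
    print(f"Implied transistor growth 2002-2010: {rate:.1%} per year")
```

Under these assumptions the implied rate comes out close to the 40% per year quoted on the slide.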
        Similar Story for Storage
   Divergence between memory capacity and speed more
    pronounced
       Capacity increased by 1000x from 1980-95, speed only 2x
       Gap with processor speed much greater
   Larger memories are slower, while processors get faster
       Need to transfer more data in parallel
       Need deeper cache hierarchies
       How to organize caches?
   Parallelism increases effective size of each level of hierarchy,
    without increasing access time
   Parallelism and locality within memory systems too
       New designs fetch many bits within memory chip; follow with fast pipelined
        transfer across narrower interface
       Buffer caches most recently accessed data
   Disks too: Parallel disks plus caching
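
Why deeper cache hierarchies help can be illustrated with the usual textbook average memory access time (AMAT) recurrence. This formula is not taken from the slides, and every latency and miss rate below is an assumed, illustrative number.

```python
def amat(levels, memory_latency):
    """Average memory access time for a cache hierarchy.

    levels: list of (hit_time, local_miss_rate), ordered from L1 outward.
    memory_latency: time to reach main memory after the last level misses.
    All times share one unit (for example, processor cycles).
    """
    total = memory_latency
    # Work inward: each level adds its hit time plus its share of misses.
    for hit_time, miss_rate in reversed(levels):
        total = hit_time + miss_rate * total
    return total


if __name__ == "__main__":
    one_level = [(1, 0.05)]               # assumed: 1-cycle L1, 5% misses
    two_level = [(1, 0.05), (10, 0.20)]   # assumed: add a 10-cycle L2, 20% local misses
    print(f"L1 only: {amat(one_level, 100):.2f} cycles")
    print(f"L1 + L2: {amat(two_level, 100):.2f} cycles")
```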
        Architectural Trends
   Architecture translates technology’s gifts to performance and
    capability
   Resolves the tradeoff between parallelism and locality
       Current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip connect
       Tradeoffs may change with scale and technology advances
   Understanding microprocessor architectural trends
       Helps build intuition about design issues of parallel machines
       Shows fundamental role of parallelism even in “sequential” computers
   Four generations of architectural history: tube, transistor, IC,
    VLSI
       Here focus only on VLSI generation




         Architectural Trends
   Greatest trend in VLSI generation is increase in parallelism
        Up to 1985: bit level parallelism: 4-bit -> 8 bit -> 16-bit
             slows after 32 bit
             adoption of 64-bit now under way, 128-bit far (not performance issue)
             great inflection point when 32-bit micro and cache fit on a chip
        Mid 80s to mid 90s: instruction level parallelism
             pipelining and simple instruction sets, + compiler advances (RISC)
             on-chip caches and functional units => superscalar execution
             greater sophistication: out of order execution, speculation, prediction
                 to deal with control transfer and latency problems


        Next step: thread level parallelism



       Phases in VLSI Generation
   [Figure: transistors per chip (log scale from 1,000 to 100,000,000) versus year, 1970-2005,
    divided into three phases: bit-level parallelism, instruction-level parallelism,
    and thread-level parallelism (?).
    Labeled points: i4004, i8008, i8080, i8086, i80286, R2000, R3000, i80386, Pentium, R10000]



   How good is instruction-level parallelism?
   Thread-level needed in microprocessors?
        Architectural Trends: ILP
• Reported speedups for superscalar processors
         • Horst, Harris, and Jardine [1990]     1.37
         • Wang and Wu [1988]                    1.70
         • Smith, Johnson, and Horowitz [1989]   2.30
         • Murakami et al. [1989]                2.55
         • Chang et al. [1991]                   2.90
         • Jouppi and Wall [1989]                3.20
         • Lee, Kwok, and Briggs [1991]          3.50
         • Wall [1991]                           5
         • Melvin and Patt [1991]                8
         • Butler et al. [1991]                  17+
   Large variance due to difference in
       application domain investigated (numerical versus non-numerical)
       capabilities of processor modeled
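
The size of that variance can be made concrete by summarizing the reported numbers. The sketch below uses only the values listed above; the "17+" entry is treated as 17, which is an assumption, and `summarize` is an illustrative helper.

```python
from statistics import median

# Speedups as listed on the slide.
REPORTED_SPEEDUPS = {
    "Horst, Harris, and Jardine [1990]": 1.37,
    "Wang and Wu [1988]": 1.70,
    "Smith, Johnson, and Horowitz [1989]": 2.30,
    "Murakami et al. [1989]": 2.55,
    "Chang et al. [1991]": 2.90,
    "Jouppi and Wall [1989]": 3.20,
    "Lee, Kwok, and Briggs [1991]": 3.50,
    "Wall [1991]": 5.0,
    "Melvin and Patt [1991]": 8.0,
    "Butler et al. [1991]": 17.0,  # slide says "17+"; taken as 17 here
}


def summarize(speedups):
    """Minimum, median, maximum, and max/min spread of the reported speedups."""
    values = sorted(speedups.values())
    return {
        "min": values[0],
        "median": median(values),
        "max": values[-1],
        "spread": values[-1] / values[0],
    }


if __name__ == "__main__":
    for key, value in summarize(REPORTED_SPEEDUPS).items():
        print(f"{key:7s} {value:.2f}")
```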
                                          ILP Ideal Potential
   [Figure, left panel: fraction of total cycles (%, 0 to 30) versus number of
    instructions issued (0 to 6+).
    Right panel: speedup (0 to 3) versus instructions issued per cycle (0 to 15)]
• Infinite resources and fetch bandwidth, perfect branch prediction and renaming
       – real caches and non-zero miss latencies
    Results of ILP Studies
   Concentrate on parallelism for 4-issue machines




• Realistic studies show only 2-fold speedup
• Recent studies show that more ILP needs to look across threads
• “Billion-Transistor Architectures” IEEE Computer, September 1997
    Thread-Level Parallelism “on board”
   [Figure: four processors (Proc) connected to a shared memory (MEM)]


   Micro on a chip makes it natural to connect many to shared memory
     –   dominates server and enterprise market, moving down to desktop
   Faster processors began to saturate bus, then bus technology
    advanced
     –   today, range of sizes for bus-based systems, desktop to large servers
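
Bus saturation can be illustrated with a deliberately simple model: each processor places a fixed bandwidth demand on a shared bus of fixed capacity. The numbers and function names below are assumptions chosen for illustration, and contention and arbitration overheads are ignored.

```python
def bus_utilization(processors, demand_mb_s, bus_mb_s):
    """Fraction of the shared bus consumed by `processors` identical processors."""
    return processors * demand_mb_s / bus_mb_s


def max_processors(demand_mb_s, bus_mb_s):
    """Largest processor count the bus can carry before it saturates."""
    return int(bus_mb_s // demand_mb_s)


if __name__ == "__main__":
    bus = 1000.0  # assumed shared-bus bandwidth in MB/s
    for demand in (50.0, 100.0):  # assumed per-processor demand in MB/s
        print(f"{demand:.0f} MB/s per processor: up to "
              f"{max_processors(demand, bus)} processors on a {bus:.0f} MB/s bus")
```

In this model, doubling each processor's demand halves the number of processors a given bus supports, which is why faster processors had to be matched by faster buses.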


Architectural Trends: Bus-based MPs
   [Figure: number of processors (0 to 70) in bus-based multiprocessors versus year, 1984-1998.
    Labeled systems: Sequent B8000, Sequent B2100, Symmetry21, Symmetry81, SGI PowerSeries,
    SS690MP 120, SS690MP 140, Power, SS10, SS20, SS1000, SS1000E, Sun SC2000, SC2000E,
    SGI Challenge, SGI PowerChallenge/XL, SE10, SE30, SE60, SE70, AS2100, AS8400, HP K400,
    P-Pro, CRAY CS6400, Sun E6000, Sun E10000]


                                                                                                                            34
                                  Bus Bandwidth
   [Figure: shared bus bandwidth in MB/s (y-axis, log scale from 10 to 100,000, with 1 GB and 10 GB marked) versus year (x-axis, 1984 to 1998). Systems plotted: Sequent B8000, Sequent B2100, Symmetry81/21, SGI PowerSeries, Power, SS690MP 120, SS690MP 140, SS10/SE10/SE60, SS20, SE70/SE30, SS1000, SS1000E, SC2000, SC2000E, AS2100, AS8400, HP K400, CS6400, P-Pro, SGI Challenge, SGI PowerChallenge XL, Sun E6000, Sun E10000. The points rise from tens of MB/s (Sequent B8000, B2100) to above 10 GB/s (Sun E10000).]
        Interconnection Networks
   • Gigabit Ethernet
   • Myrinet: 1.2 Gbps
   • InfiniBand
       - 250 MB/s to 3 GB/s unidirectional bandwidth
       - 500 MB/s to 6 GB/s bidirectional bandwidth
   • What is the difference between a bus and a network?




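As a rough sketch of the arithmetic behind these rates, and of one way to frame the bus-versus-network question, the Python below converts link rates from Gbps to MB/s and contrasts the per-processor share of a shared bus with the aggregate bandwidth of per-node links. The function names, the decimal 8-bits-per-byte conversion, the reading of the quoted rates as data rates, and the 16-processor numbers in the usage section are illustrative assumptions, not values from the slides.

```python
def gbps_to_mb_per_s(gbps):
    """Convert a link data rate in Gbps to MB/s (decimal units, 8 bits per byte)."""
    return gbps * 1000.0 / 8.0


def bidirectional(unidirectional_mb_per_s):
    """A full-duplex link carries the unidirectional rate in each direction."""
    return 2.0 * unidirectional_mb_per_s


def bus_share_per_processor(bus_mb_per_s, processors):
    """On a shared bus, all processors contend for one fixed bandwidth."""
    return bus_mb_per_s / processors


def network_aggregate(link_mb_per_s, nodes):
    """Upper bound on aggregate injection bandwidth when every node has its own
    link (ignores contention inside the network itself)."""
    return link_mb_per_s * nodes


if __name__ == "__main__":
    print("Gigabit Ethernet MB/s:", gbps_to_mb_per_s(1.0))
    print("Myrinet MB/s:", gbps_to_mb_per_s(1.2))
    print("InfiniBand bidirectional MB/s:", bidirectional(250.0), bidirectional(3000.0))
    # Hypothetical comparison: a 1,000 MB/s bus versus 250 MB/s links, 16 processors.
    print("Bus share per processor:", bus_share_per_processor(1000.0, 16))
    print("Network aggregate bound:", network_aggregate(250.0, 16))
```

Under these simplifying assumptions, adding processors shrinks each processor's share of the bus, while the network's aggregate bound grows with the node count.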
        Economics
   • Commodity microprocessors are not only fast but also cheap
       - Development cost is tens of millions of dollars ($5-100 million typical)
       - BUT many more are sold compared to supercomputers
       - Crucial to take advantage of this investment and use the commodity building block
       - Exotic parallel architectures are no more than special-purpose
   • Multiprocessors are being pushed by software vendors (e.g., database) as well as hardware vendors
   • Standardization by Intel makes small, bus-based SMPs a commodity
   • Desktop: a few smaller processors versus one larger one?
       - Multiprocessor on a chip
   • Cluster computing
        Consider Scientific Supercomputing
   • Proving ground and driver for innovative architecture and techniques
       - Market is smaller relative to commercial computing as MPs become mainstream
       - Dominated by vector machines starting in the 1970s
       - Microprocessors have made huge gains in floating-point performance
          * high clock rates
          * pipelined floating-point units (e.g., multiply-add every cycle)
          * instruction-level parallelism
          * effective use of caches (e.g., automatic blocking)
       - Plus economics
   • Large-scale multiprocessors replace vector supercomputers
       - Well under way already
       - Top-500 Supercomputers

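The "blocking" mentioned above for effective cache use can be illustrated with a tiled matrix multiply. The sketch below is a minimal pure-Python illustration, assuming square matrices stored as lists of rows; the function names and the default block size are hypothetical, and real systems would rely on a compiler or a tuned library rather than hand-written loops like these.

```python
def matmul_naive(a, b):
    """Reference triple loop: c[i][j] = sum over k of a[i][k] * b[k][j]."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, block=2):
    """Same product, computed tile by tile so that each block-by-block tile of
    a, b, and c is reused while it is (ideally) still resident in cache."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):          # tile row of c
        for kk in range(0, n, block):      # tile along the summation index
            for jj in range(0, n, block):  # tile column of c
                for i in range(ii, min(ii + block, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + block, n)):
                        a_ik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += a_ik * row_b[j]
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    b = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
    print(matmul_blocked(a, b, block=2) == matmul_naive(a, b))
```

The blocked version performs the same arithmetic as the naive one; only the traversal order changes, which is what lets a cache hold the working set of each tile.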
Raw Uniprocessor Performance:
LINPACK
   [Figure: LINPACK performance in MFLOPS (y-axis, log scale from 1 to 10,000) versus year (x-axis, 1975 to 2000). Legend: CRAY n = 1,000; CRAY n = 100; Micro n = 1,000; Micro n = 100. Cray systems plotted: CRAY 1s, Xmp/14se, Xmp/416, Ymp, C90, T94. Microprocessor systems plotted: Sun 4/260, MIPS M/120, MIPS M/2000, IBM RS6000/540, HP 9000/750, DEC Alpha AXP, HP9000/735, DEC Alpha, MIPS R4400, IBM Power2/990, DEC 8200. An annotation on the slide reads "14CPU, 28GB, 85 TB of Disk".]
         Raw Parallel Performance:
         LINPACK
   [Figure: LINPACK performance in GFLOPS (y-axis, log scale from 0.1 to 10,000) versus year (x-axis, 1985 to 1996). Legend: MPP peak; CRAY peak. Cray systems plotted: Xmp/416(4), Ymp/832(8), C90(16), T932(32). MPP systems plotted: nCUBE/2(1024), iPSC/860, CM-2, CM-200, Delta, Paragon XP/S, CM-5, T3D, Paragon XP/S MP (1024), Paragon XP/S MP (6768), ASCI Red.]

• Even vector Crays became parallel: X-MP (2-4), Y-MP (8), C-90 (16), T932 (32)
• Since 1993, Cray has produced MPPs too (T3D, T3E)
500 Fastest Computers
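   [Figure: number of systems (y-axis, 0 to 350) among the 500 fastest computers, by type (MPP, PVP, SMP), at 11/93, 11/94, 11/95, and 11/96. Data labels recovered from the plot: 11/93: 313, 187, 0; 11/94: 239, 198, 63; 11/95: 284, 110, 106; 11/96: 319, 106, 73. The assignment of each label to a series is not recoverable from the extracted text.]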
        Summary: Why Parallel Architecture?
   • Increasingly attractive
       - Economics, technology, architecture, application demand
   • Increasingly central and mainstream
   • Parallelism exploited at many levels
       - Instruction-level parallelism
       - Multiprocessor servers
       - Large-scale multiprocessors ("MPPs")
   • Focus of this class: multiprocessor level of parallelism
   • Same story from the memory system perspective
       - Increase bandwidth and reduce average latency with many local memories
   • A wide range of parallel architectures makes sense
       - Different cost, performance, and scalability

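The memory-system point above can be made concrete with a small back-of-the-envelope model. The sketch below assumes a simple two-level picture in which a fraction of accesses is served by a local memory and the rest by a remote one; the function names, the model itself, and the latency and bandwidth numbers in the usage section are hypothetical and chosen only for illustration.

```python
def average_latency(local_fraction, local_ns, remote_ns):
    """Weighted average access time when local_fraction of accesses hit the
    local memory and the remainder go to a remote memory."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


def aggregate_bandwidth(per_memory_mb_per_s, memories):
    """Upper bound on total bandwidth when each memory serves requests
    independently (ignores interconnect limits and hot spots)."""
    return per_memory_mb_per_s * memories


if __name__ == "__main__":
    # Hypothetical numbers: 100 ns local access, 1,000 ns remote access.
    for fraction in (0.5, 0.9, 0.99):
        print(fraction, average_latency(fraction, 100.0, 1000.0))
    # Hypothetical: 8 local memories, each delivering 100 MB/s.
    print(aggregate_bandwidth(100.0, 8))
```

In this model the average latency approaches the local latency as more accesses are satisfied locally, and the bandwidth bound grows with the number of memories.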

				