

									      CS 320
   Spring 2003

    Laxmikant Kale
          Course objectives and outline
• You will learn about:
   – Parallel architectures overview
       • Message passing support, routing, interconnection networks
       • Cache-coherent scalable shared memory, synchronization
       • Later
           – Relaxed consistency models (?)
           – Novel architectures: Tera, Blue Gene, Processors-in-memory
   – Parallel programming models
       • Emphasis on 3: message passing, shared memory, and shared objects
       • Ongoing evaluation and comparison of models
   – Commonly needed parallel algorithms/operations
       • Analysis techniques
   – Parallel application categories
   – Performance analysis and optimization of parallel applications
   – Parallel application case studies
               Project and homeworks
• Significant (effort and grade percentage) course project
   – Groups of 5 students
   – Expect publication quality results
• Homeworks/machine problems:
   – weekly (sometimes biweekly)
• Parallel machines:
   – NCSA Origin 2000, Turing Cluster, SUN cluster, SMP machine
   – Possible: Large machines for evaluating scalability:
       • 1000 processor NCSA cluster
       • 3000 processor Lemieux machine at PSC

• Much of the course will be run via the web
   – Lecture slides and assignments will be available on the course web site
       • http://www-courses.cs.uiuc.edu/~cs320
   – Most of the reading material (papers, manuals) will be on the web
   – Projects will coordinate and submit information on the web
       • Web pages for individual projects will be linked to the course web page
   – Newsgroup: uiuc.class.ece392
• You are expected to read the newsgroup and web pages

         Advent of parallel computing

• “Parallel computing is necessary to increase speeds”
   – Cry of the ‘70s
   – Processors kept pace with Moore’s law:
       • Doubling speeds every 18 months
• Now, finally, the time is ripe
   – Uniprocessors are commodities (and processor speeds show signs
     of slowing down)
   – Highly economical to build parallel machines

               Why parallel computing
• It is the only way to increase speed beyond uniprocessors
   – Except, of course, waiting for uniprocessors to become faster!
   – Several applications require orders of magnitude higher
     performance than feasible on uniprocessors
• Cost effectiveness:
   – older argument
   – in 1985, a supercomputer cost 2000 times more than a desktop, yet
     performed only 400 times faster.
   – So: combine microcomputers to get speed at lower costs
   – Incremental scalability:
       • can get in-between performance points with 20, 50, 100,… processors
   – But:
       • You may get speedup lower than 400 on 2000 processors!
       • Microcomputers became faster, killing supercomputers, effectively
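
As a rough worked version of the cost argument above, the sketch below computes the price/performance advantage implied by the 1985 figures and the parallel efficiency at which 2000 desktops would merely match the supercomputer. The function names are illustrative, and the model assumes the only costs are the machines themselves (no interconnect or software cost).

```python
# Figures quoted on the slide (1985): a supercomputer cost 2000x a desktop
# and ran 400x faster.
COST_RATIO = 2000
SPEED_RATIO = 400


def price_performance_advantage(cost_ratio, speed_ratio):
    """How many times more performance per dollar the desktop delivers."""
    return cost_ratio / speed_ratio


def break_even_efficiency(cost_ratio, speed_ratio):
    """Parallel efficiency (speedup / processors) at which `cost_ratio`
    desktops, bought for the price of one supercomputer, just match it."""
    return speed_ratio / cost_ratio


def speedup(processors, efficiency):
    """Speedup over one processor for a given parallel efficiency."""
    return processors * efficiency


if __name__ == "__main__":
    print(price_performance_advantage(COST_RATIO, SPEED_RATIO))  # 5.0
    print(break_even_efficiency(COST_RATIO, SPEED_RATIO))        # 0.2
    # At 12.5% efficiency, 2000 processors fall short of the 400x target.
    print(speedup(2000, 0.125))                                  # 250.0
```

In this simplified model, the combined machine wins only if the application sustains more than 20% parallel efficiency on 2000 processors, which is the caveat raised under "But".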

                               Technology Trends



[Figure: technology trends by machine class (recovered label: Mainframes); x-axis: year, 1965 to 1995]

The natural building block for multiprocessors is now also about the fastest!

• Commodity microprocessors not only fast but CHEAP
   • Development cost is tens of millions of dollars ($5-100 million typical)
   • BUT, many more are sold compared to supercomputers
   – Crucial to take advantage of the investment, and use the
     commodity building block
   – Exotic parallel architectures no more than special-purpose

• Multiprocessors being pushed by software vendors (e.g.
  database) as well as hardware vendors
• Standardization by Intel makes small, bus-based SMPs a commodity
• Desktop: few smaller processors versus one larger one?
   – Multiprocessor on a chip
                     What to Expect?
• Parallel Machine classes:
   – Cost and usage defines a class! Architecture of a class may change.
   – Desktops, engineering workstations, database/web servers, supercomputers
• Commodity (home/office) desktop:
   – less than $10,000
   – possible to provide 10-50 processors for that price!
   – Driver applications:
       • games, video/signal processing
       • possibly “peripheral” AI: speech recognition, natural language
         understanding (?), smart spaces and agents
       • New applications?

              Engineering workstations
• Price: less than $100,000 (used to be):
   – new acceptable price level may be $50,000
   – 100+ processors, large memory,
   – Driver applications:
       •   CAD (Computer aided design) of various sorts
       •   VLSI
       •   Structural and mechanical simulations…
       •   Etc. (many specialized applications)

                 Commercial Servers
• Price range: variable ($10,000 - several hundreds of thousands)
   – defining characteristic: usage
   – Database servers, decision support (MIS), web servers, e-
• High availability, fault tolerance are main criteria
• Trends to watch out for:
   – Likely emergence of specialized architectures/systems
       • E.g. Oracle’s “No Native OS” approach
• Currently dominated by database servers, and TPC benchmarks
   – TPC: Transaction Processing Performance Council; its benchmarks
     measure transaction throughput
   – But this may change to data mining and application servers, with
     corresponding impact on architecture.

                    Supercomputers
• “Definition”: expensive system?!
   – Used to be defined by architecture (vector processors, ..)
   – More than a million US dollars?
   – Thousands of processors
• Driving applications
   –   Grand challenges in science and engineering:
   –   Global weather modeling and forecast
   –   Rational Drug design / molecular simulations
   –   Processing of genetic (genome) information
   –   Rocket simulation
   –   Airplane design (wings and fluid flow..)
   –   Operations research?? Not recognized yet
   –   Other non-traditional applications?

     Consider Scientific Supercomputing
• Proving ground and driver for innovative architecture and techniques
   – Market smaller relative to commercial as MPs become mainstream
   – Dominated by vector machines starting in 70s
   – Microprocessors have made huge gains in floating-point
       •   high clock rates
       •   pipelined floating point units (e.g., multiply-add every cycle)
       •   instruction-level parallelism
       •   effective use of caches (e.g., automatic blocking)
   – Plus economics
• Large-scale multiprocessors replace vector supercomputers
   – Well under way already
   – Except with the Earth Simulator: thousands of vector processors

Scientific Computing Demand

      Engineering Computing Demand
• Large parallel machines a mainstay in many industries
   – Petroleum (reservoir analysis)
   – Automotive (crash simulation, drag analysis, combustion efficiency),
   – Aeronautics (airflow analysis, engine efficiency, structural
     mechanics, electromagnetism),
   – Computer-aided design
   – Pharmaceuticals (molecular modeling)
   – Visualization
       • in all of the above
       • entertainment (films like Toy Story)
       • architecture (walk-throughs and rendering)
   – Financial modeling (yield and derivative analysis)
   – etc.
Applications: Speech and Image Processing

[Figure: processing requirement (log scale, 1 MIPS to 10 GIPS) versus year (1980 to 1995) for speech and image applications: sub-band speech coding, 200-word isolated speech recognition, CELP speech coding, telephone number recognition, ISDN-CD stereo receiver, 1,000-word continuous speech recognition, CIF video, HDTV receiver, 5,000-word speech recognition]

  • Also CAD, Databases, . . .
  • 100 processors get you the uniprocessor performance of 10 years from now; 1000 get you 20!
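
The rule of thumb above can be checked with a short calculation. This is a minimal sketch that assumes ideal (linear) speedup and the 18-month doubling of uniprocessor speed quoted earlier in these slides; both assumptions, and the function names, are added here for illustration and are not stated on this slide.

```python
import math


def years_ahead(processors, doubling_months=18.0):
    """Years of uniprocessor progress matched by an ideal `processors`-fold
    speedup, if single-processor speed doubles every `doubling_months`."""
    return math.log2(processors) * doubling_months / 12.0


def annual_growth_for(processors, years):
    """Annual growth factor under which a `processors`-fold speedup
    corresponds to `years` of waiting."""
    return processors ** (1.0 / years)


if __name__ == "__main__":
    print(round(years_ahead(100), 1))             # 10.0
    print(round(years_ahead(1000), 1))            # 14.9
    print(round(annual_growth_for(1000, 20), 2))  # 1.41
```

Under these assumptions, 100 processors correspond to about 10 years, as the slide says. The rounder figure of 20 years for 1000 processors would match a slower assumed growth rate of roughly 41% per year.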
   Learning Curve for Parallel Applications

• AMBER molecular dynamics simulation program
• Starting point was vector code for Cray-1
• 145 MFLOPS on Cray90; 406 MFLOPS for the final version on a 128-processor
  Paragon; 891 MFLOPS on a 128-processor Cray T3D
               Scalability Challenges
• Scalability Challenges
   – Machines are getting bigger and faster
• But
   – Communication Speeds?
   – Memory speeds?

 "Now, here, you see, it takes all the running you can do to keep
 in the same place"
        ---Red Queen to Alice in “Through The Looking Glass”
   – Applications are getting more ambitious and complex
       • Irregular structures and dynamic behavior
   – Programming models?
           Current Scenario: Machines
• Extremely High Performance machines abound
• Clusters in every lab
   – GigaFLOPS per processor!
   – 100 GFLOPS performance possible
• High End machines at centers and labs:
   – Many thousand processors, multi-TF performance
   – Earth Simulator, ASCI White, PSC Lemieux,..
• Future Machines
   – Blue Gene/L : 128k processors!
   – Blue Gene Cyclops Design: 1M processors
       • Multiple Processors per chip
       • Low Memory to Processor Ratio

              Communication Architecture

• On clusters:
   – 100 Mb/s Ethernet
       • 100 μs latency
   – Myrinet switches
       • User-level memory-mapped communication
       • 5-15 μs latency, 200 MB/s bandwidth
       • Relatively expensive compared with cheap PCs
   – VIA, InfiniBand
• On high end machines:
   – 5-10 μs latency, 300-500 MB/s bandwidth
   – Custom switches (IBM, SGI, ..)
   – Quadrics
• Overall:
   – Communication speeds have increased but not as much as processor speeds
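
To see what the latency and bandwidth figures above mean for a single message, here is a minimal sketch using the common "latency plus size over bandwidth" cost model. The model itself, the function name, and the parameter choices are assumptions for illustration. In particular, the Ethernet entry is treated as 100 Mbit/s Fast Ethernet (12.5 MB/s), and Myrinet latency is taken as 10 μs, the middle of the quoted range.

```python
def transfer_time_us(message_bytes, latency_us, bandwidth_mb_per_s):
    """Estimated one-way message time in microseconds under a simple
    latency + size/bandwidth model (1 MB/s == 1 byte per microsecond)."""
    return latency_us + message_bytes / bandwidth_mb_per_s


# Illustrative parameters based on the figures on the slide.
ETHERNET = {"latency_us": 100.0, "bandwidth_mb_per_s": 12.5}  # assumed 100 Mbit/s
MYRINET = {"latency_us": 10.0, "bandwidth_mb_per_s": 200.0}   # mid-range of 5-15 us

if __name__ == "__main__":
    for size in (64, 1024, 1_000_000):
        eth = transfer_time_us(size, **ETHERNET)
        myr = transfer_time_us(size, **MYRINET)
        print(f"{size:>9} bytes: Ethernet {eth:10.1f} us, Myrinet {myr:8.1f} us")
```

For small messages the latency term dominates, so a GFLOPS-class processor could have executed on the order of tens of thousands of floating-point operations in the time one short Ethernet message takes.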
                  Memory and Caches
• Bottom line again:
   – Memories are faster, but not keeping pace with processors
   – Deep memory hierarchies:
       • On Chip and off chip.
   – Must be handled almost explicitly in programs to get good performance
       • A factor of 10 (or even 50) slowdown is possible with bad cache
         behavior
       • Increase reuse of data: if the data is in cache, use it for as many
         of the different things you need to do as possible
       • Blocking helps
       • Blocking helps
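
As an illustration of blocking, the sketch below contrasts a plain triple-loop matrix multiply with a tiled version that works on one small block at a time, so that each block is reused many times while it is (ideally) still in cache. The function names and the block size of 32 are arbitrary choices for this example. In pure Python the interpreter overhead hides any cache effect, so the code shows only the loop structure; the payoff appears in compiled languages such as C or Fortran.

```python
def matmul_naive(a, b):
    """Reference triple loop: C = A * B for square lists of lists."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, block=32):
    """Same product, computed tile by tile (loop tiling / blocking)."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):          # tile row of C
        for kk in range(0, n, block):      # tile along the summation index
            for jj in range(0, n, block):  # tile column of C
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        aik = a[i][k]      # reused across the inner j loop
                        for j in range(jj, min(jj + block, n)):
                            c[i][j] += aik * b[k][j]
    return c
```

The `min(...)` bounds handle matrix sizes that are not a multiple of the block size.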

    Application Complexity is increasing
• Why?
   – With more FLOPS, need better algorithms
       • Not enough to just do more of the same
   – Better algorithms lead to complex structure
   – Example: Gravitational force calculation
       • Direct all-pairs: O(N^2), but easy to parallelize
       • Barnes-Hut: O(N log N), but more complex
   – Multiple modules, dual time-stepping
   – Adaptive and dynamic refinements
• Ambitious projects
   – Projects with new objectives lead to dynamic behavior and
     multiple components
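
For the gravitational example above, a minimal sketch of the direct all-pairs method follows. The function name, the data layout (lists of tuples), and the optional `softening` parameter are illustrative assumptions. Each iteration of the outer loop writes only to its own body's force, which is what makes this version easy to parallelize: the outer loop can be divided among processors with no write conflicts. Barnes-Hut, which needs a tree of particle clusters, is not shown.

```python
import math

G = 6.674e-11  # gravitational constant, m^3 kg^-1 s^-2


def all_pairs_forces(positions, masses, softening=0.0):
    """Direct O(N^2) gravitational forces.

    positions: list of (x, y, z) tuples; masses: list of floats.
    Returns one [fx, fy, fz] force vector per body.
    """
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):              # independent per body: parallelizable
        for j in range(n):
            if i == j:
                continue
            d = [positions[j][k] - positions[i][k] for k in range(3)]
            r2 = sum(x * x for x in d) + softening * softening
            r = math.sqrt(r2)
            f = G * masses[i] * masses[j] / r2  # force magnitude
            for k in range(3):
                forces[i][k] += f * d[k] / r    # body i is pulled towards j
    return forces


if __name__ == "__main__":
    print(all_pairs_forces([(0, 0, 0), (2, 0, 0)], [1.0, 3.0]))
```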

Disparity between peak and attained speed
• As a combination of all of these factors:
   – The attained performance of most real applications is substantially
     lower than the peak performance of machines
   – Caution: expecting to attain peak performance is a pitfall.
       • We don’t use such a metric for our internal combustion engines, for
         example
       • But it gives us a metric to gauge how much improvement is possible

