Course objectives and outline
• You will learn about:
– Parallel architectures overview
• Message passing support, routing, interconnection networks
• Cache-coherent scalable shared memory, synchronization
– Relaxed consistency models (?)
– Novel architectures: Tera, Blue Gene, Processors-in-memory
– Parallel programming models
• Emphasis on 3: message passing, shared memory, and shared objects
• Ongoing evaluation and comparison of models
– Commonly needed parallel algorithms/operations
• Analysis techniques
– Parallel application categories
– Performance analysis and optimization of parallel applications
– Parallel application case studies
Project and homeworks
• Significant (effort and grade percentage) course project
– Groups of 5 students
– Expect publication quality results
• Homeworks/machine problems:
– weekly (sometimes biweekly)
• Parallel machines:
– NCSA Origin 2000, Turing Cluster, SUN cluster, SMP machine
– Possible: Large machines for evaluating scalability:
• 1000 processor NCSA cluster
• 3000 processor Lemieux machine at PSC
• Much of the course will be run via the web
– Lecture slides and assignments will be available on the course web page
– Most of the reading material (papers, manuals) will be on the web
– Projects will coordinate and submit information on the web
• Web pages for individual projects will be linked to the course web page
– Newsgroup: uiuc.class.ece392
• You are expected to read the newsgroup and web pages
Advent of parallel computing
• “Parallel computing is necessary to increase speeds”
– Cry of the ‘70s
– Processors kept pace with Moore’s law:
• Doubling speeds every 18 months
• Now, finally, the time is ripe
– Uniprocessors are commodities (and processor speeds show signs
of slowing down)
– Highly economical to build parallel machines
Why parallel computing
• It is the only way to increase speed beyond uniprocessors
– Except, of course, waiting for uniprocessors to become faster!
– Several applications require orders of magnitude higher
performance than feasible on uniprocessors
• Cost effectiveness:
– older argument
– in 1985, a supercomputer cost 2000 times more than a desktop, yet
performed only 400 times faster.
– So: combine microcomputers to get speed at lower costs
– Incremental scalability:
• can get in-between performance points with 20, 50, 100,… processors
• You may get speedup lower than 400 on 2000 processors!
• Microcomputers became faster, effectively killing supercomputers
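The cost-effectiveness argument above can be checked with a few lines of arithmetic. The sketch below uses only the 1985 figures quoted on the slide (2000 times the cost, 400 times the speed); the function names and the idea of a "break-even efficiency" are illustrative assumptions, not part of the original material.

```python
def performance_per_cost(relative_speed, relative_cost):
    """Performance delivered per unit of cost, with one desktop as the unit."""
    return relative_speed / relative_cost


def efficiency_to_match(supercomputer_speed, num_processors):
    """Parallel efficiency at which num_processors desktops match the supercomputer."""
    return supercomputer_speed / num_processors


# 1985 figures quoted above, measured in units of one desktop.
SUPER_COST = 2000.0
SUPER_SPEED = 400.0

super_ratio = performance_per_cost(SUPER_SPEED, SUPER_COST)
desktop_ratio = performance_per_cost(1.0, 1.0)

# 2000 desktops cost as much as one supercomputer; they only need to
# reach this fraction of ideal speedup to match it.
break_even = efficiency_to_match(SUPER_SPEED, 2000)

print(f"supercomputer performance per cost: {super_ratio:.2f}")
print(f"desktop performance per cost:       {desktop_ratio:.2f}")
print(f"break-even efficiency on 2000 processors: {break_even:.0%}")
```

Under these figures, 2000 desktops running at only 20% parallel efficiency already match the supercomputer for the same money, which is why a speedup below 400 on 2000 processors is the case to worry about.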
[Figure: trend plot, x-axis 1965 to 1995]
The natural building block for multiprocessors is now also about the fastest!
• Commodity microprocessors not only fast but CHEAP
• Development cost is tens of millions of dollars ($5-100 million typical)
• BUT, many more are sold compared to supercomputers
– Crucial to take advantage of the investment, and use the
commodity building block
– Exotic parallel architectures no more than special-purpose
• Multiprocessors being pushed by software vendors (e.g.
database) as well as hardware vendors
• Standardization by Intel makes small, bus-based SMPs commodity
• Desktop: few smaller processors versus one larger one?
– Multiprocessor on a chip
What to Expect?
• Parallel Machine classes:
– Cost and usage define a class! Architecture of a class may change.
– Desktops, engineering workstations, database/web servers, supercomputers
• Commodity (home/office) desktop:
– less than $10,000
– possible to provide 10-50 processors for that price!
– Driver applications:
• games, video /signal processing,
• possibly “peripheral” AI: speech recognition, natural language
understanding (?), smart spaces and agents
• New applications?
• Price: less than $100,000 (used to be):
– new acceptable price level may be $50,000
– 100+ processors, large memory,
– Driver applications:
• CAD (Computer aided design) of various sorts
• Structural and mechanical simulations…
• Etc. (many specialized applications)
• Price range: variable ($10,000 - several hundreds of thousands)
– defining characteristic: usage
– Database servers, decision support (MIS), web servers, e-commerce
• High availability, fault tolerance are main criteria
• Trends to watch out for:
– Likely emergence of specialized architectures/systems
• E.g. Oracle’s “No Native OS” approach
• Currently dominated by database servers, and TPC benchmarks
– TPC: Transaction Processing Performance Council; its benchmarks measure transaction throughput
– But this may change to data mining and application servers, with
corresponding impact on architecture.
• “Definition”: expensive system?!
– Used to be defined by architecture (vector processors, ..)
– More than a million US dollars?
– Thousands of processors
• Driving applications
– Grand challenges in science and engineering:
– Global weather modeling and forecast
– Rational Drug design / molecular simulations
– Processing of genetic (genome) information
– Rocket simulation
– Airplane design (wings and fluid flow..)
– Operations research?? Not recognized yet
– Other non-traditional applications?
Consider Scientific Supercomputing
• Proving ground and driver for innovative architecture and techniques
– Market smaller relative to commercial as MPs become mainstream
– Dominated by vector machines starting in the 1970s
– Microprocessors have made huge gains in floating-point performance
• high clock rates
• pipelined floating point units (e.g., multiply-add every cycle)
• instruction-level parallelism
• effective use of caches (e.g., automatic blocking)
– Plus economics
• Large-scale multiprocessors replace vector supercomputers
– Well under way already
– Except with the Earth Simulator: thousands of vector processors
Scientific Computing Demand
Engineering Computing Demand
• Large parallel machines a mainstay in many industries
– Petroleum (reservoir analysis)
– Automotive (crash simulation, drag analysis, combustion
– Aeronautics (airflow analysis, engine efficiency, structural
– Computer-aided design
– Pharmaceuticals (molecular modeling)
• in all of the above
• entertainment (films like Toy Story)
• architecture (walk-throughs and rendering)
– Financial modeling (yield and derivative analysis)
Applications: Speech and Image Processing
[Figure: speech and image processing applications plotted by required performance (1 MIPS to 10 GIPS) against year (1980 to 1995). Labels include sub-band speech coding, speech coding, 200-word recognition, ISDN-CD stereo receiver, CIF video, HDTV receiver, and 1,000-word and 5,000-word continuous speech recognition.]
• Also CAD, Databases, . . .
• 100 processors get you 10 years; 1000 get you 20!
Learning Curve for Parallel Applications
• AMBER molecular dynamics simulation program
• Starting point was vector code for Cray-1
• 145 MFLOPS on Cray90; 406 MFLOPS for the final version on a 128-processor Paragon;
891 MFLOPS on a 128-processor Cray T3D
• Scalability Challenges
– Machines are getting bigger and faster
– Communication Speeds?
– Memory speeds?
"Now, here, you see, it takes all the running you can do to keep
in the same place"
---Red Queen to Alice in “Through The Looking Glass”
– Applications are getting more ambitious and complex
• Irregular structures and dynamic behavior
Current Scenario: Machines
• Extremely High Performance machines abound
• Clusters in every lab
– GigaFLOPS per processor!
– 100 GFLOPS performance possible
• High End machines at centers and labs:
– Many thousand processors, multi-TF performance
– Earth Simulator, ASCI White, PSC Lemieux,..
• Future Machines
– Blue Gene/L : 128k processors!
– Blue Gene Cyclops Design: 1M processors
• Multiple Processors per chip
• Low Memory to Processor Ratio
• On clusters:
– 100 Mb/s Ethernet
• 100 μs latency
– Myrinet switches
• User level memory-mapped communication
• 5-15 μs latency, 200 MB/s bandwidth
• Relatively expensive, when compared with cheap PCs
– VIA, Infiniband
• On high end machines:
– 5-10 μs latency, 300-500 MB/s bandwidth
– Custom switches (IBM, SGI, ..)
– Communication speeds have increased but not as much as processor speeds
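To see what these latency and bandwidth figures mean for an application, a common first-order estimate is time = latency + size / bandwidth. The sketch below applies that model to the numbers above. The model itself, the function names, the single values picked from each quoted range, and the reading of the Ethernet figure as 100 Mb/s (12.5 MB/s) are all assumptions made for illustration.

```python
def transfer_time_us(message_bytes, latency_us, bandwidth_mb_per_s):
    """Estimated one-way message time in microseconds: latency + size / bandwidth."""
    # 1 MB/s is 1 byte per microsecond (taking 1 MB = 10**6 bytes).
    return latency_us + message_bytes / bandwidth_mb_per_s


def half_bandwidth_size(latency_us, bandwidth_mb_per_s):
    """Message size in bytes at which latency equals transmission time."""
    return latency_us * bandwidth_mb_per_s


# name -> (latency in microseconds, bandwidth in MB/s)
NETWORKS = {
    "Ethernet": (100.0, 12.5),        # assumes 100 Mb/s, i.e. 12.5 MB/s
    "Myrinet": (10.0, 200.0),         # midpoint of the 5-15 us range
    "High-end custom": (7.5, 400.0),  # midpoints of 5-10 us and 300-500 MB/s
}

for name, (latency, bandwidth) in NETWORKS.items():
    small = transfer_time_us(100, latency, bandwidth)
    large = transfer_time_us(1_000_000, latency, bandwidth)
    crossover = half_bandwidth_size(latency, bandwidth)
    print(f"{name:16s} 100 B: {small:8.1f} us   1 MB: {large:10.1f} us   "
          f"latency-bound below about {crossover:.0f} bytes")
```

Under this model, short messages are dominated by latency and long messages by bandwidth, so sending fewer, larger messages is usually the first optimization to try.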
Memory and Caches
• Bottom line again:
– Memories are faster, but not keeping pace with processors
– Deep memory hierarchies:
• On Chip and off chip.
– Must be handled almost explicitly in programs to get good performance
• A factor of 10 (or even 50) slowdown is possible with bad cache behavior
• Increase reuse of data: if the data is in cache, use it for as many
different things as you need to do
• Blocking helps
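As an illustration of blocking, the sketch below multiplies two square matrices tile by tile, so that each tile is reused many times before the computation moves on. The function names and the default block size are assumptions chosen for the example. The cache benefit shows up in compiled languages such as C or Fortran; in pure Python the interpreter overhead hides it, so treat this as a picture of the loop structure rather than a benchmark.

```python
def matmul_naive(a, b, n):
    """Reference triple loop over two n x n matrices stored as lists of lists."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0.0
            for k in range(n):
                total += a[i][k] * b[k][j]
            c[i][j] = total
    return c


def matmul_blocked(a, b, n, block=32):
    """Same product, computed one block x block tile at a time."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                # Inside these loops only three tiles are touched, so their
                # data can stay in cache while it is reused.
                for i in range(ii, min(ii + block, n)):
                    row_c = c[i]
                    for k in range(kk, min(kk + block, n)):
                        a_ik = a[i][k]
                        row_b = b[k]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += a_ik * row_b[j]
    return c


# Small check with integer-valued entries, so both orders of summation agree exactly.
n = 5
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i * j) for j in range(n)] for i in range(n)]
print(matmul_blocked(a, b, n, block=2) == matmul_naive(a, b, n))
```

The block size would normally be tuned so that three tiles fit in the cache level being targeted.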
Application Complexity is increasing
– With more FLOPS, need better algorithms..
• Not enough to just do more of the same..
– Better algorithms lead to complex structure
– Example: Gravitational force calculation
• Direct all-pairs: O(N^2), but easy to parallelize
• Barnes-Hut: O(N log N), but more complex
– Multiple modules, dual time-stepping
– Adaptive and dynamic refinements
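For the gravitational example above, the direct all-pairs method can be sketched in a few lines, which also shows why it parallelizes easily: the force on each body is computed independently of the others. The function name, the 2D setting, the gravitational constant of 1, and the optional softening term are assumptions made to keep the example short.

```python
import math


def all_pairs_forces(positions, masses, g=1.0, softening=0.0):
    """Net gravitational force on each body by the direct O(N^2) method (2D).

    With softening=0.0, no two bodies may share the same position.
    """
    n = len(positions)
    forces = []
    # Each iteration of this outer loop reads shared data but writes only its
    # own result, so the iterations could be distributed across processors.
    for i in range(n):
        fx = 0.0
        fy = 0.0
        for j in range(n):
            if i == j:
                continue
            dx = positions[j][0] - positions[i][0]
            dy = positions[j][1] - positions[i][1]
            dist_sq = dx * dx + dy * dy + softening
            dist = math.sqrt(dist_sq)
            magnitude = g * masses[i] * masses[j] / dist_sq
            fx += magnitude * dx / dist
            fy += magnitude * dy / dist
        forces.append((fx, fy))
    return forces


# Two unit masses one unit apart attract each other with unit force.
print(all_pairs_forces([(0.0, 0.0), (1.0, 0.0)], [1.0, 1.0]))
```

Barnes-Hut reduces the work by approximating distant groups of bodies through a tree, and that tree is the more complex, irregular structure the slide refers to.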
• Ambitious projects
– Projects with new objectives lead to dynamic behavior and
Disparity between peak and attained speed
• As a combination of all of these factors:
– The attained performance of most real applications is substantially
lower than the peak performance of machines
– Caution: Expecting to attain peak performance is a pitfall..
• We don’t use such a metric for our internal combustion engines, for example
• But it gives us a metric to gauge how much improvement is possible