CS 575 Parallel Processing
Lecture one: Introduction

Wim Bohm
Colorado State University

Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 license.


Course Topics
     Introduction, Background
          Orders of magnitude, Recurrences
     Models of Parallel Computing, communication
     Performance, Speedup, Efficiency
     Parallel Algorithms
          Dense Linear Algebra
          Sorting
          Graphs
          Search
          Fast Fourier Transform




Course Organization
     Course reorganization
          Unite 575 and 575dl
          Modernize: more // algorithms, GPUs
          We have separate course streams in networking and distributed systems
     Check the web page regularly
     Course organization is described on the web
          www.cs.colostate.edu/~cs575dl
          let's go look...
     Project changes regularly to stay fresh
          second half of the course
          GPUs/CUDA


Cost effective Parallel Computing
     Off the shelf, commodity processors are very fast
     Memory is very cheap
     Building a processor that is a small factor faster costs an order of magnitude more
     Clusters:
          Cheapest way to get more performance: multiprocessor
          NoW: Networks of Workstations
          Datacenters employ O(100K) simple processors with cheap interconnects
          Workstation can be an SMP
               Shared memory, Bus or Crossbar (e.g. Cray)

Wile E. Coyote’s Parallel Computer
     Get a lot of the fastest processors
     Get a lot of memory per processor
     Get the fastest network
     Hook it all together

     And then what ???


Now you gotta program it!
     Parallel programming introduces:
          Task partitioning, task scheduling
          Data partitioning, distribution
          Synchronization
          Load balancing
          Latency issues
               hiding
               tolerance

Problem with Wile E. Coyote Architecture
     For high speed, processors have lots of state
          Cache, stack, global memory
     To tolerate latency, we need fast context switch. WHY?
     No free lunch: can’t have both
          Certainly not if the processor was not designed for both
     Memory wall: memory gets slower and slower. WHY? HOW?
          in terms of the number of cycles it takes to access memory
          Memory hierarchy gets more complex

Sequential Algorithms
     Efficient Sequential Algorithms
          Minimize time, space
          Maximize state (avoiding re-computation)
          Efficiency is portable
               Efficient program on Pentium ~ Efficient program on Opteron


Parallel Algorithms
     Efficient Parallel Algorithms
          Use efficient sequential algorithms
          Maximize parallelism
               re-computation is sometimes better than communication
          Minimize overhead
               synchronization, remote accesses
          Parallel efficiency is Architecture Dependent

Speedup
     Ideal: n processors -> n fold speedup
          Ideal not always possible. WHY?
               Tasks are data dependent
               Not all processors are always busy
               Remote data needs communication
                    Memory wall PLUS Communication wall
     Linear speedup: α n speedup (α <= 1)
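One standard way to make the speedup terminology precise (T_1 and T_n are conventional notation, not from the slide, and the concrete times are invented for illustration):

     Speedup on n processors:  S(n) = T_1 / T_n   (T_1 = best sequential time, T_n = time on n processors)
     Linear speedup:           S(n) = α n  with  α <= 1
          e.g. T_1 = 100 sec, T_16 = 10 sec  =>  S(16) = 10, so α = 10/16 ≈ 0.63
     Super linear speedup:     α > 1, i.e. S(n) > n
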
Super linear speedup
     Super linear speedup: α > 1
          Discuss... is it possible?
     Nonsense!
          Because we can execute the faster parallel program sequentially
     No nonsense!!
          Because parallel computers do not just have more processors,
          they have more local memory / caches

Parallel Programming Paradigms
     Implicit parallel programming: Super Compilers
          Compiler extracts parallelism from sequential code
               Distributes data, creates and schedules tasks
          Complication: side effects
               - the sequential order of reads and writes to a memory location
                 determines the program outcome
               - a parallelizing compiler must obey the sequential order of
                 side effecting statements and still create //ism
               - pointers, aliases, and indirect array references make it hard or
                 impossible to analyze which statements access which locations
               - 40 years of compiler research for general purpose parallel
                 computing has not brought much result


Paradigms cont’
     Implicit parallel programming cont’
          Simple, clean case: Functional Programming (FP)
               Functions: no side effects, order of execution less constrained
                    F( P(x,y), Q(y,z) ): P and Q can be executed in parallel
                    (see the sketch after this slide)
               Simple single assignment memory model: no pointers, no
               write after read or write after write hazards (dataflow semantics)
               FP was long deemed too high level and too inefficient,
               because the simple memory model causes lots of copies
               FP is coming back: the MapReduce approach in data
               centers (Google) is a data parallel functional paradigm
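To illustrate the F( P(x,y), Q(y,z) ) point, a minimal sketch in C with OpenMP (one of the explicit approaches listed on the next slide; the functions P, Q, F and the constants are invented for the example). Because P and Q are side-effect free, the two calls may run in either order or at the same time; the pragmas only make explicit the independence a functional language could exploit implicitly.

     #include <stdio.h>

     /* Hypothetical side-effect-free functions: they only read their
        arguments and return a value, so the two calls are independent. */
     static int P(int x, int y) { return x * y; }
     static int Q(int y, int z) { return y + z; }
     static int F(int a, int b) { return a - b; }

     int main(void) {
         int x = 2, y = 3, z = 4;
         int a, b;

         /* No side effects, so the two arguments of F may be
            evaluated concurrently. */
         #pragma omp parallel sections
         {
             #pragma omp section
             a = P(x, y);
             #pragma omp section
             b = Q(y, z);
         }

         printf("F(P(x,y), Q(y,z)) = %d\n", F(a, b));
         return 0;
     }

A sequential compiler simply ignores the pragmas and evaluates P, then Q, in order.
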
Explicit parallel programming
     Explicit parallel programming
          Multithreading: OpenMP, Pthreads
          Message Passing: MPI
          Data parallel programming (important niche): CUDA
     Explicit Parallelism complicates programming
          creation, allocation, scheduling of processes
          data partitioning
          Synchronization ( semaphores, locks, messages )
     (a minimal multithreading sketch follows below)
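To make the explicit multithreading style concrete, a minimal OpenMP sketch in C (the array size and the per-element work are invented for illustration); here the programmer, not the compiler, decides that the loop is split across threads:

     #include <stdio.h>

     #define N 1000000

     static double a[N], b[N];

     int main(void) {
         /* Each iteration is independent, so the programmer explicitly
            asks for the iterations to be divided among threads. */
         #pragma omp parallel for
         for (int i = 0; i < N; i++)
             a[i] = 2.0 * b[i] + 1.0;

         printf("a[0] = %f\n", a[0]);
         return 0;
     }

MPI would express the same computation with explicit data partitioning and messages, and CUDA with explicit kernels and thread blocks, which is where the creation, partitioning, and synchronization concerns listed above come from.
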

Example 1: Weather Prediction
     Area, segments
          3000*3000*11 cubic miles
          .1*.1*.1 cubic mile segments: ~ 10^11 segments
     Two day prediction
          half hour time steps: ~ 100 time steps
     Computation per segment
          Temp, Pressure, Humidity, Wind speed, Wind direction
          for each time step in each segment
          Assume ~ 100 FLOPs per time step per segment

Performance: Weather Prediction
     Computational requirement: ~ 10^15 FLOPs
     Assume one FLOP per clock cycle
     1 core: 4 GHz
     Total serial time: 25*10^4 sec ~ 70 hours
     Not too good for a 48 hour weather prediction


Parallel Weather Prediction
     1 K workstations, grid connected
          10^8 segment computations per processor (per time step)
          10^8 instructions per second
          100 instructions per segment computation
          100 time steps: 10^4 seconds = ~ 3 hours
               Much more acceptable
          Assumption: Communication is not a problem here
               Why is this assumption reasonable?
     More workstations:
          finer grid, better accuracy
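A worked check of the two estimates above, using only the numbers on the slides:

     Total work:   10^11 segments * 100 time steps * 100 FLOPs  ≈  10^15 FLOPs

     Serial:       1 core at 4 GHz, one FLOP per cycle  =>  4*10^9 FLOPs/sec
                   10^15 / (4*10^9)  =  2.5*10^5 sec  =  25*10^4 sec  ≈  70 hours

     Parallel:     1000 workstations  =>  10^11 / 10^3  =  10^8 segments per processor per time step
                   at the assumed 10^8 instructions/sec and 100 instructions per segment:
                   (10^8 * 100) / 10^8  =  100 sec per time step
                   100 time steps  =>  10^4 sec  ≈  3 hours
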
Example 2: N body problem
     Astronomy: bodies in space
          Attract each other: Gravitational force (Newton’s law)
          O(n^2) calculations per “snapshot”
               Galaxy: ~ 10^11 bodies -> ~ 10^22 calculations/snapshot
               Calculation: 1 micro sec
               Snapshot: 10^16 secs = ~ 10^11 days = ~ 3*10^8 years
          Is parallelism going to help us? NO
          What does help? A better algorithm: Barnes Hut
               Divides the space in a “quad tree”
               Treats “far” away quads as one body
          (the direct O(n^2) force loop is sketched below)
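A minimal sketch of the direct all-pairs force calculation whose O(n^2) cost the slide estimates (plain C; the Body struct and the use of double precision are illustration choices, not from the slides). Barnes-Hut avoids most of this work by treating far-away groups of bodies as one body:

     #include <math.h>

     #define G 6.674e-11   /* gravitational constant */

     typedef struct { double x, y, z, mass, fx, fy, fz; } Body;

     /* One "snapshot": accumulate the gravitational force on every body
        from every other body -- n*(n-1) pair interactions, i.e. O(n^2). */
     void compute_forces(Body *b, int n) {
         for (int i = 0; i < n; i++) {
             b[i].fx = b[i].fy = b[i].fz = 0.0;
             for (int j = 0; j < n; j++) {
                 if (j == i) continue;
                 double dx = b[j].x - b[i].x;
                 double dy = b[j].y - b[i].y;
                 double dz = b[j].z - b[i].z;
                 double r2 = dx*dx + dy*dy + dz*dz;
                 double r  = sqrt(r2);
                 double f  = G * b[i].mass * b[j].mass / r2;   /* Newton's law */
                 b[i].fx += f * dx / r;
                 b[i].fy += f * dy / r;
                 b[i].fz += f * dz / r;
             }
         }
     }

Parallelism only divides these 10^22 interactions among processors; Barnes-Hut reduces the number of interactions itself, to roughly O(n log n) per snapshot.
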

Other Challenging Applications
     Satellite data acquisition: billions of bits / sec
          Pollution levels, Remote sensing of materials
          Image recognition
     Discrete optimization problems
          Planning, Scheduling, VLSI design
     Bio-informatics, computational chemistry
     Airplane/Satellite/Vehicle design
     Internet (Google search)

Application Specific Architectures
     ASICs: Application Specific Integrated Circuits
     Levels of ‘specificity’
          Full custom ASICs
          Standard cell ASICs
          Field programmable gate arrays
     Computational models
          Dataflow graphs
          Systolic arrays
     Promising orders of magnitude better performance, lower power


ASICs cont’
     How much faster than general purpose?
          Example: 1D 1024 FFT
               General purpose machine (G4): 25 micro secs
               ASIC device (MIT Lincoln Labs): 32 nano secs
               ASIC device uses 20 milliwatts (100 * less power)
     Other applications
          Finite Impulse Response (FIR) Filters
          Matrix multiply
          QR decomposition
          What do these all have in common?

Background
     If you do not have the necessary background in analysis of algorithms
          See the book
               Introduction to Algorithms by Cormen, Leiserson, Rivest and Stein
               Or go online
          Topics to study
               Introduction
               Growth of functions
               Summations
               Recurrences


Background: Orders of Magnitude
     O, Ω, Θ
          f(n) = O(g(n)) iff ∃ c, n0 : f(n) < c·g(n) ∀ n > n0
               used for the upper bound of algorithm complexity
          f(n) = Ω(g(n)) iff ∃ c, n0 : f(n) > c·g(n) ∀ n > n0
               used for the lower bound of problem complexity
          f(n) = Θ(g(n)) iff f(n) = O(g(n)) and f(n) = Ω(g(n))
               “Tight” bound
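A small worked instance of the O and Ω definitions above (the function and constants are chosen for illustration):

     f(n) = 3n^2 + 5n is O(n^2):  take c = 4, n0 = 5; for all n > 5,  3n^2 + 5n < 3n^2 + n·n = 4n^2
     f(n) = 3n^2 + 5n is Ω(n^2):  take c = 3, n0 = 1; for all n > 1,  3n^2 + 5n > 3n^2
     Hence f(n) = Θ(n^2).
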




Background: Closed problems
     Closed problem P:
          ∃ algorithm X with O(X) = Ω(P)
          e.g. sorting has a tight bound: Θ(n log n)
     Problem P has an algorithmic gap:
          P is not closed, e.g. all NP Complete problems
          (problems with a polynomial lower bound but currently an
           exponential upper bound, such as TSP)


Divide and conquer recurrence
     A_n = C A_{n/d} + f(n)
     The Cormen, Leiserson et al. master method is complex.
     An easier version, from Rosen:  A_n = C A_{n/d} + k n^p

          A_n = O(n^p)          if C < d^p     e.g. A_n = 3 A_{n/2} + n^2
          A_n = O(n^p log n)    if C = d^p     e.g. A_n = 2 A_{n/2} + n
          A_n = O(n^(log_d C))  if C > d^p     e.g. A_n = 3 A_{n/2} + k

     Discuss binary search and merge sort (worked out below)
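The two cases the slide asks about, worked out with the Rosen form above (k denotes the constant per-step work):

     Binary search:  A_n = 1·A_{n/2} + k       C = 1, d = 2, p = 0, so C = d^p
                     =>  A_n = O(n^0 log n) = O(log n)

     Merge sort:     A_n = 2·A_{n/2} + k·n     C = 2, d = 2, p = 1, so C = d^p
                     =>  A_n = O(n^1 log n) = O(n log n)
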
Recurrence Relations
     Algorithmic complexity often described using recurrence relations:
          f(n) = R( f(1) .. f(n-1) )
     Two important types of recurrence relations
          Linear
          Divide and Conquer
     cs420(dl) covers these
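For contrast with the divide and conquer case, one small linear recurrence, chosen here as an illustration (it is the cost pattern of, e.g., selection sort):

     Linear:  f(n) = f(n-1) + c·n,  f(1) = c
              Unrolling:  f(n) = c·(1 + 2 + ... + n) = c·n(n+1)/2 = O(n^2)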

				