Advanced Computer Architecture - PowerPoint by Levone


          Advanced Computer Architecture
• We will consider the issues involved in
  current Architecture design and
  implementation:
  – RISC instruction sets
  – Pipelining
  – Reducing Cache Misses and Optimizing Virtual
    Memory Usage
  – Designing an I/O System
  – Interconnection Networks and Multiprocessing
       A Different Perspective
• Rather than focusing on the roles of the
  architectural components, we will
  – Use Quantitative measures to test ideas
  – Use a RISC instruction set for examples
  – Discuss a variety of software and hardware
    techniques to provide optimization
  – Attempt to force as much parallelism out of the
    code as possible
         How did we get here?
– Growth of microprocessor capabilities and reduction
  in prices in the past few years (see figure 1.1 p. 2)
– Dominance of micro’s (workstations, PCs) in society
– Freedom from compatibility with older architectures
  and designs
– Renaissance in computer design emphasizing new
  architectural innovations
– Efficient use of improvements in hardware and
  compiler technologies
– Sustaining these improvements will require continued
  innovations in design! Our task!
          Tasks of an Architect
• Determine what attributes (HSA, ISA) are
  important for a new machine
• Design the machine to maximize its performance
  while staying within cost constraints
  – ISA design, functional organization, logic design,
    implementation of circuits, packaging, power supply
    & cooling
• We will concentrate on the first 3
• See figure 1.2, page 5
    Trends in Computer Usage
• Replacing assembly language with high-
  level languages - easier software
  compatibility, less restrictive on hardware
• Memory needs of software grow by a
  factor of 1.5 to 2 every year
• Compiler technology continues to improve
  (optimization)
• Improved ISA’s moving towards RISC
     Trends in Hardware Technology
•   Transistor density increases 50%/year
•   Die sizes increase 10-25%/year
•   Combined, transistor count increases 60-80%/year
•   DRAM density increases 60%/year
•   DRAM cycle time decreases 1/3 in 10 years
•   Disk density increases 50%/year
•   Disk access time decreases 1/3 in 10 years
                    Cost
• Factors:
  – learning curve (manufacturing cost which
    decreases over time)
  – Yield (the fraction of manufactured
    components that work correctly)
  – Volume (how many units are mass produced)
  – Commoditization (how widely available the
    product is)
  – Packaging
• See figure 1.3, page 9
   Formulas for computing cost
– Cost(IC) = (Cost(die) + Cost(testing die) +
  Cost(packaging and final test)) / Final test yield
– Cost(die) = Cost(wafer)/(Dies per wafer * Die yield)
– Dies per wafer = pi * (Wafer diameter/2)^2 / Die area
  - pi * Wafer diameter / sqrt(2 * Die area)
– Example: # of dies per 20 cm wafer for a die that is
  1.5 cm on a side = pi * (20/2)^2 / 2.25 -
  pi*20/sqrt(2*2.25) = 314/2.25 - 62.8/2.12 ≈ 110
           Another Example
– yield = wafer yield * (1 + dpua * die area / a) ^ -a
– dpua is defects per unit area, a is a parameter
  roughly corresponding to the number of masking
  levels (a measure of manufacturing complexity)
– Find the die yield for dies that are 1 cm on a side,
  assuming a defect density of .8 per cm2. What about
  1.5 cm on a side?
– Total die areas are 1 cm2 and 2.25 cm2 for the
  example.
– 1 cm per side: die yield = (1 + .8*1/3)^-3 = .49
– 1.5 cm per side: die yield = (1 + .8*2.25/3)^-3 = .24
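The die-yield model can likewise be sketched in a few lines (assuming a wafer yield of 1 and a = 3, as the example implicitly does):

```python
def die_yield(dpua, die_area_cm2, a=3.0, wafer_yield=1.0):
    """Die yield model: wafer_yield * (1 + dpua * die_area / a)^(-a),
    where dpua is defects per unit area and a reflects masking levels."""
    return wafer_yield * (1 + dpua * die_area_cm2 / a) ** -a

print(round(die_yield(0.8, 1.0), 2))   # 0.49 for a 1 cm x 1 cm die
print(round(die_yield(0.8, 2.25), 2))  # 0.24 for a 1.5 cm x 1.5 cm die
```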
        Reporting Performance
• What does it mean that one computer is
  faster than another?
• We might use terms such as:
  –   execution time (also called response time)
  –   throughput
  –   wall-clock time (or elapsed time)
  –   CPU time, user CPU time, system CPU time
  –   System performance
  –   CPU performance
       Performance Measures
• To say X is n times faster than Y means that
  – Execution time Y / Execution time X = n
  – Performance X / Performance Y = n
• "The throughput of X is 1.3 times higher than
  Y" means that X can execute 1.3 times as
  many tasks as Y in the same amount of time
          Comparing Results
• MIPS, MegaFLOPS, MHz, throughput,
  response time, etc… can all be misleading
  measures on their own
• Computer X might run at 300 MHz and Y at
  100 MHz, and yet Y might have a higher
  throughput than X
• It is important to compare computers that
  are performing the same (or equivalent)
  tasks - this is the only way to get accurate
  comparisons
                  Benchmarks
• There are four levels of programs that can be
  used to test performance
  – Real programs - e.g., C compiler, TeX, CAD tool,
    programs that have input and output and options that
    the user can select
  – Kernels - remove key pieces of programs and just
    test those
  – Toy benchmarks - 10-100 lines of code such as
    quicksort whose performance is known in advance
  – Synthetic benchmarks - try to match average
    frequency of operations to simulate those
    instructions in some large program
           Benchmark Suite
• A set of programs that test different
  performance metrics such as arrays, floating
  point operations, loops, etc…
• SPEC92 is a commonly quoted benchmark
  suite
• One problem that has arisen is that some
  architectures are now optimized to perform
  well on SPEC92 even though the computers
  produced may not be as good as others!
             SPEC92 programs
• Consists of source programs in C and FORTRAN
• Programs range from 272 to 83,589 lines of code
• Real-world applications such as circuit simulator,
  Monte Carlo simulation of nuclear reactor,
  chemical application that solves equations for a
  model of 500 atoms, matrix multiplication and
  FFT, neural net training simulator, lisp
  interpreter, spreadsheet computations, etc…
• See figure 1.9, page 22
 Reporting Performance Results
• One important factor is that performance
  results be reproducible -
  – However, reported results may omit such
    information as the input, compiler settings,
    version of compiler, version of OS, size and
    number of disks, etc…
  – SPEC benchmark reports must include
    information like compiler flags, fairly complete
    description of the machine, and results running
    with normal and optimized compilers
       Comparing Performances
• Consider Figure 1.11, page 24, we can say:
  –   A is 10 times faster than B for P1
  –   B is 10 times faster than A for P2
  –   A is 20 times faster than C for P1
  –   C is 50 times faster than A for P2
  –   B is 2 times faster than C for P1
  –   C is 5 times faster than B for P2
• If we take one of these by itself, it does not
  give a real picture of the power of any
  computer -- but advertisers might use one of
  these anyway!
        A Consistent Measure
• We can solve the previous problem by
  computing total execution time for the 2
  programs and say
  – B is 9.1 times faster than A for P1 & P2
  – C is 25 times faster than A for P1 & P2
  – C is 2.75 times faster than B for P1 and P2
• We can also use arithmetic mean, harmonic
  mean, weighted mean and geometric mean
  to provide a better picture. See figures 1.12
  and 1.13 on pages 26-27 for example
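Using per-program execution times consistent with the ratios above (the concrete values of figure 1.11 are assumed here, e.g. P1 takes 1 second on A), total execution time yields one consistent ranking:

```python
times = {                 # seconds; assumed values consistent with the ratios above
    "A": {"P1": 1, "P2": 1000},
    "B": {"P1": 10, "P2": 100},
    "C": {"P1": 20, "P2": 20},
}
total = {machine: sum(t.values()) for machine, t in times.items()}
print(total)                              # {'A': 1001, 'B': 110, 'C': 40}
print(round(total["A"] / total["B"], 1))  # 9.1  -> B is 9.1x faster than A
print(round(total["A"] / total["C"], 1))  # 25.0 -> C is 25x faster than A
print(round(total["B"] / total["C"], 2))  # 2.75 -> C is 2.75x faster than B
```

Each per-program ratio from figure 1.11 holds, yet only the totals produce a single ordering (C fastest, A slowest) over the whole workload.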
             Amdahl’s Law
• A fundamental law describing the
  performance gain (speedup) obtained from
  some architectural improvement
• Speedup = performance of task in enh mode /
  performance without enh mode or
• Speedup = execution time without enh mode /
  execution time using enh mode when possible
        Using Amdahl’s Law
• We must consider two factors in using this:
  – Fraction of the computation time in the original
    machine that can be converted to take
    advantage of the enhancement
  – Improvement gained by the enhanced execution
    mode (how much faster will the task run if the
    enhanced mode is used for the entire program?)
• Speedup = 1 / [ (1 - Fraction enhanced) +
  Fraction enhanced / Speedup enhanced ]
                     Examples
• An enhancement runs 10 times faster but is only
  usable 40% of the time.
  – Speedup = 1 / [(1 - .4) + .4/10] = 1.56
• Suppose FP sqrt is responsible for 20% of
  instructions in a benchmark. We could add FP
  sqrt hardware that will speed up the performance
  by a factor of 10, or we could try to enhance all
  FP operations by a factor of 2 (1/2 of all
  instructions in the benchmark are FP operations)
  – Speedup FP sqrt = 1/[(1-.2) + .2/10] = 1.22
  – Speedup all FP = 1/[(1-.5) + .5/2] = 1.33
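Both examples follow directly from the speedup formula; a minimal sketch:

```python
def speedup(fraction_enhanced, speedup_enhanced):
    """Amdahl's law: overall speedup when an enhancement applies
    only to fraction_enhanced of the original execution time."""
    return 1 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

print(round(speedup(0.4, 10), 2))  # 1.56 -- 10x enhancement, usable 40% of the time
print(round(speedup(0.2, 10), 2))  # 1.22 -- FP sqrt hardware
print(round(speedup(0.5, 2), 2))   # 1.33 -- all FP ops 2x faster
```

The second comparison shows why a modest improvement to a frequent case can beat a dramatic improvement to a rare one.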
          CPU Performance
– CPU time = CPU clock cycles * clock cycle time
– CPU time = CPU clock cycles for prog / Clock rate
– IC - instruction count (# of instructions in the
  program), CPI - clock cycles per instruction
– CPI = CPU clock cycles for prog / IC
– CPU time = IC * CPI * Clock cycle time
– CPU time = (Sum over i of CPI_i * IC_i) * clock cycle time
– CPI = Sum over i of (CPI_i * IC_i / Instruction Count)
                       Example
  –   Frequency of FP operations = 25%
  –   Average CPI of FP operations = 4.0
  –   Average CPI of other instructions = 1.33
  –   Frequency of FP sqrt = 2%, CPI of FP sqrt = 20
  –   CPI = 4*25%+1.33*75% = 2.0
• Two alternatives: reduce CPI of FP sqrt to 2 or
  reduce CPI of all FP ops to 2
  – CPI new FP sqrt = CPI original - 2% * (20-2) = 1.64
  – CPI new FP = 75%*1.33+25%*2.0=1.5
  – Speedup new FP = CPI original/CPI new FP = 1.33
    (refer back to previous example)
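The CPI arithmetic above can be verified directly (values from the example; small rounding differences aside):

```python
# Weighted CPI = sum over instruction classes of CPI_i * frequency_i
cpi_orig = 0.25 * 4.0 + 0.75 * 1.33        # ~2.0
cpi_new_sqrt = cpi_orig - 0.02 * (20 - 2)  # FP sqrt CPI cut from 20 to 2: ~1.64
cpi_new_fp = 0.75 * 1.33 + 0.25 * 2.0      # all FP ops at CPI 2: ~1.5
print(cpi_orig, cpi_new_sqrt, cpi_new_fp)
print(cpi_orig / cpi_new_fp)               # speedup of the all-FP option: ~1.33
```

Improving all FP operations (25% of instructions) again beats a much bigger improvement confined to the 2% that are FP sqrt.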
 CPU Components’ Performance
• Large part of a Comp. Architect’s job is to
  design tools or means of measuring the
  CPU component performances
  – Low level tools: timing estimators
• We can also measure the instruction count
  for a program using compiler technology,
  using the program execution duration and
  the instruction mix
• Execution-based monitoring: instrument the
  program with code that records the
  instruction mix during execution
            Measuring CPI
• Requires knowing the processor’s
  organization and the instruction stream
• Designers may use Average CPIs but this is
  influenced by cache and pipeline structures
• We might assume a perfect memory system
  that does not cause delays
• Pipeline CPI measures can be determined
  by simulating the pipeline (which might be
  sufficient for simple pipes but not for
  advanced pipes)
    More Ex’s of CPU Performance
• 2 alternatives for a conditional branch instruction
  – CPU A: condition code is set by a compare instruction
    and followed by a branch that tests the condition code
  – CPU B: compare is included in the branch
• Conditional branch takes 2 cycles, all other
  instructions take 1 clock cycle. For CPU A, 20%
  of all instructions are conditional branches.
• Assume CPU A has a clock rate 1.25 times that
  of CPU B (since CPU A does not include the
  compare in the branch, its cycle time can be
  shorter)
• Which CPU is faster?
                     Solution
• CPI A = .2 * 2 + .8 * 1 = 1.2
• CPU time A = IC A * 1.2 * Clock Cycle time A
• A’s clock rate is 1.25 times higher than B’s.
  Compares are folded into branches on B, so
  branches are now 25% of B’s instructions and
  other instructions are 75%
• CPI B = .25 * 2 + .75 * 1 = 1.25
• CPU time B = IC B * 1.25 * Clock Cycle time B
      = .8 * IC A * 1.25 * 1.25 * Clock Cycle time A
• CPU time B = 1.25 * IC A * Clock Cycle time A
• So, CPU time A is shorter than B and so A is faster
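The comparison can be replayed with normalized values (CPU A's instruction count and clock cycle time both set to 1):

```python
# CPU A: separate compare + branch; CPU B: fused compare-and-branch.
ic_a = 1.0                       # normalize A's instruction count
cct_a = 1.0                      # normalize A's clock cycle time
cpi_a = 0.2 * 2 + 0.8 * 1        # 20% branches at 2 cycles, rest at 1
time_a = ic_a * cpi_a * cct_a    # 1.2

ic_b = 0.8 * ic_a                # compares disappear into branches
cct_b = 1.25 * cct_a             # B's clock cycle is 1.25x longer
cpi_b = 0.25 * 2 + 0.75 * 1      # branches are now 25% of instructions
time_b = ic_b * cpi_b * cct_b    # 1.25

print(round(time_a, 2), round(time_b, 2))  # 1.2 1.25 -> A is faster
```

B executes fewer instructions, but its slower clock more than cancels the saving, which is the point of the example.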
            Memory Hierarchy
•   Register (CPU)
•   Cache
•   Main Memory
•   I/O Devices
    – Hard Disk
    – Optical disk, floppy disk
    – Magnetic tape
• See figures 1.15 and 1.16 on p. 40-41
         Cache Performance
• Assume: Cache is 10 times faster than
  memory and cache hit rate is 90%. How
  much speedup is gained using this cache?
• Use Amdahl’s law:
• Speedup = 1 / [(1-90%) + (90%/10)] =
           1/[.1+.9/10] = 5.3
• Over a 5 times speedup by using cache with
  these specifications!
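The same result falls out of Amdahl's law when the cache is treated as the enhancement:

```python
# 90% of accesses hit in cache (10x faster); 10% go to main memory.
hit_rate = 0.9
cache_speedup = 10
overall = 1 / ((1 - hit_rate) + hit_rate / cache_speedup)
print(round(overall, 1))  # 5.3
```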
        Memory Impact on CPU
• In a pipeline, a memory stall will occur if the
  memory fetch of an operand is not found in cache
  – CPU execution time = (CPU clock cycles + Memory
    stall cycles) * Clock cycle
  – Memory stall cycles = number of misses * miss penalty
     = IC * misses per instruction * miss penalty
     = IC * mem ref’s/instr * miss rate * miss penalty
  – Miss rate is determined by cache efficiency
  – Miss penalty is determined by main memory system
    speed (also bus load and bandwidth, etc…)
                  Example

• Assume a machine with
  – CPI = 2.0 when all memory accesses are hits
  – Only data accesses are loads and stores (40% of
    all instructions are loads and stores)
  – Miss penalty = 25 clock cycles
  – Miss rate = 2%
  – How much faster would the machine be if all
    accesses are hits?
                     Solution
• For machine with no misses:
  – CPU exec. Time = (CPU clock cycles + memory stall
    cycles) * clock cycle = (IC * CPI + 0) * clock cycle
• For machine with 2% miss rate:
  – Memory stall cycles = IC * memory references/instr
    * miss rate * miss penalty = IC * (1 + .4) * .02 * 25 =
    IC * .7
  – CPU exec Time = (IC * 2.0 + IC * .7) * clock cycle =
    2.7 * IC * clock cycle
• So, the machine with no misses is 2.7/2.0 times
  faster or 1.35 times faster
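A quick numeric check of the example (instruction count normalized to 1):

```python
ic = 1.0                           # normalized instruction count
base_cpi = 2.0                     # CPI with all hits
mem_refs_per_instr = 1 + 0.4       # 1 instruction fetch + 0.4 data accesses
stall_cycles = ic * mem_refs_per_instr * 0.02 * 25  # miss rate * miss penalty
cycles_real = ic * base_cpi + stall_cycles          # 2.7 cycles
cycles_ideal = ic * base_cpi                        # 2.0 cycles
print(round(cycles_real / cycles_ideal, 2))  # 1.35 -> all-hit machine is 1.35x faster
```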
                  Fallacies
• MIPS is an accurate measure for comparing
  performance among computers
• MFLOPS is a consistent and useful measure of
  performance
• Synthetic benchmarks predict performance for
  real programs
• Benchmarks remain valid indefinitely
• Peak performance tracks observed performance
        What is wrong with MIPS?
• MIPS is dependent on the instruction set making it
  difficult to compare with computers using different
  ISAs
• MIPS varies between programs on the same
  computer!
• MIPS can vary inversely to performance
  – Example: floating point operations which might be
    implemented in floating point hardware (and thus not
    counted in MIPS) or as simple integer instructions,
    providing a higher MIPS rating though a slower outcome
  Example: optimized compiler
• Optimized compiler for load-store machine
  with specs as shown in figure 1.17, p. 45
• Compiler discards 1/2 of the ALU
  instructions
• Assume a 2 nsec clock cycle (and no system
  issues), 1.57 unoptimized CPI, what is the
  MIPS rating for optimized vs. unoptimized
  code?
                   Solution
• CPI unopt = 1.57
• MIPS unopt = (500 × 10^6) / (1.57 × 10^6) = 318.5
• CPU time unopt = IC unopt * 1.57 * 2×10^-9
      = 3.14 × 10^-9 * IC unopt
• CPI opt = [(.43/2)*1 + .21*2 + .12*2 + .24*2] /
      [1 - (.43/2)] = 1.73
• MIPS opt = (500 × 10^6) / (1.73 × 10^6) = 289.0
• CPU time opt = (.785 * IC unopt) * 1.73 * 2×10^-9
      = 2.72 × 10^-9 * IC unopt
• Optimized code is 3.14/2.72 = 1.15 times faster,
  but its MIPS rating is lower!
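The MIPS paradox can be reproduced numerically (the 500 MHz clock matches the 2 ns cycle; the instruction count is an arbitrary normalization):

```python
clock_hz = 500e6               # 500 MHz clock = 2 ns cycle, as in the example
cpi_unopt, cpi_opt = 1.57, 1.73
ic_unopt = 1e9                 # arbitrary unoptimized instruction count
ic_opt = 0.785 * ic_unopt      # half of the 43% ALU instructions removed

def mips(cpi):
    """Native MIPS rating: clock rate / (CPI * 10^6); note it ignores IC."""
    return clock_hz / (cpi * 1e6)

def cpu_time(ic, cpi):
    return ic * cpi / clock_hz

print(round(mips(cpi_unopt), 1))  # 318.5
print(round(mips(cpi_opt), 1))    # 289.0  <- lower MIPS...
print(cpu_time(ic_unopt, cpi_unopt) / cpu_time(ic_opt, cpi_opt))
# ...yet the optimized code runs ~1.16x faster (1.15 with the slide's rounding)
```

MIPS falls because the discarded ALU instructions were the cheapest ones (CPI 1), raising average CPI even as total work shrinks.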

								