									CIS775: Computer Architecture

     Chapter 1: Fundamentals of
         Computer Design

           Course Objectives
• To evaluate the issues involved in choosing and
  designing an instruction set.
• To learn the concepts behind advanced pipelining.
• To understand the “hitting the memory wall”
  problem and the current state of the art in memory
  system design.
• To understand the qualitative and quantitative
  tradeoffs in the design of modern computers.

  What is Computer Architecture?
  • Functional operation of the individual HW
    units within a computer system, and the
    flow of information and control among them.

  Computer architecture sits at the intersection of:
  • Technology
  • Parallelism
  • Programming languages
  • Interface design (ISA)
  • Hardware organization
  • Measurement & evaluation
     Computer Architecture Topics
• Input/Output and Storage
   – Disks, WORM, tape; RAID; emerging technologies
• Memory Hierarchy
   – DRAM, interleaved memories; L2 cache, L1 cache
   – Bandwidth, latency
• VLSI
• Instruction Set Architecture
   – Addressing, protection, exception handling
• Pipelining and Instruction-Level Parallelism
   – Pipelining, hazard resolution, superscalar, reordering,
     prediction, speculation, vector, DSP
  Computer Architecture Topics
• Multiprocessors
   – Shared memory, message passing, data parallelism
   – Processor-Memory-Switch organization (P-M nodes
     connected by an interconnection network)
• Networks and Interconnections
   – Network interfaces
   – Topologies, bandwidth, latency

       Measurement and Evaluation
Architecture is an iterative process:
• Searching the space of possible designs
• At all levels of computer systems
• Evaluating designs by cost/performance, separating
  the good ideas from the mediocre and the bad
 Issues for a Computer Designer
• Functional Requirements Analysis (Target)
   – Scientific computing – high-performance floating point
   – Business – transactional support/decimal arithmetic
   – General purpose – balanced performance for a range of tasks
• Level of software compatibility
   – PL level
      • Flexible, needs a new compiler, portability an issue
   – Binary level (x86 architecture)
      • Little flexibility, portability requirements minimal
• OS requirements
   – Address space issues, memory management, protection
• Conformance to standards
   – Languages, OS, networks, I/O, IEEE floating point
        Computer Systems: Technology
• 1988                             • 2002
  – Supercomputers                   – Powerful PCs and SMP workstations
  – Massively Parallel Processors    – Networks of SMP workstations
  – Mini-supercomputers              – Mainframes
  – Minicomputers                    – Supercomputers
  – Workstations                     – Embedded computers
  – PCs

 Why Such Change in 10 Years?
• Performance
   – Technology advances
      • CMOS (complementary metal oxide semiconductor) VLSI
        dominates older technologies like TTL (transistor–transistor
        logic) in cost AND performance
   – Computer architecture advances improve the low end
      • RISC, pipelining, superscalar, RAID, …
• Price: lower costs due to …
   – Simpler development
      • CMOS VLSI: smaller systems, fewer components
   – Higher volumes
   – Lower margins by class of computer, due to fewer
• Function: rise of networking/local
  interconnection technology
Growth in Microprocessor Performance

Six Generations of DRAMs

       Updated Technology Trends
            Capacity            Speed (latency)
Logic       4x in 4 years       2x in 3 years
DRAM        4x in 3 years       2x in 10 years
Disk        4x in 2 years       2x in 10 years
Network     10x in 5 years (bandwidth)

       • How much do these improve during your study period?
          BS (4 yrs)
          MS (2 yrs)
          PhD (5 yrs)
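Compounding these trends over a study period can be sketched with a short calculation (an illustration only; the growth rates are taken from the table above):

```python
def growth(factor, per_years, total_years):
    """Compound a '<factor>x every <per_years> years' trend over total_years."""
    return factor ** (total_years / per_years)

# DRAM capacity grows 4x every 3 years; logic speed 2x every 3 years
for degree, years in [("BS", 4), ("MS", 2), ("PhD", 5)]:
    dram = growth(4, 3, years)   # DRAM capacity multiplier over the degree
    logic = growth(2, 3, years)  # logic speed multiplier over the degree
    print(f"{degree}: DRAM capacity x{dram:.1f}, logic speed x{logic:.1f}")
```

So a five-year PhD spans roughly a tenfold growth in DRAM capacity under these rates.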
Cost of Microprocessors

          Integrated Circuit Costs

IC cost = (Die cost + Testing cost + Packaging cost) / Final test yield

Die cost = Wafer cost / (Dies per wafer * Die yield)

Dies per wafer = π * (Wafer_diam / 2)^2 / Die_Area
                 – π * Wafer_diam / sqrt(2 * Die_Area)
                 – Test dies

Die yield = Wafer yield * (1 + Defects_per_unit_area * Die_Area / α)^(–α)
            (α is a process-complexity parameter, typically ≈ 3)

Die cost goes roughly with (die area)^4
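These formulas can be put together in a short sketch (the wafer cost, defect density, and α here are illustrative assumptions, not figures from the text):

```python
import math

def dies_per_wafer(wafer_diam_cm, die_area_cm2, test_dies=0):
    # Wafer area over die area, minus edge loss, minus test dies
    return (math.pi * (wafer_diam_cm / 2) ** 2 / die_area_cm2
            - math.pi * wafer_diam_cm / math.sqrt(2 * die_area_cm2)
            - test_dies)

def die_yield(wafer_yield, defects_per_cm2, die_area_cm2, alpha=3.0):
    # Fraction of dies that work, per the yield model above
    return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)

def die_cost(wafer_cost, wafer_diam_cm, die_area_cm2,
             wafer_yield=1.0, defects_per_cm2=0.8, alpha=3.0):
    n = dies_per_wafer(wafer_diam_cm, die_area_cm2)
    y = die_yield(wafer_yield, defects_per_cm2, die_area_cm2, alpha)
    return wafer_cost / (n * y)

# Hypothetical example: $1200 wafer, 20 cm diameter, 1 cm^2 die
print(f"${die_cost(1200, 20, 1.0):.2f} per die")
```

Doubling the die area in this sketch both halves the dies per wafer and cuts the yield, which is why cost grows so steeply with area.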
        Performance Trends
• Workstation performance (measured in
  SPECmarks) improves roughly 50% per year
  (2X every 18 months)

• Improvement in cost/performance estimated
  at 70% per year

        Computer Engineering
An iterative cycle:
• Evaluate existing systems for bottlenecks
• Simulate new designs and organizations
• Implement the next-generation system
 How to Quantify Performance?
   Plane        DC to Paris   Speed      Passengers   Throughput (passengers x mph)

 Boeing 747     6.5 hours     610 mph    470          286,700

 Concorde       3 hours       1350 mph   132          178,200

• Time to run the task (ExTime)
   – Execution time, response time, latency
• Tasks per day, hour, week, sec, ns … (Performance)
   – Throughput, bandwidth
      The Bottom Line:
 Performance and Cost or Cost
      and Performance?
"X is n times faster than Y" means

 ExTime(Y)          Performance(X)
 ---------     =   ---------------
 ExTime(X)          Performance(Y)

• Speed of Concorde vs. Boeing 747
• Throughput of Boeing 747 vs. Concorde
• Cost is also an important parameter in the
  equation, which is why Concordes are being put
  out to pasture!
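The plane comparison above can be checked with a quick calculation:

```python
# (speed in mph, passengers) from the table above
planes = {"Boeing 747": (610, 470), "Concorde": (1350, 132)}

# Throughput = passengers moved per hour of flight (passenger-mph)
throughput = {name: mph * pax for name, (mph, pax) in planes.items()}

# The Concorde wins on latency (speed) ...
speed_ratio = planes["Concorde"][0] / planes["Boeing 747"][0]
# ... but the 747 wins on throughput
tput_ratio = throughput["Boeing 747"] / throughput["Concorde"]

print(f"Concorde is {speed_ratio:.1f}x faster")
print(f"747 has {tput_ratio:.1f}x the throughput")
```

Which plane is "faster" depends entirely on whether the metric is latency or throughput.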
          Measurement Tools
• Benchmarks, Traces, Mixes
• Hardware: Cost, delay, area, power estimation
• Simulation (many levels)
    – ISA, RT, Gate, Circuit
•   Queuing Theory
•   Rules of Thumb
•   Fundamental “Laws”/Principles
•   Understanding the limitations of any
    measurement tool is crucial.

       Metrics of Performance
Each level of a system has its natural metric:

  Application             Answers per month, operations per second
  ISA                     (Millions of) instructions per second: MIPS
                          (Millions of) FP operations per second: MFLOP/s
  Control,
  function units          Megabytes per second
  Transistors,
  wires, pins             Cycles per second (clock rate)

Cases of Benchmark Engineering
• The motivation is to tune the system to the benchmark
  to achieve peak performance.
• At the architecture level
   – Specialized instructions
• At the compiler level (compiler flags)
   – Blocking in SPEC89 → factor of 9 speedup
   – Incorrect compiler optimizations/reordering
   – Would work fine on the benchmark but not on other programs
• At the I/O level
   – SPEC92 spreadsheet program (sp)
   – Companies noticed that the output was always written
     to a file, so they stored the results in a memory buffer and
     only flushed it at the end (which was not measured).
   – One company eliminated the I/O altogether.
After putting in a blazing performance on the benchmark test,
Sun issued a glowing press release claiming that it had
outperformed Windows NT systems on the test.
Pendragon president Ivan Phillips cried foul, saying the results
weren't representative of real-world Java performance and that
Sun had gone so far as to duplicate the test's code within Sun's
Just-In-Time compiler. That's cheating, says Phillips, who claims
that benchmark tests and real-world applications aren't
the same thing.

Did Sun issue a denial or a mea culpa? Initially, Sun neither
denied optimizing for the benchmark test nor apologized for
it. "If the test results are not representative of real-world Java
applications, then that's a problem with the benchmark,"
Sun's Brian Croll said.

After taking a beating in the press, though, Sun retreated and
issued an apology for the optimization. [Excerpted from PC Online]
      Issues with Benchmarks
• Motivated by the bottom dollar: good
  performance on classic suites → more
  customers, better sales.
• Benchmark engineering → limits the
  longevity of benchmark suites
• Technology and applications → limit the
  longevity of benchmark suites

      SPEC: System Performance
        Evaluation Cooperative
• First Round 1989
  – 10 programs yielding a single number (“SPECmarks”)
• Second Round 1992
  – SPECint92 (6 integer programs) and SPECfp92 (14
    floating point programs)
     • Compiler flags unlimited (March 93)
• Third Round 1995
  – New set of programs: SPECint95 (8 integer programs) and
    SPECfp95 (10 floating point)
  – “Benchmarks useful for 3 years”
  – Single flag setting for all programs: SPECint_base95,
    SPECfp_base95
• SPEC CPU2000 (11 integer benchmarks – CINT2000,
  and 14 floating-point benchmarks – CFP2000)
SPEC 2000 (CINT2000) Results

SPEC 2000 (CFP2000) Results

 Reporting Performance Results
• Reproducibility: apply them on publicly
  available benchmarks
• Pecking/picking order:
  – Real programs
  – Kernels
  – Toy benchmarks
  – Synthetic benchmarks

 How to Summarize Performance
• Arithmetic mean (weighted arithmetic mean)
  tracks execution time: sum(Ti)/n or sum(Wi*Ti)
• Harmonic mean (weighted harmonic mean) of
  rates (e.g., MFLOPS) tracks execution time:
  n/sum(1/Ri) or 1/sum(Wi/Ri)
• Normalized execution time is handy for scaling
  performance (e.g., X times faster than a
  SPARCstation 10)
• But do not take the arithmetic mean of
  normalized execution times;
  use the geometric mean = (Product(Ri))^(1/n)
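A short sketch of why the geometric mean is the right summary for normalized times (the two machines and their execution times here are made-up numbers for illustration):

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    return math.prod(xs) ** (1 / len(xs))

# Execution times (seconds) of two programs on machines A and B
times_a = [10.0, 100.0]
times_b = [20.0, 50.0]

# Normalize both ways
norm_b_over_a = [b / a for a, b in zip(times_a, times_b)]  # [2.0, 0.5]
norm_a_over_b = [a / b for a, b in zip(times_a, times_b)]  # [0.5, 2.0]

# Arithmetic mean says each machine is 1.25x slower than the other!
print(arithmetic_mean(norm_b_over_a), arithmetic_mean(norm_a_over_b))

# Geometric mean is consistent regardless of the reference machine
print(geometric_mean(norm_b_over_a), geometric_mean(norm_a_over_b))
```

The arithmetic mean of normalized times depends on which machine you normalize to, which is exactly the inconsistency the slide warns about; the geometric mean does not.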
        Performance Evaluation
• “For better or worse, benchmarks shape a field”
• Good products are created when you have:
   – Good benchmarks
   – Good ways to summarize performance
• Since sales are partly a function of performance
  relative to the competition, companies invest in
  improving the product as reported by the
  performance summary
• If benchmarks/summaries are inadequate, companies
  choose between improving the product for real
  programs vs. improving it to get more sales;
  sales almost always win!
• Execution time is the measure of computer
  performance!
               Simulation
• When are simulations useful?

• What are their limitations, i.e., what real-world
  phenomena do they not account for?

• The larger the simulation trace, the less
  tractable the post-processing analysis.
           Queueing Theory
• What are the distributions of arrival rates
  and values for other parameters?

• Are they realistic?

• What happens when the parameters or
  distributions are changed?
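As one concrete instance of these questions, consider an M/M/1 queue (Poisson arrivals at rate λ, exponential service at rate μ — a distributional assumption, not something the slide fixes). A tiny sketch shows how sensitive the answers are to the parameters:

```python
def mm1_avg_response_time(arrival_rate, service_rate):
    """Mean time in an M/M/1 system: W = 1 / (mu - lambda), valid for lambda < mu."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival rate must be below service rate")
    return 1.0 / (service_rate - arrival_rate)

# Response time explodes as utilization (lambda / mu) approaches 1
for util in (0.5, 0.9, 0.99):
    w = mm1_avg_response_time(arrival_rate=util, service_rate=1.0)
    print(f"utilization {util:.2f}: avg response time {w:.1f}")
```

A small change in arrival rate near saturation changes the predicted response time by an order of magnitude, which is why checking whether the assumed distributions are realistic matters so much.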
Quantitative Principles of Computer Design
• Make the Common Case Fast
  – Amdahl’s Law
• CPU Performance Equation
  – Clock cycle time
  – CPI
  – Instruction Count
• Principles of Locality
• Take advantage of Parallelism
                Amdahl's Law
Speedup due to enhancement E:
              ExTime w/o E        Performance w/ E
Speedup(E) = --------------  =   ------------------
              ExTime w/ E         Performance w/o E

Suppose that enhancement E accelerates a fraction F
 of the task by a factor S, and the remainder of the
 task is unaffected

                     Amdahl’s Law

ExTime_new = ExTime_old x ((1 - Fraction_enhanced) +
                            Fraction_enhanced / Speedup_enhanced)

Speedup_overall = ExTime_old / ExTime_new
                = 1 / ((1 - Fraction_enhanced) +
                       Fraction_enhanced / Speedup_enhanced)

              Amdahl’s Law
  • Floating point instructions improved to run 2X;
    but only 10% of actual instructions are FP

ExTimenew =

  Speedupoverall =

     CPU Performance Equation

CPU time = Seconds  = Instructions x   Cycles     x Seconds
           -------    ------------   -----------    -------
           Program      Program      Instruction     Cycle

                 Inst Count      CPI        Clock Rate
  Program            X
  Compiler           X           (X)
  Inst. Set          X            X
  Organization                    X             X
  Technology                                    X
           Cycles Per Instruction

“Average cycles per instruction”:

  CPI = (CPU Time * Clock Rate) / Instruction Count
      = Cycles / Instruction Count

  CPU time = CycleTime * SUM_i (CPI_i * I_i)

“Instruction frequency”:

  CPI = SUM_i (CPI_i * F_i)    where F_i = I_i / Instruction Count

 Invest resources where time is spent!
    Example: Calculating CPI
Base Machine (Reg / Reg)
Op            Freq Cycles CPI(i)   (% Time)
ALU           50%    1    .5       (33%)
Load          20%    2    .4       (27%)
Store         10%    2    .2       (13%)
Branch        20%    2    .4       (27%)
Total CPI                 1.5
        Typical Mix
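The table's arithmetic can be reproduced directly (a minimal sketch of the CPI frequency equation above):

```python
# (frequency, cycles) per instruction class, from the table above
mix = {"ALU": (0.50, 1), "Load": (0.20, 2), "Store": (0.10, 2), "Branch": (0.20, 2)}

# CPI = sum over classes of frequency * cycles
cpi = sum(freq * cycles for freq, cycles in mix.values())
print(f"CPI = {cpi:.1f}")  # 1.5

# Each class's share of total execution time
for op, (freq, cycles) in mix.items():
    contrib = freq * cycles
    print(f"{op:6s} CPI(i) = {contrib:.1f}  ({contrib / cpi:.0%} of time)")
```

Note that loads and branches each account for 27% of the time despite being only 20% of the instructions, because they take two cycles each: this is exactly where "invest resources where time is spent" points.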

            Chapter Summary, #1
• Designing to Last through Trends
               Capacity         Speed
    Logic   2x in 3 years     2x in 3 years
    DRAM    4x in 3 years     2x in 10 years
    Disk    4x in 3 years     2x in 10 years
•   6 yrs to graduate => 16X CPU speed, DRAM/Disk size
• Time to run the task
    – Execution time, response time, latency
• Tasks per day, hour, week, sec, ns, …
    – Throughput, bandwidth
• “X is n times faster than Y” means
       ExTime(Y)              Performance(X)
       ---------      =       --------------
       ExTime(X)              Performance(Y)
             Chapter Summary, #2

 • Amdahl’s Law:
   Speedup_overall = ExTime_old / ExTime_new
                   = 1 / ((1 - Fraction_enhanced) +
                          Fraction_enhanced / Speedup_enhanced)
 • CPI Law:
   CPU time = Seconds  = Instructions x   Cycles     x Seconds
              -------    ------------   -----------    -------
              Program      Program      Instruction     Cycle

 • Execution time is the REAL measure of computer
   performance!
 • Good products are created when you have:
     – Good benchmarks, good ways to summarize
       performance
 • Die cost goes roughly with (die area)^4
             Food for Thought
• Two companies report results on two benchmarks:
  one on a Fortran benchmark suite and the other on
  a C++ benchmark suite.
• Company A’s product outperforms Company B’s
  on the Fortran suite; the reverse holds true for the
  C++ suite. Assume the performance differences
  are similar in both cases.
• Do you have enough information to compare the
  two products? What information would you need?
           Food for Thought II
• In the CISC vs. RISC debate, a key argument of the
  RISC movement was that because of its simplicity,
  RISC would always remain ahead.
• If there were enough transistors to implement a CISC
  on chip, then those same transistors could implement
  a pipelined RISC.
• If there were enough to allow for a pipelined CISC,
  there would be enough to have an on-chip cache for
  RISC. And so on.
• After 20 years of this debate, what do you think?
• Hint: think of commercial PCs, Moore’s law, and
  some of the data in the first chapter of the book (and
  on these slides).
       Amdahl’s Law (answer)
  • Floating point instructions improved to run 2X;
    but only 10% of actual instructions are FP

ExTime_new = ExTime_old x (0.9 + 0.1/2) = 0.95 x ExTime_old

  Speedup_overall = ExTime_old / ExTime_new = 1/0.95 = 1.053
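This worked answer can be checked with a small function (a direct transcription of Amdahl's Law as stated above):

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when fraction_enhanced of the time is sped up by speedup_enhanced."""
    return 1.0 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# 10% of instructions are FP, and FP runs 2x faster
print(f"{amdahl_speedup(0.10, 2):.3f}")  # 1.053

# Diminishing returns: even an infinitely fast FP unit
# caps the overall speedup at 1/0.9, about 1.11x
print(f"{amdahl_speedup(0.10, 1e12):.3f}")
```

This is the "make the common case fast" principle in miniature: a 2x improvement to a 10% slice buys only 5.3%, no matter how impressive the slice's speedup sounds.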

