computer architecture

Shared by: gegeshandong
Categories
Tags
-
Stats
views:
3
posted:
12/29/2011
language:
pages:
34
Document Sample
scope of work template
							                                         (6.1)
  Central Processing Unit Architecture
 Architecture overview
 Machine organization
  – von Neumann
 Speeding up CPU operations
  – multiple registers
  – pipelining
  – superscalar and VLIW
 CISC vs. RISC
                                      (6.2)
  Computer Architecture
 Major components of a computer
  – Central Processing Unit (CPU)
  – memory
  – peripheral devices
 Architecture is concerned with
  – internal structures of each
  – interconnections
     » speed and width
  – relative speeds of components
 Want maximum execution speed
  – Balance is often critical issue
                                                 (6.3)
  Computer Architecture (continued)
 CPU
  – performs arithmetic and logical operations
  – synchronous operation
  – may consider instruction set architecture
    » how machine looks to a programmer
  – detailed hardware design
                                                  (6.4)
 Computer Architecture (continued)
 Memory
  – stores programs and data
  – organized as
    » bit
    » byte = 8 bits (smallest addressable location)
    » word = 4 bytes (typically; machine dependent)
  – instructions consist of operation codes and
    addresses       oprn     addr 1



                    oprn   addr 1   addr 2


                    oprn   addr 1   addr 2   addr 3
                                                         (6.5)
  Computer Architecture (continued)
 Numeric data representations
  – integer (exact representation)
     » sign-magnitude            s         magnitude

     » 2’s complement
        •negative values change 0 to 1, add 1
  – floating point (approximate representation)
     » scientific notation: 0.3481 x 106
     » inherently imprecise       s exp    significand

     » IEEE Standard 754-1985
                                                                  (6.6)
  Simple Machine Organization
 Institute for Advanced Studies machine (1947)
   – “von Neumann machine”
      » ALU performs transfers between memory and
        I/O devices
      » note two instructions per memory word
                            Arithmetic -
                            Logic Unit
     main                                             Input-
     memory                                           Output
                                                      Equipment


                            Program
                            Control Unit

        0         8               20        28         39
        op code       address     op code   address
                                                 (6.7)
  Simple Machine Organization (continued)
 ALU does arithmetic and logical comparisons
   – AC = accumulator holds results
   – MQ = memory-quotient holds second portion of
     long results
   – MBR = memory buffer register holds data while
     operation executes
                                                     (6.8)
  Simple Machine Organization (continued)
 Program control determines what computer does
  based on instruction read from memory
   – MAR = memory address register holds address of
     memory cell to be read
   – PC = program counter; address of next instruction
     to be read
   – IR = instruction register holds instruction being
     executed
   – IBR holds right half of instruction read from memory
                                                  (6.9)
  Simple Machine Organization (continued)
 Machine operates on fetch-execute cycle
 Fetch
  – PC     MAR
  – read M(MAR) into MBR
  – copy left and right instructions into IR and IBR
 Execute
  – address part of IR   MAR
  – read M(MAR) into MBR
  – execute opcode
                                      (6.10)
Simple Machine Organization (continued)
                                                      (6.11)
    Architecture Families
 Before mid-60’s, every new machine had a
  different instruction set architecture
  – programs from previous generation didn’t run on
    new machine
  – cost of replacing software became too large
 IBM System/360 created family concept
  – single instruction set architecture
  – wide range of price and performance with same
    software
 Performance improvements based on different
  detailed implementations
  – memory path width (1 byte to 8 bytes)
  – faster, more complex CPU design
  – greater I/O throughput and overlap
 “Software compatibility” now a major issue
  – partially offset by high level language (HLL) software
                        (6.12)
Architecture Families
                                                 (6.13)
  Multiple Register Machines
 Initially, machines had only a few registers
  – 2 to 8 or 16 common
  – registers more expensive than memory
 Most instructions operated between memory
  locations
  – results had to start from and end up in
    memory, so fewer instructions
     » although more complex
  – means smaller programs and (supposedly)
    faster execution
     » fewer instructions and data to move between
       memory and ALU
 But registers are much faster than memory
  – 30 times faster
                                                (6.14)
  Multiple Register Machines (continued)
 Also, many operands are reused within a
  short time
  – waste time loading operand again the next
    time it’s needed
 Depending on mix of instructions and
  operand use, having many registers may
  lead to less traffic to memory and faster
  execution
 Most modern machines use a multiple
  register architecture
  – maximum number about 512, common
    number 32 integer, 32 floating point
                                               (6.15)
  Pipelining
 One way to speed up CPU is to increase
  clock rate
  – limitations on how fast clock can run to
    complete instruction
 Another way is to execute more than one
  instruction at one time
                                              (6.16)
  Pipelining
 Pipelining breaks instruction execution down
  into several stages
  – put registers between stages to “buffer” data
    and control
  – execute one instruction
  – as first starts second stage, execute second
    instruction, etc.
  – speedup same as number of stages as long as
    pipe is full
                                         (6.17)
  Pipelining (continued)
 Consider an example with 6 stages
  – FI = fetch instruction
  – DI = decode instruction
  – CO = calculate location of operand
  – FO = fetch operand
  – EI = execute instruction
  – WO = write operand (store result)
                                                      (6.18)
  Pipelining Example




 Executes 9 instructions in 14 cycles rather than 54 for
  sequential execution
                                                  (6.19)
  Pipelining (continued)
 Hazards to pipelining
  – conditional jump
     » instruction 3 branches to instruction 15
     » pipeline must be flushed and restarted
  – later instruction needs operand being
    calculated by instruction still in pipeline
     » pipeline stalls until result ready
                                (6.20)
   Pipelining Problem Example




 Is this really a problem?
                                                  (6.21)
  Real-life Problem
 Not all instructions execute in one clock
  cycle
  – floating point takes longer than integer
  – fp divide takes longer than fp multiply which
    takes longer than fp add
  – typical values
     »   integer add/subtract       1
     »   memory reference           1
     »   fp add                     2 (make 2 stages)
     »   fp (or integer) multiply   6 (make 2 stages)
     »   fp (or integer) divide     15
 Break floating point unit into a sub-pipeline
  – execute up to 6 instructions at once
                                                        (6.22)
   Pipelining (continued)




 This is not simple to implement
   – note all 6 instructions could finish at the same
     time!!
                                              (6.23)
  More Speedup
 Pipelined machines issue one instruction
  each clock cycle
  – how to speed up CPU even more?
 Issue more than one instruction per clock
  cycle
                                                 (6.24)
  Superscalar Architectures
 Superscalar machines issue a variable
  number of instructions each clock cycle, up
  to some maximum
  – instructions must satisfy some criteria of
    independence
     » simple choice is maximum of one fp and one
       integer instruction per clock
     » need separate execution paths for each
       possible simultaneous instruction issue
  – compiled code from non-superscalar
    implementation of same architecture runs
    unchanged, but slower
                                                   (6.25)
  Superscalar Example

  0    1    2     3    4     5     6       7   8   clock




 Each instruction path may be pipelined
                                                   (6.26)
  Superscalar Problem
 Instruction-level parallelism
  – what if two successive instructions can’t be
    executed in parallel?
     » data dependencies, or two instructions of slow
       type
 Design machine to increase multiple
  execution opportunities
                                                   (6.27)
  VLIW Architectures
 Very Long Instruction Word (VLIW)
  architectures store several simple instructions
  in one long instruction fetched from memory
  – number and type are fixed
     » e.g., 2 memory reference, 2 floating point, one
       integer
  – need one functional unit for each possible
    instruction
     » 2 fp units, 1 integer unit, 2 MBRs
     » all run synchronized
  – each instruction is stored in a single word
     » requires wider memory communication paths
     » many instructions may be empty, meaning
       wasted code space
                                                           (6.28)
     VLIW Example

 Memory       Memory           FP 1         FP 2      Integer
  Ref 1        Ref 2
LD F0, 0(R1) LD F6, 8(R1)


LD F10,      LD F14,                                  SB
16(R1)       24(R1)                                   R1,R1,#4
                                                      8
LD           LD             AD F4,F0,F2 AD F8,F6,F2
F18,32(R1)   F22,40(R1)

LD                          AD           AD
F26,48(R1)                  F12,F10,F2   F16,F14,F2
                                                    (6.29)
  Instruction Level Parallelism
 Success of superscalar and VLIW machines
  depends on number of instructions that occur
  together that can be issued in parallel
  – no dependencies
  – no branches
 Compilers can help create parallelism
 Speculation techniques try to overcome
  branch problems
  – assume branch is taken
  – execute instructions but don’t let them store
    results until status of branch is known
                                            (6.30)
  CISC vs. RISC
 CISC = Complex Instruction Set Computer
 RISC = Reduced Instruction Set Computer
                                                 (6.31)
  CISC vs. RISC (continued)
 Historically, machines tend to add features
  over time
  – instruction opcodes
     » IBM 70X, 70X0 series went from 24 opcodes to
       185 in 10 years
     » same time performance increased 30 times
  – addressing modes
  – special purpose registers
 Motivations are to
  – improve efficiency, since complex instructions
    can be implemented in hardware and
    execute faster
  – make life easier for compiler writers
  – support more complex higher-level languages
                                                (6.32)
  CISC vs. RISC
 Examination of actual code indicated many
  of these features were not used
 RISC advocates proposed
  – simple, limited instruction set
  – large number of general purpose registers
     » and mostly register operations
  – optimized instruction pipeline
 Benefits should include
  – faster execution of instructions commonly
    used
  – faster design and implementation
                                                       (6.33)
  CISC vs. RISC
 Comparing some architectures

            Year Instr.   Instr.   Addr    Registers
                           Size    Modes
 IBM        1973   208    2-6       4         16
 370/168
 VAX        1978   303    2 - 57    22        16
 11/780
 I 80486    1989   235    1 - 11    11        8
 M 88000    1988   51       4        3        32
 MIPS       1991   94       4        1        32
 R4000
 IBM 6000   1990   184      4        2        32
                                               (6.34)
  CISC vs. RISC
 Which approach is right?
 Typically, RISC takes about 1/5 the design
  time
  – but CISC have adopted RISC techniques

						
Related docs
Other docs by gegeshandong
GrossIncomeExclusionsLecture5100
Views: 4  |  Downloads: 0
JEG1-Securisation bourse
Views: 6  |  Downloads: 0
High Court Judgment Template
Views: 0  |  Downloads: 0
Case study Academic Representation System
Views: 0  |  Downloads: 0
2007-09-26 Board Minutes
Views: 2  |  Downloads: 0
CREDIT-CARD
Views: 1  |  Downloads: 0
Reference Desk Schedule
Views: 4  |  Downloads: 0