CDA 5155 and 4150
Computer Architecture

         Week 1:
      30 August 2011
         Goals of the course
• Advanced coverage of computer
  architecture – general purpose processors,
  embedded processors, historically
  significant processors, design tools.
  – Instruction set architecture
  – Processor microarchitecture
  – Systems architecture
     • Memory systems
     • I/O systems

                                   2/49
                   Teaching Staff
• Professor Gary Tyson
  – PhD: University of California – Davis
  – Faculty jobs:
     •   California State University Sacramento: 1987 - 1990
     •   University of California – Davis: 1993 - 1994
     •   University of California – Riverside: 1995 - 1996
     •   University of Michigan: 1997 - 2003
     •   Florida State University: 2003 - present
  – Office Hours:
     • Monday 10:00am – 11:30am; Wednesday 1:30pm - 3:00pm
     • www.cs.fsu.edu/~tyson
                                                      3/49
        Grading in 5155 (Fall’11)
   Programming assignments
       In-order pipeline simulation (15%)
       Out-of-order pipeline simulation (15%)
   Exam (40%)
       In class, 75 minutes
   Team Project (25%)
       3 or 4 students per team
   Class Participation (5%)


                                                 4/49
             Time Management
• 3 hours/week lecture
   – This is probably the most important time
• 2 hours/week reading
   – Hennessy/Patterson:
   – Computer Architecture: A Quantitative Approach
• 3-5 hours/week exam prep
• 5+ hours/week Project (1/3 semester)
   Total: ~10-15 hours per week.


                                                5/49
               Web Resources
• Course Web Page:
    http://www.cs.fsu.edu/~tyson/cda5155


   Wikipedia:
       http://en.wikipedia.org/wiki/Microprocessor

   Wisconsin Architecture Page:
    http://arch-www.cs.wisc.edu/home


                                            7/49
              Levels of Abstraction
•   Problem/Idea (English?)
•   Algorithm (pseudo-code)
•   High-Level languages (C, Verilog)
•   Assembly instructions (OS calls)
•   Machine instructions (I/O interfaces)
•   Microarchitecture/organization (block diagrams)
•   Logic level: gates, flip-flops (schematic, HDL)
•   Circuit level: transistors, sizing (schematic, HDL)
•   Physical: VLSI layout, feature size, cabling, PC boards.
       What are the abstractions at each level?
                                                  8/49
              Levels of Abstraction
•   Problem/Idea (English?)
•   Algorithm (pseudo-code)
•   High-Level languages (C, Verilog)
•   Assembly instructions (OS calls)
•   Machine instructions (I/O interfaces)
•   Microarchitecture/organization (block diagrams)
•   Logic level: gates, flip-flops (schematic, HDL)
•   Circuit level: transistors, sizing (schematic, HDL)
•   Physical: VLSI layout, feature size, cabling, PC boards.
       At what level do I perform a square root? Recursion?
                                                 9/49
              Levels of Abstraction
•   Problem/Idea (English?)
•   Algorithm (pseudo-code)
•   High-Level languages (C, Verilog)
•   Assembly instructions (OS calls)
•   Machine instructions (I/O interfaces)
•   Microarchitecture/organization (block diagrams)
•   Logic level: gates, flip-flops (schematic, HDL)
•   Circuit level: transistors, sizing (schematic, HDL)
•   Physical: VLSI layout, feature size, cabling, PC boards.
       Who/what translates from one level to the next?
                                                10/49
            Role of Architecture
• Responsible for hardware specification:
   – Instruction set design
• Also responsible for structuring the overall
  implementation
   – Microarchitectural design.
• Interacts with everyone
   – mainly compiler and logic level designers.
• Cannot do a good job without knowledge of both
  sides

                                              11/49
      Design Issues: Performance
• Get acceptable performance out of system.
   – Scientific: floating point throughput, memory&disk
     intensive, predictable
   – Commercial: string handling, disk (databases),
     predictable
   – Multimedia: specific data types (pixels), network?
     Predictable?
   – Embedded: what do you mean by performance?
   – Workstation: Maybe all of the above, maybe not


                                             12/49
         Calculating Performance
•   Execution time is often the best metric
•   Throughput (tasks/sec) vs. latency (sec/task)
•   Benchmarks: what are the tasks?
    –   What I care about!
    –   Representative programs (SPEC, Linpack)
    –   Kernels: representative code fragments
    –   Toy programs: useful for testing end-conditions
    –   Synthetic programs: does nothing but with a
        representative instruction mix.

                                               13/49
               Design Issues: Cost
• Processor
   – Die size, packaging, heat sink? Gold connectors?
   – Support: fan, connectors, motherboard specifications, etc.
• Calculating processor cost (a rough sketch follows this slide):
   – Cost of device = (die + package + testing) / yield
   – Die cost = wafer cost / good dies per wafer
       • Good die yield is related to die size and defect density
   – Support costs: direct costs (components, labor),
                    indirect costs (sales, service, R&D)
   – Total costs amortized over number of systems sold (PC vs. NASA)



                                                             14/49
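The die-cost bullets above compress a standard calculation; here is a minimal C sketch of that arithmetic. Every number (wafer cost, die area, defect density, the yield exponent) is an assumed example, not course data.

#include <math.h>
#include <stdio.h>

/* Rough die-cost arithmetic: dies per wafer from areas, die yield from
   a simple defect-density model, die cost = wafer cost / good dies.
   All numbers below are made-up examples, not course data. */
int main(void) {
    const double PI = 3.14159265358979;
    double wafer_cost = 5000.0;   /* dollars per wafer (assumed)   */
    double wafer_diam = 300.0;    /* wafer diameter in mm          */
    double die_area   = 100.0;    /* die area in mm^2 (assumed)    */
    double defects    = 0.2;      /* defects per cm^2 (assumed)    */
    double alpha      = 4.0;      /* process-complexity parameter  */

    double r = wafer_diam / 2.0;
    double dies_per_wafer = PI * r * r / die_area
                          - PI * wafer_diam / sqrt(2.0 * die_area);

    /* a common yield approximation: (1 + defects*area/alpha)^-alpha */
    double die_yield = pow(1.0 + defects * (die_area / 100.0) / alpha, -alpha);

    double good_dies = dies_per_wafer * die_yield;
    printf("dies/wafer %.0f, yield %.2f, die cost $%.2f\n",
           dies_per_wafer, die_yield, wafer_cost / good_dies);
    return 0;
}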
             Other design issues
• Some applications care about other design issues.
• NASA deep space mission
   – Reliability: software and hardware (radiation
     hardening)
• AMD
   – Code compatibility
• ARM
   – Power


                                              15/49
     A Quantitative Approach
• Hardware systems performance is
  generally easy to quantify
  – Machine A is 10% faster than Machine B
  – Of course Machine B’s advertising will show
    the opposite conclusion


• Many software systems tend to have much
  more subjective performance evaluations.
                                     16/49
     Measuring Performance
• Total Execution Time:
  – A is 3 times faster than B for programs P1, P2

                  (1/n) × Σ (i = 1 to n) Time_i



  – Issue: Emphasizes long running programs


                                       17/49
     Measuring Performance
• Weighted Execution Time:


               Σ (i = 1 to n) Weight_i × Time_i



  – What if P1 is executed far more frequently?

                                       18/49
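Both summary statistics above are easy to compute directly; a small C sketch, where the program times and weights are invented examples rather than benchmark data:

#include <stdio.h>

/* Arithmetic mean vs. weighted mean of execution times.
   Times and weights below are made-up examples. */
int main(void) {
    double time[]   = { 10.0, 100.0 };   /* seconds for P1, P2 (assumed) */
    double weight[] = { 0.9, 0.1 };      /* how often each is run        */
    int n = 2;

    double mean = 0.0, wmean = 0.0;
    for (int i = 0; i < n; i++) {
        mean  += time[i] / n;            /* (1/n) * sum of Time_i        */
        wmean += weight[i] * time[i];    /* sum of Weight_i * Time_i     */
    }
    printf("arithmetic mean %.1f s, weighted mean %.1f s\n", mean, wmean);
    return 0;
}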
     Measuring Performance
• Normalized Execution Time:
  – Compare machine performance to a reference
    machine and report a ratio.
     • SPEC ratings measure relative performance to a
       reference machine.




                                           19/49
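SPEC divides the reference machine's time by the measured time for each benchmark and summarizes the ratios with a geometric mean; a minimal sketch, with all times invented:

#include <math.h>
#include <stdio.h>

/* Normalized execution time: reference-machine time / measured time
   per benchmark, summarized with a geometric mean (the style of
   summary SPEC uses).  All times are invented examples. */
int main(void) {
    double ref[]  = { 500.0, 1200.0, 800.0 };   /* reference machine (s)  */
    double meas[] = { 250.0,  400.0, 800.0 };   /* machine under test (s) */
    int n = 3;

    double log_sum = 0.0;
    for (int i = 0; i < n; i++)
        log_sum += log(ref[i] / meas[i]);       /* ratio > 1 means faster */

    printf("geometric mean of the ratios: %.2f\n", exp(log_sum / n));
    return 0;
}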
                Amdahl’s Law
• Rule of Thumb: Make the common case faster


       http://en.wikipedia.org/wiki/Amdahl's_law




  (Attack the longest-running part until it is no longer the longest; then repeat.)


                                                  20/49
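The usual formulation: if a fraction f of execution time is sped up by a factor s, the overall speedup is 1 / ((1 - f) + f / s). A tiny sketch with example numbers:

#include <stdio.h>

/* Amdahl's Law: overall speedup when a fraction f of execution time
   is sped up by a factor s.  The 0.80 and 10x are example numbers. */
static double amdahl(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}

int main(void) {
    printf("speedup = %.2f\n", amdahl(0.80, 10.0));   /* about 3.57x */
    return 0;
}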
       Instruction Set Design
• Software Systems: named variables;
  complex semantics.
• Hardware systems: tight timing
  requirements; small storage structures;
  simple semantics
• Instruction set: the interface between very
  different software and hardware systems

                                    21/49
           Design decisions
• How much “state” is in the
  microarchitecture?
  – Registers; Flags; IP/PC
• How is that state accessed/manipulated?
  – Operand encoding
• What commands are supported?
  – Opcode; opcode encoding

                                  22/49
  Design Challenges: or why is
   architecture still relevant?
• Clock frequency is increasing
  – This changes the number of levels of gates that
    can be completed each cycle, so old designs
    don’t work.
  – It also tends to increase the ratio of time spent
    on wires (fixed speed of light)
• Power
  – Faster chips are hotter; bigger chips are hotter

                                         23/49
      Design Challenges (cont)
• Design Complexity
   – More complex designs to fix frequency/power issues
     lead to increased development/testing costs
   – Failures (design or transient) can be difficult to
     understand (and fix)
• We seem far less willing to live with hardware
  errors (e.g. FDIV) than software errors
   – which are often dealt with through upgrades (that we
     pay for!)


                                              24/49
     Techniques for Encoding
            Operands
• Explicit operands:
  – Includes a field to specify which state data is
    referenced
  – Example: register specifier
• Implicit operands:
  – All state data can be inferred from the opcode
  – Example: function return (CISC-style)


                                         25/49
                Accumulator
• Architectures with one implicit register
  – Acts as source and/or destination
  – One other source explicit
• Example: C = A + B
  – Load A          // (Acc)umulator ← A
  – Add B           // Acc ← Acc + B
  – Store C         // C ← Acc
    Ref: “Instruction Level Distributed Processing:
          Adapting to Shifting Technology” 26/49
                         Stack
 • Architectures with implicit “stack”
     – Acts as source(s) and/or destination
     – Push and Pop operations have 1 explicit operand
 • Example: C = A + B
     –   Push A       // Stack = {A}
     –   Push B       // Stack = {A, B}
     –   Add          // Stack = {A+B}
      –   Pop C        // C ← A+B ; Stack = {}

Compact encoding; may require more instructions though
                                              27/49
                    Registers
• Most general (and common) approach
  – Small array of storage
  – Explicit operands (register file index)
• Example: C = A + B
   Register-memory         Load/store
   Load  R1, A             Load  R1, A
                           Load  R2, B
   Add   R3, R1, B         Add   R3, R1, R2
   Store R3, C             Store R3, C

                                              28/49
                    Memory
• Big array of storage
   – More complex ways of indexing than registers
      • Build addressing modes to support efficient
        translation of software abstractions
      • Uses less space in instruction than 32-bit
        immediate field
A[i]; use base (i) + displacement (A) (scaled?)
a.ptr; use base (a) + displacement (ptr)


                                             29/49
           Addressing modes
Register            Add R4, R3
Immediate           Add R4, #3
Base/Displacement   Add R4, 100(R1)
Register Indirect   Add R4, (R1)
Indexed             Add R4, (R1+R2)
Direct              Add R4, (1001)
Memory Indirect     Add R4, @(R3)
Autoincrement       Add R4, (R2)+
                                30/49
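As a rough illustration of how a few of these modes form an effective address, here is a small C sketch in which arrays stand in for the register file and byte-addressed memory; the register numbers and values are arbitrary:

#include <stdio.h>
#include <stdint.h>

/* Effective-address formation for a few addressing modes, with arrays
   standing in for the register file and memory. */
int main(void) {
    uint32_t reg[8] = {0};
    uint8_t  mem[256] = {0};

    reg[1] = 16;                  /* base register R1                */
    reg[2] = 4;                   /* index register R2               */
    mem[20] = 0x2a;               /* the value all three loads reach */

    uint32_t ea_disp    = reg[1] + 4;        /* base/displacement 4(R1) */
    uint32_t ea_indexed = reg[1] + reg[2];   /* indexed (R1+R2)         */
    uint32_t ea_direct  = 20;                /* direct (20)             */

    printf("EAs %u %u %u all load 0x%02x\n",
           (unsigned)ea_disp, (unsigned)ea_indexed, (unsigned)ea_direct,
           mem[ea_disp]);
    return 0;
}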
                  Other Memory Issues
    What is the size of each element in memory?



    Byte          0 - 255
    Half word     0 - 65,535
    Word          0 - ~4 billion



                                           31/49
             Other Memory Issues
    Non-word loads? ldb R3, (000)



    Memory:  0x000: 11   0x001: 44   0x002: 88   0x003: FF

    Result:  R3 = 00 00 00 11


                                       34/49
             Other Memory Issues
    Non-word loads? ldb R3, (003)



    Memory:  0x000: 11   0x001: 44   0x002: 88   0x003: FF

    Result:  R3 = FF FF FF FF   (sign extended)


                                         35/49
             Other Memory Issues
    Non-word loads? ldbu R3, (003)



    Memory:  0x000: 11   0x001: 44   0x002: 88   0x003: FF

    Result:  R3 = 00 00 00 FF   (zero filled)


                                            36/49
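The last two slides differ only in how the loaded byte fills the upper bits of the register; the same behavior can be modeled in a few lines of C (the memory contents match the example above):

#include <stdio.h>
#include <stdint.h>

/* ldb vs. ldbu on the byte at address 0x003 (value 0xFF), modeled in C:
   a signed load sign-extends into the register, an unsigned load
   zero-fills. */
int main(void) {
    uint8_t mem[4] = { 0x11, 0x44, 0x88, 0xFF };

    int32_t  r_ldb  = (int8_t)mem[3];    /* sign extend: 0xFFFFFFFF */
    uint32_t r_ldbu = mem[3];            /* zero fill:   0x000000FF */

    printf("ldb  R3, (003) -> 0x%08X\n", (unsigned)(uint32_t)r_ldb);
    printf("ldbu R3, (003) -> 0x%08X\n", (unsigned)r_ldbu);
    return 0;
}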
               Other Memory Issues
    Alignment? Word accesses: only addresses ending in binary 00
               Half-word accesses: only addresses ending in binary 0
               Byte accesses: any address

    Memory:  0x000: 11   0x001: 44   0x002: 88   0x003: FF

    ldw R3, (002) is illegal!
    Why is it important to be aligned?
    How can it be enforced?

                                                    37/49
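One common way to enforce the rule is to check the low address bits before each access; a minimal sketch (the helper name is made up for illustration, and ldh stands in for a half-word load):

#include <stdio.h>
#include <stdint.h>

/* Alignment check: word accesses need the low two address bits to be
   00, half-word accesses need the low bit to be 0. */
static int aligned(uint32_t addr, uint32_t size_bytes) {
    return (addr & (size_bytes - 1)) == 0;   /* size must be a power of 2 */
}

int main(void) {
    printf("ldw at 0x002: %s\n", aligned(2, 4) ? "ok" : "alignment fault");
    printf("ldh at 0x002: %s\n", aligned(2, 2) ? "ok" : "alignment fault");
    printf("ldb at 0x003: %s\n", aligned(3, 1) ? "ok" : "alignment fault");
    return 0;
}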
       Techniques for Encoding
              Operators
• Opcode is translated to control signals that
   – direct data (MUX control)
   – select operation for ALU
   – Set read/write selects for register/memory/PC
• Tradeoff between how flexible the control is and
  how compact the opcode encoding is.
   – Microcode – direct control of signals (Improv)
   – Opcode – compact representation of a set of control
     signals.
      • You can make decode easier with careful opcode selection


                                                     38/49
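A sketch of the idea that one compact opcode stands for a whole set of control signals: a small lookup table expands the opcode into mux selects, the ALU operation, and write enables. The fields and opcode numbering here are illustrative, not a specification of any real datapath.

#include <stdio.h>

/* Opcode decode as a table lookup of control signals. */
struct ctrl { int alu_op; int reg_write; int mem_read; int mem_write; };

static const struct ctrl decode_table[] = {
    /* add  */ { 0, 1, 0, 0 },
    /* nand */ { 1, 1, 0, 0 },
    /* lw   */ { 0, 1, 1, 0 },
    /* sw   */ { 0, 0, 0, 1 },
};

int main(void) {
    int opcode = 2;                       /* a load, in this toy table */
    struct ctrl c = decode_table[opcode];
    printf("aluOp=%d regWrite=%d memRead=%d memWrite=%d\n",
           c.alu_op, c.reg_write, c.mem_read, c.mem_write);
    return 0;
}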
        Handling Control Flow
•   Conditional branches (short range)
•   Unconditional branches (jumps)
•   Function calls
•   Returns
•   Traps (OS calls and exceptions)
•   Predicates (conditional retirement)


                                     39/49
      Encoding branch targets
• PC-relative addressing
  – Makes linking code easier
• Indirect addressing
  – Jumps into shared libraries, virtual functions,
    case/switch statements
• Some unusual modes to simplify target
  address calculation
  – (segment offset) or (trap number)

                                         40/49
               Condition codes
• Flags
   – Implicit: flag(s) specified in opcode (bgt)
   – Flag(s) set by earlier instructions (compare, add, etc.)
• Register
   – Uses a register; requires explicit specifier
• Comparison operation
   – Two registers with compare operation specified in
     opcode.


                                                    41/49
        Higher Level Semantics:
               Functions
•   Function call semantics
    –   Save PC + 1 (the next instruction) for the return
    –   Manage parameters
    –   Allocate space on stack
    –   Jump to function
•   Simple approach:
    – Use a jump instruction + other instructions
•   Complex approach:
    –   Build implicit operations into new “call” instruction


                                                42/49
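A toy model of the "simple approach" in C, with an array standing in for the stack and plain variables for PC and SP; all names and addresses are invented for illustration:

#include <stdio.h>
#include <stdint.h>

/* The caller itself pushes a parameter and the return address, then
   jumps; the callee returns by restoring the saved PC. */
int main(void) {
    uint32_t stack[16];
    int sp = 0;
    uint32_t pc = 100;            /* address of the call site        */

    stack[sp++] = 42;             /* manage parameters: push one arg */
    stack[sp++] = pc + 1;         /* save PC + 1 for the return      */
    pc = 200;                     /* jump to the function            */

    /* ... function body would run here ... */

    pc = stack[--sp];             /* return: restore the saved PC    */
    printf("returned to %u with arg %u still on the stack\n",
           (unsigned)pc, (unsigned)stack[0]);
    return 0;
}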
              Role of the Compiler
   • Compilers make the complexity of the ISA
      (from the programmer's point of view) less
     relevant.
      – Non-orthogonal ISAs are more challenging.
      – State allocation (register allocation) is better
        left to compiler heuristics
      – Complex Semantics lead to more global
        optimization – easier for a machine to do.
People are good at optimizing 10 lines of code.
Compilers are good at optimizing 10M lines.       43/49
                     LC processor
   • Little Computer Fall 2011
      – For programming projects
   • Instruction Set Design




opcode regA   regB                          destReg

                                    44/49
                       LC processor
                  R-type instructions

opcode    regA      regB      unused    destReg
24-22     21-19     18-16     15-3      2-0


                add:    destReg ← regA + regB

                nand:   destReg ← ~(regA & regB)


                                                 45/49
                        LC processor
                   I-type instructions

opcode    regA      regB      offsetField
24-22     21-19     18-16     15-0

          lw:    regB ← Memory[regA + offsetField]

          sw:    Memory[regA + offsetField] ← regB

          beq:   if (regA == regB) PC ← PC + 1 + offsetField
                                                   46/49
                 LC processor
             O-type instructions

opcode    unused
24-22     21-0


         noop:   do nothing

         halt:   halt the simulation


                                       47/49
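Using the bit positions from the three format slides above, field extraction for the programming projects might start out like this (a sketch, not the required simulator code):

#include <stdio.h>
#include <stdint.h>

/* Field extraction for LC instructions: opcode 24-22, regA 21-19,
   regB 18-16, destReg 2-0 (R-type) or offsetField 15-0 (I-type). */
int main(void) {
    int32_t instr = 0x810007;     /* word 0 of the example that follows */

    int opcode = (instr >> 22) & 0x7;
    int regA   = (instr >> 19) & 0x7;
    int regB   = (instr >> 16) & 0x7;
    int dest   =  instr        & 0x7;            /* only used by R-type  */
    int offset = (int16_t)(instr & 0xFFFF);      /* sign-extended I-type */

    printf("opcode=%d regA=%d regB=%d dest=%d offset=%d\n",
           opcode, regA, regB, dest, offset);
    /* prints opcode=2 regA=0 regB=1 dest=7 offset=7,
       i.e. the "lw 0 1 five" at address 0 of the example program */
    return 0;
}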
                   LC assembly example

        lw    0  1  five      load reg1 with 5 (uses symbolic address)
        lw    1  2  3         load reg2 with -1 (uses numeric address)
start   add   1  2  1         decrement reg1
        beq   0  1  2         goto end of program when reg1 == 0
        beq   0  0  start     go back to the beginning of the loop
        noop
done    halt                  end of program
five    .fill  5
neg1    .fill  -1
stAddr  .fill  start          will contain the address of start (2)




                                                            48/49
LC machine code example
  (address   0):   8454151 (hex 0x810007)
  (address   1):   9043971 (hex 0x8a0003)
  (address   2):   655361 (hex 0xa0001)
  (address   3):   16842754 (hex 0x1010002)
  (address   4):   16842749 (hex 0x100fffd)
  (address   5):   29360128 (hex 0x1c00000)
  (address   6):   25165824 (hex 0x1800000)
  (address   7):   5 (hex 0x5)
  (address   8):   -1 (hex 0xffffffff)
  (address   9):   2 (hex 0x2)

  Input for simulator:
  8454151
  9043971
  655361
  16842754
  16842749
  29360128
  25165824
  5
  -1
  2                                           49/49

				