					               Homework Example
1.21 Vendors often sell several models of a computer that have
   identical hardware with the sole exception of processor clock
   speed. The following questions explore the influence of clock
   speed on performance.

a. From the collection of computers with reported SPEC CFP2000
   benchmark results at, choose
   a set of three computer models that are identical in tested
   configurations (both hardware and software) except for clock
   speed. For each pair of models, compare the clock speed to the
   SPECint_base2000 benchmark speedup. How closely does
   benchmark performance track clock speed? Is this consistent
   with the description of the SPEC benchmarks on pages 28-30?
     A Random Selection of 3 CPUs
        From CINT2000 Results

     Model                                                           #CPUs  Base  Peak
1.   Dell Precision Workstation 650 (2.40 GHz Xeon, 1MB L3 Cache)      1    1089  1123
2.   Dell Precision Workstation 650 (2.80 GHz Xeon, 1MB L3 Cache)      1    1220  1270
3.   Dell Precision Workstation 650 (3.06 GHz Xeon, 1MB L3 Cache)      1    1242  1294

Pair     SPEC speedup (Base)   Clock speedup
M1-M2         1.12                1.167
M2-M3         1.018               1.09
M1-M3         1.14                1.275

[Chart: SPEC speedup and clock speedup for each pair of models]
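The speedup figures can be rechecked with a short Python sketch. It assumes the Base scores and clock rates listed for the three Dell models; the names `models` and `speedups` are illustrative only.

```python
# Clock rate (GHz) and SPEC CINT2000 Base score for each model, as listed above.
models = {
    "M1": (2.40, 1089),
    "M2": (2.80, 1220),
    "M3": (3.06, 1242),
}

def speedups(slow, fast):
    """Return (SPEC speedup, clock speedup) of `fast` relative to `slow`."""
    clock_slow, base_slow = models[slow]
    clock_fast, base_fast = models[fast]
    return base_fast / base_slow, clock_fast / clock_slow

for slow, fast in [("M1", "M2"), ("M2", "M3"), ("M1", "M3")]:
    spec, clock = speedups(slow, fast)
    print(f"{slow}-{fast}: SPEC speedup {spec:.3f}, clock speedup {clock:.3f}")
```

In every pair the benchmark speedup comes out smaller than the clock speedup, so performance does not scale one-for-one with clock rate.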

Chapter 2 – Instruction Set Principles
           and Examples
• Taxonomy of ISAs (Instruction Set Architectures)
• The Role of Compilers
• Example: The MIPS architecture
• Example: The Trimedia TM32 CPU
            ISA Classification
•   Based on operand location
•   Memory addressing
•   Addressing modes
•   Type and size of operands
•   Operations in the instruction set
•   Instruction flow control
•   Instruction set encoding
           Type of Internal Storage
• The most basic distinction between ISAs

  – Where do the operands (A, B) come from?
  – Where are the results (R) stored?
  – The CPU computes R = A op B

Four Basic Types
• To add two numbers: C = A + B
                 Stack
push A
push B
add
pop C
Used by early computers
Have to access operands in certain order
Lots of memory accesses
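A minimal Python sketch of the stack sequence for C = A + B, assuming an explicit add instruction between the two pushes and the pop. The function `run_stack_program` and the dictionary used as memory are hypothetical names for illustration.

```python
def run_stack_program(program, memory):
    """Interpret push/add/pop instructions against a dict acting as memory."""
    stack = []
    for instruction in program:
        op, *operand = instruction.split()
        if op == "push":                 # memory -> top of stack
            stack.append(memory[operand[0]])
        elif op == "add":                # operands are implicit: the top two entries
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == "pop":                # top of stack -> memory
            memory[operand[0]] = stack.pop()
        else:
            raise ValueError(f"unknown instruction: {instruction}")
    return memory

memory = run_stack_program(["push A", "push B", "add", "pop C"], {"A": 2, "B": 3})
print(memory["C"])
```

Note that the add names no operands at all, which is why they must be pushed in a fixed order.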
              Accumulator
• Only one register available for
  operand and result.
load A          ; acc ← A
add B           ; acc ← acc + B
store C         ; C ← acc

Lots of memory accesses (for operand, and for
loading and storing accumulator)
            Register-Memory
• One operand is in memory
load R1, A
add R3, R1, B       ; R3 ← R1 + B
store R3, C

Typically many addressing modes to specify
the memory operand.
      Register-Register (Load-Store)
• Operands and results are in registers
load R1, A
load R2, B
add R3, R1, R2
store R3, C
Only load and store instructions access memory
Limited number of address modes
• Almost all modern processors use a
  load/store architecture
  – Registers are faster to access than memory
  – Registers are more convenient for compilers to
    use than other forms (like stacks)
  – General purpose registers are more convenient
    for compilers than special purpose registers
    (like accumulators)
         Memory addressing
• How memory addresses are specified
  (address modes) and interpreted
• Interpretation:
  – Byte ordering within wide memory
  – Support for variable size memory accesses
  – Restrictions on alignment
    Little Endian vs Big Endian
• Byte Ordering – two possibilities

        7 6 5 4 3 2 1 0       Little Endian

        0 1 2 3 4 5 6 7       Big Endian

 A problem when computers using different conventions
 share data.
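The two byte orders can be seen directly with Python's built-in `int.to_bytes` and `int.from_bytes`. This is a small sketch; the 64-bit value is an arbitrary example.

```python
value = 0x0102030405060708

# Same 64-bit value, laid out in memory under each convention.
little = value.to_bytes(8, byteorder="little")   # least significant byte first
big = value.to_bytes(8, byteorder="big")         # most significant byte first

print(little.hex())
print(big.hex())

# Reading little-endian bytes with the big-endian convention scrambles the value.
misread = int.from_bytes(little, byteorder="big")
print(hex(misread))
```

The misread value is the sharing problem in miniature: the bytes are intact, but the receiver assembles them in the wrong order.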
           Alignment Restrictions
• For objects larger than one byte, some
  computers require alignment on object-sized
  boundaries.
   – Allowing misaligned accesses complicates
     hardware
[Figure: byte, word, and double-word objects placed at memory byte addresses, with one "misaligned" example]
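A minimal sketch of the alignment rule, assuming "aligned" means the address is a multiple of the object size in bytes. The helper name `is_aligned` is illustrative.

```python
def is_aligned(address, size):
    """True if an object of `size` bytes at `address` starts on a size-aligned boundary."""
    return address % size == 0

# A 4-byte word at address 4 is aligned; the same word at address 3 is not.
print(is_aligned(4, 4), is_aligned(3, 4))
```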
          Addressing Modes
• Specify data locations in memory,
  constants, and registers.
  – Use ADD instruction to illustrate
  – Result and one operand are same (A ← A + B)
  – Location of second operand depends on mode

• Also locations for control transfer
       Register, Immediate, Displacement
Register:
Add R4, R3        Regs[R4] ← Regs[R4] + Regs[R3]
Operand is in register R3

Immediate:
Add R4, #3        Regs[R4] ← Regs[R4] + 3
Operand is coded in the instruction

Displacement:
Add R4, 100(R1)   Regs[R4] ← Regs[R4] + Mem[100+Regs[R1]]
Operand is in memory at address 100 + [R1]
       Register Indirect, Indexed
Register Indirect:
Add R4, (R1)      Regs[R4] ← Regs[R4] + Mem[Regs[R1]]
Operand is in memory at address [R1]

Indexed:
Add R4, (R1+R2)   Regs[R4] ← Regs[R4] + Mem[Regs[R1]+Regs[R2]]
Operand is in memory at address [R1] + [R2]
          Direct, Memory Indirect
Direct:
Add R4, (100)     Regs[R4] ← Regs[R4] + Mem[100]
Source is in memory at address 100

Memory Indirect:
Add R4, @(R3)     Regs[R4] ← Regs[R4] + Mem[Mem[Regs[R3]]]

Source is in memory at Mem[Regs[R3]] – requires register
access and two memory accesses.
  Autoincrement, Autodecrement, Scaled
Autoincrement:
Add R4, R2+       Regs[R4] ← Regs[R4] + Mem[Regs[R2]]
                  Regs[R2] ← Regs[R2] + d
Operand is in memory at [R2]; [R2] is incremented by one data size

Autodecrement:
Add R4, R2-       Regs[R4] ← Regs[R4] + Mem[Regs[R2]]
                  Regs[R2] ← Regs[R2] - d
Operand is in memory at [R2]; [R2] is decremented by one data size

Scaled:
Add R4, 100(R2)[R3]   Regs[R4] ← Regs[R4] + Mem[100+Regs[R2]+Regs[R3]*d]
Operand is in memory at 100 + [R2] + [R3]*d
Used to index arrays
d is size of array element
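The memory addressing modes differ only in how the operand address is formed. The sketch below collects those address calculations in one hypothetical function, `effective_address`. The mode names and keyword arguments are illustrative choices, not part of any real instruction set definition.

```python
def effective_address(mode, regs, mem, base=None, index=None, disp=0, d=1):
    """Return the memory address of the operand for one addressing mode."""
    if mode == "displacement":          # 100(R1)
        return disp + regs[base]
    if mode == "register_indirect":     # (R1)
        return regs[base]
    if mode == "indexed":               # (R1+R2)
        return regs[base] + regs[index]
    if mode == "direct":                # (100)
        return disp
    if mode == "memory_indirect":       # @(R3): the address itself is read from memory
        return mem[regs[base]]
    if mode == "scaled":                # 100(R2)[R3], d = element size
        return disp + regs[base] + regs[index] * d
    raise ValueError(f"unknown mode: {mode}")

regs = {"R1": 8, "R2": 16, "R3": 4}
mem = {4: 40}
print(effective_address("scaled", regs, mem, base="R2", index="R3", disp=100, d=8))
```

Register and immediate modes are left out because they never form a memory address.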
Address Mode Use
      VAX architecture (has all the modes)
      Memory modes only
      Register modes not counted
      Register modes – count for 50%

Statistics on Displacement Mode
• Size of displacement is an issue
   – Large size requires more bits in instruction
[Figure: displacement size statistics for SPEC benchmarks]

       Immediate (Literal) Address Mode
    • Important to understand which operations require
      immediate mode the most.
~ ¼ of loads and ALU operations use immediate mode.

Floating-point instructions use immediate mode
less frequently than integer instructions.
 Immediate (Literal) Address Mode
• Size of immediate values affects instruction length

     Small values used most frequently
     Addressing Modes for Signal Processing
• DSP (Digital Signal Processing) computers operate on infinite,
  continuous streams of data
• Rely on circular buffers and pointers
• Modulo or circular address mode used (like auto-increment, but with
  start and end registers stored so that address resets at end of buffer)
• Specific address modes (bit-reverse addressing) included for FFT
  (Fast Fourier Transform) algorithm
• Library routines used so that software takes advantage of these modes.
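Both DSP address modes can be sketched in a few lines of Python. The helpers `circular_next` and `bit_reverse` are hypothetical names. The sketch assumes the circular buffer is described by inclusive start and end addresses.

```python
def circular_next(addr, start, end, step=1):
    """Advance addr by step, wrapping back to start after passing end (inclusive)."""
    length = end - start + 1
    return start + (addr - start + step) % length

def bit_reverse(index, bits):
    """Reverse the low `bits` bits of index, as used to reorder FFT data."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

print(circular_next(7, start=4, end=7))
print([bit_reverse(i, 3) for i in range(8)])
```

Hardware does this as part of the address calculation, so the wraparound and reordering cost no extra instructions.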
    Conclusions about Memory Addressing
• Most important non-register addressing modes
  – displacement, immediate, register indirect
• Size of displacement field should be at least
  12-16 bits.
• Size of immediate field should be at least
  8-16 bits.
    Type and Size of Operands
• How is type (integer, single-precision, etc)
  of operand specified?
  – Normally encoded in the opcode of instruction
• What size are operands and how are they
  represented?
  – Standards used:
     • Integers: 2’s complement
     • Characters: ASCII (8 bits) or Unicode (16 bits)
     • Floating point: IEEE 754 standard
            SPEC Benchmarks
• Operands used:
  –   Byte or character (8)
  –   Half-word (short integer) (16)
  –   Word (integer) (32)
  –   Double-word (long integer) (64)
  –   Floating point
      Frequency of Data Access
• Important for tradeoffs such as datapath width and
  support/no support for misaligned data.

For benchmark programs on a CPU with 64-bit addresses.
All integer accesses of 64 bits are for addresses, so on a
32-bit CPU there would be few 64-bit accesses.
 Operands for Media Processing
• Graphics applications have data in vertices
  and in pixel formats.
  – Vertex: X, Y, Z, W coordinates – usually 32-bit
    floating-point values.
  – Pixel: 32 bits (Red (8), Green (8), Blue (8), Alpha (8))
             Operands for DSP
• Fixed-point – binary fractions between -1
  and 1, exponent is stored separately
16-bit fixed-point: (can be other sizes)
            _._ _ _ _ _ _ _ _ _ _ _ _ _ _ _

 Sign bit

 Example: 1001 1000 0000 0001 = -(1/8 + 1/16 + 1/2^15)
                              = -0.187530517578125
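The worked example can be checked with a short decoder. This sketch assumes the format implied by the example: one sign bit followed by a 15-bit fraction magnitude. The name `decode_fixed16` is illustrative.

```python
def decode_fixed16(bits):
    """Decode a 16-bit fixed-point value: sign bit, then a 15-bit binary fraction."""
    sign = -1 if bits & 0x8000 else 1
    magnitude = bits & 0x7FFF          # the 15 fraction bits
    return sign * magnitude / 2**15

print(decode_fixed16(0b1001_1000_0000_0001))
```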
        DSP Example Data Sizes
Generation     Year       Example      Data      Accumulator
                           DSP         width       Width
    1         1982      TI TMS32010    16 bits     32 bits
    2         1987      Motorola       24 bits     56 bits

    3         1995      Motorola       24 bits     56 bits

    4         1998      TI             16 bits     40 bits

   Accumulators wider to avoid round-off error
   Conclusions about Operands
• 32-bit architectures should support
  – 8, 16, 32-bit integers
  – 32-bit and 64-bit floating point data
• 64-bit architectures should support 64-bit
  integers as well
• DSPs need wider accumulating registers
  than their memory width for accurate
  fixed-point processing
          Instruction Set Operations
• Most instruction sets implement these:
   –   Arithmetic and Logical
   –   Data transfer
   –   Control
   –   System – operating system calls, virtual memory management
   –   Floating point
   –   Decimal – operations on data encoded in BCD format
   –   String – string moves, compares, searches
   –   Graphics – pixel and vertex operations,
        Example Usage Statistics
Simple instructions tend to be used most often.

Example: 80x86 top 10 instructions:

Rank              Instruction                % total executed
1                 Load                                22%
2                 Conditional branch                  20%
3                 Compare                             16%
4                 Store                               12%
5                 Add                                 8%
6                 And                                 6%
7                 Sub                                 5%
8                 Move register-register              4%
9                 Call                                1%
10                Return                              1%
Operations for Media and Signal Processing
• Data for multi-media operations is often
  narrower than 32 or 64-bit cpu word.
• Useful if the 32 or 64-bit ALU can do
  several narrow operations in parallel.
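For example, a 32-bit ALU can perform two 16-bit operations at once.

[Figure: ALU split into two 16-bit lanes]
An example of SIMD (single instruction, multiple data) or vector instructions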
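A minimal sketch of one such partitioned operation: a 32-bit add treated as two independent 16-bit adds. The name `packed_add16` is hypothetical. The point is that a carry out of the low half must not spill into the high half.

```python
def packed_add16(x, y):
    """Add two 32-bit words as two independent 16-bit lanes (each lane wraps on overflow)."""
    low = ((x & 0xFFFF) + (y & 0xFFFF)) & 0xFFFF
    high = (((x >> 16) & 0xFFFF) + ((y >> 16) & 0xFFFF)) & 0xFFFF
    return (high << 16) | low

x, y = 0x0001FFFF, 0x00010001
print(hex(packed_add16(x, y)))          # lanes added separately
print(hex((x + y) & 0xFFFFFFFF))        # ordinary 32-bit add, carry crosses the lanes
```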
     Digital Signal Processing
• Often operate on real-time streams of data
• Exceptions from arithmetic overflow not
  tolerated (risks losing data)
• Use “saturating arithmetic”
  – For numbers too large, use largest number
  – For numbers too small, use smallest number
• Instructions for rounding large numbers into
  smaller ones.
• Special instructions for making dot products
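Saturating arithmetic can be contrasted with ordinary wraparound in a short sketch. It assumes signed 16-bit values; the function names are illustrative.

```python
INT16_MIN, INT16_MAX = -32768, 32767

def saturating_add16(a, b):
    """Clamp the sum to the signed 16-bit range instead of overflowing."""
    return max(INT16_MIN, min(INT16_MAX, a + b))

def wrapping_add16(a, b):
    """Ordinary two's complement addition, keeping only 16 bits."""
    total = (a + b) & 0xFFFF
    return total - 0x10000 if total & 0x8000 else total

print(saturating_add16(30000, 10000), wrapping_add16(30000, 10000))
```

With wraparound, two large positive samples sum to a negative value. Saturation gives the nearest representable value instead, which is a far smaller error in an audio or video stream.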
      Digital Signal Processing
• Many DSP programs require the accumulation of
  several product terms:

    F = a0x0 + a1x1 + a2x2 + …

• Use multiply-accumulate (MAC) instructions to do
  a multiply and add in one instruction
• Typical instruction mixes for DSPs show that these
  instructions are in the majority.
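The accumulation F = a0x0 + a1x1 + a2x2 + … maps onto one multiply-accumulate per term, as this minimal sketch shows. The name `mac_dot` is illustrative, and Python integers stand in for the wide accumulator.

```python
def mac_dot(coefficients, samples):
    """Sum of products, written as one multiply-accumulate step per term."""
    accumulator = 0
    for a, x in zip(coefficients, samples):
        accumulator += a * x      # the work a single MAC instruction performs
    return accumulator

print(mac_dot([1, 2, 3], [4, 5, 6]))
```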
Conclusions about Instruction Set Operations
• Simple instructions are important and are used most often
• Media and Signal Processing have special
  instruction needs, but still mostly use simple
  instructions
              Flow Control
• Terminology
  – “Transfer” instructions (old term)
  – “Branch” – will be used for conditional flow
  – “Jump” – will be used for unconditional
    program flow control
  – “Procedure calls”
  – “Procedure returns”
    Addressing Modes for Flow Control
• Flow control instructions must specify destination
   – PC-relative - Specify displacement from program counter
      • Requires fewer bits than other modes
      • Practical since branch target is often nearby
      • Allows code to be independent of its location in memory
   – Other modes must be used for returns and indirect jumps
        Register Indirect Jumps
• Used when target address is not known at
  compile time.
• Some common uses
  –   Case or switch statement
  –   Virtual functions or methods in OO languages
  –   Higher-order functions or function pointers
  –   Dynamically shared libraries
             Branching Distances
• For PC-relative mode, need to know how large an offset is needed.

 Branch distances (in # of instructions) for a load-store
 architecture (Alpha) for the SPEC CPU2000 benchmarks
 Looks like we need at least 8 bits to cover the most frequent branch distances.
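To see what a given offset width buys, the sketch below computes the range of a signed offset field. It assumes two's complement offsets counted in instructions; `signed_offset_range` is a hypothetical helper.

```python
def signed_offset_range(bits):
    """Smallest and largest values representable in a two's complement field."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1

for bits in (8, 16):
    low, high = signed_offset_range(bits)
    print(f"{bits}-bit offset: {low} to {high} instructions")
```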
          Conditional Branches
• Architectural issue: How to specify branch condition
   – Condition code – state bits in the cpu to store results of
     ALU operations (Zero, Positive, Overflow, etc).
     Branches are based on value of condition code bits.
   – Condition register – tests arbitrary register with the
     result of a comparison (register = 0?, etc)
   – Compare and branch – compare is included in branch
   Branch Condition Specification

Method               Advantages                    Disadvantages
Condition code (CC)  Sometimes condition is set    CC is extra state. Can constrain
                     for free.                     order of instructions.
Condition register   Simple.                       Uses up a register.
Compare and branch   One instruction rather than   Complicated instructions are
                     two.                          harder to pipeline.
    Frequency of Compare Types
Statistics for

< and  dominate
                Procedure Calls
Procedure calls require both control transfer and
  state storage.
Storage options:
   - Caller saving – calling procedure saves state
   - Callee saving – called procedure saves the registers it
     wants to use

Most modern systems use a combination of both.
    Conclusions: Control Flow
• PC-relative branch displacements should be
  at least 8 bits.
• The frequent cases: conditional branches
  using <, , and =, should be fast.
      Instruction Set Encoding
• Encoding has two major effects:
  – Size of compiled program
  – Implementation of processor
• “Opcode” field typically used to encode
  instruction operation (add, move, etc)
• How to encode address modes?
  – large # of address modes may use separate
    address specifier for each operand
  – small # of address modes – encoded in opcode
          Competing Forces
• Have lots of registers and address modes
• Have small register and address mode fields
  to keep instructions small
• Instructions should be easy to decode in
  pipelined architecture (they should all be
  the same size, and multiples of bytes)
                         Some Examples
  [Operation and # of operands] [Address specifier 1] [Address field 1] ... [Address specifier n] [Address field n]
1. Variable length instructions (VAX, Intel 80x86)

  [Operation] [Address field 1] [Address field 2] [Address field 3]
2. Fixed length instructions (Alpha, ARM, MIPS, PowerPC, etc)

  [Operation] [Address specifier] [Address field 1]
  [Operation] [Address specifier 1] [Address specifier 2] [Address field]
  [Operation] [Address specifier] [Address field 1] [Address field 2]
3. Hybrid (IBM 360/70, MIPS16, TI TMS320C54x, etc)
         Chapter 2 Continued
Taxonomy of ISAs (Instruction Set Architectures)
The Role of Compilers
Example: The MIPS architecture
Example: The Trimedia TM32 CPU
        The Role of Compilers
• For desktop and server applications, programs
  are written in high-level languages and compiled.
• ISA is a compiler target language.
• Compilers have a large impact on performance.
• Difficult to isolate the effect of compiler
  technology on performance from effects of
• ISA affects the quality of the code that can be
  generated for a computer and the complexity of
  the compiler.
    The Anatomy of Compilers



         Compilers and ISAs
• First goal is correctness
• Second goal is speed of resulting code
• Other goals:
  – Speed of compilation
  – Debugging support
  – Interoperability with other languages
• First goal (correctness) is complex, and
  limits the complexity of optimizations.
       Types of Optimizations
• High-level Optimization – on source code, fed to
  lower level optimizations
• Local optimizations – optimize code only within a
  straight-line code fragment
• Global optimizations – extend local optimizations
  across branches and apply transformations to
  optimize loops
• Register allocation – associate registers with operands
• Processor-dependent optimizations – take
  advantage of specific architecture features
            Register Allocation
• One of the most important optimizations
• Based on graph coloring techniques
   – Construct graph of possible allocations to a register
   – Use graph to allocate registers efficiently
   – Goal is to achieve 100% register allocation for all
     active variables.
   – Graph coloring works best when there are at least 16
     general-purpose registers available for integers and
     more for floating-point variables.
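A rough sketch of the idea, using a simple greedy coloring rather than the algorithm of any particular compiler. Variables that are live at the same time are joined by an edge and must receive different registers. The function name, the degree-first ordering, and the use of `None` to mark a spilled variable are all assumptions made for illustration.

```python
def color_registers(interference, num_regs):
    """Greedy coloring: map each variable to a register index, or None if it must spill."""
    assignment = {}
    # Visit variables with the most conflicts first (one common heuristic among several).
    order = sorted(interference, key=lambda v: len(interference[v]), reverse=True)
    for var in order:
        taken = {assignment[n] for n in interference[var]
                 if assignment.get(n) is not None}
        free = [r for r in range(num_regs) if r not in taken]
        assignment[var] = free[0] if free else None
    return assignment

# a conflicts with b, c, d; b and c also conflict with each other.
graph = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
print(color_registers(graph, 3))
print(color_registers(graph, 2))
```

With three registers every variable gets one. With two, one of the mutually conflicting variables has to be kept in memory, which is why more registers make coloring more successful.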
 Impact of Optimizations on Performance

Change in instruction count for two programs from SPEC2000 as compiler
optimization levels vary.
Level 0 means no optimization.
• Important to use optimized code when considering architecture
      ISA Guidelines for Efficient
          Compiler Interface
• Provide regularity through orthogonality
   – Operations, data types, addressing modes should be independent
• Provide primitives, not solutions
   – Do not attempt to support particular high level language
     features directly
• Provide instructions that bind quantities known at
  compile time to constants
   – As much as possible, compilers should avoid producing
     code that interprets values at run time that are known at
     compile time.
         Example: MIPS ISA

• 64-bit load-store architecture

• Full instruction set explained in Appendix C
  (on text web page)
• 32 64-bit general purpose registers (GPRs)
  – R0, R1, R2…. R31
  – R0 is always 0
• 32 64-bit floating-point registers (FPRs)
  –   F0, F1, F2….F31
  –   32 single-precision values or
  –   32 double-precision values
  –   Both single and double-precision instructions
                Data Types
• 8, 16, 32, 64-bit data types for integers
• 32-bit single precision, 64-bit double
  precision for floating point
• 8, 16, 32-bit data loaded into GPRs with
  zeros or the sign bit filling in remaining bits
  to make 64.
• 64-bit integer operations
 Address Modes for Data Transfers
• Immediate mode (16-bit values)
  – DADDIU R1,R2,#3      R1 ← R2+3
• Displacement mode (16-bit displacement)
  – LD R1, 200(R2)       R1 ← Mem[200+Regs[R2]]

• Other address modes can be implemented as
  special cases of these two
• Examples:
  – Register indirect: put 0 in displacement field
  – Absolute: use R0 for base register
     MIPS Instruction Format
• All instructions are 32-bits
• 6-bit opcode encodes both operation and
  address mode
• 16-bit fields for displacement addressing,
  immediate constants, or PC-relative branch
MIPS Instruction Format Overview
                    I-type Instructions

LD R1, 30(R2)   Load Double Word          Regs[R1] ←₆₄ Mem[30+Regs[R2]]
opcode = load double word, displacement   rs = R2, rt = R1, Immediate = 30

S.S F0, 40(R3)  Store FP Single           Mem[40+Regs[R3]] ←₃₂ Regs[F0]₀..₃₁
opcode = store FP single, displacement    rs = R3, rt = F0, Immediate = 40

BEQZ R4, name   Branch equal zero         if (Regs[R4] == 0) PC ← name;
                                          ((PC+4) - 2^17) ≤ name < ((PC+4) + 2^17)
opcode = branch equal zero, immediate     rs = R4, Immediate = name
                      R-type Instructions

DADD R1, R2, R3    Add double          Regs[R1] ← Regs[R2] + Regs[R3]
Opcode = R-type register mode          rd = R1, rs = R2, rt = R3
                                       shamt = 0, funct = ADD

DSLT R1, R2, R3    Set less than       if (Regs[R2] < Regs[R3])
                                       Regs[R1] ← 1 else Regs[R1] ← 0
Opcode = R-type register mode          rd = R1, rs = R2, rt = R3
                                       shamt = 0, funct = SLT

SLL R1, R2, 10     Shift left logical  Regs[R1] ← Regs[R2] << 10
Opcode = R-type register mode          rd = R1, rs = 0, rt = R2,
                                       shamt = 10, funct = SLL
                 J-type Instructions

J name          Jump            PC₃₆..₆₃ ← name
opcode = jump                   Offset = name

JAL name        Jump and link   PC₃₆..₆₃ ← name, Regs[R31] ← PC+4
                                ((PC+4) - 2^27) ≤ name < ((PC+4) + 2^27)
opcode = jump and link          Offset = name
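The three formats can be sketched as bit-packing functions. The field widths follow the standard 32-bit MIPS layout (6-bit opcode, 5-bit register fields, 5-bit shamt, 6-bit funct, 16-bit immediate, 26-bit jump offset). The numeric opcode and funct values in the example (0x37 for LD, 0x2C for DADD) are taken from the MIPS64 encoding tables as an assumption; they do not appear in these slides.

```python
def encode_r_type(opcode, rs, rt, rd, shamt, funct):
    """opcode(6) | rs(5) | rt(5) | rd(5) | shamt(5) | funct(6)"""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

def encode_i_type(opcode, rs, rt, immediate):
    """opcode(6) | rs(5) | rt(5) | immediate(16); negative immediates are masked."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (immediate & 0xFFFF)

def encode_j_type(opcode, offset):
    """opcode(6) | offset(26)"""
    return (opcode << 26) | (offset & 0x3FFFFFF)

# LD R1, 30(R2): base register in rs, destination in rt (opcode value assumed).
print(hex(encode_i_type(0x37, rs=2, rt=1, immediate=30)))
# DADD R1, R2, R3: opcode 0 selects R-type, funct picks the operation (value assumed).
print(hex(encode_r_type(0, rs=2, rt=3, rd=1, shamt=0, funct=0x2C)))
```

Because every field sits at a fixed position, a decoder can read the register numbers before it knows which instruction it is looking at.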
            MIPS Instructions
•   Loads and Stores
•   ALU operations
•   Branches and Jumps
•   Floating point operations
          Loads and Stores
• Any of the GPRs or FPRs may be loaded or
  stored except R0, which is permanently 0
• Single-precision floating point numbers
  occupy ½ of a floating point register.
• Conversions between single and double
  precision must be done explicitly.
              ALU Operations

•   All ALU instructions are register-register
•   Arithmetic instructions: add, subtract
•   Logic instructions: AND, OR, XOR
•   Shift instructions
•   Immediate forms for all ALU instructions
•   16-bit sign-extended immediate field
               ALU Operations
• Note: ALU instruction “Load Upper Immediate”
   – LUI R1,#23              Regs[R1] ← 0^32 ## 23 ## 0^16
   – Allows 32-bit constant to be loaded (LUI and DADDIU)

• Note: ALU compare instructions: “set less than” and
  “set equal”, etc, have both register and immediate forms
   – SLT R1, R2, R3   if Regs[R2] < Regs[R3], Regs[R1] ← 1 else Regs[R1] ← 0

• Note: Register-register moves can be done with add:
   – DADDU R1, R2, R0         Regs[R1] ← Regs[R2]
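A sketch of how LUI and DADDIU combine to build a 32-bit constant. It models LUI as described on the slide (upper 32 bits zero) and DADDIU as adding a sign-extended 16-bit immediate. Because of that sign extension, the upper half must be bumped by one when the lower half has its top bit set. The helper names are illustrative, and the sketch is only meant for constants below 0x7FFF8000.

```python
MASK64 = (1 << 64) - 1

def lui(imm16):
    """Place the 16-bit immediate in bits 16..31; all other bits are zero."""
    return (imm16 & 0xFFFF) << 16

def daddiu(reg, imm16):
    """Add a sign-extended 16-bit immediate, wrapping to 64 bits."""
    imm16 &= 0xFFFF
    signed = imm16 - 0x10000 if imm16 & 0x8000 else imm16
    return (reg + signed) & MASK64

def split_constant(value):
    """Return the (upper, lower) immediates for LUI and DADDIU."""
    lower = value & 0xFFFF
    upper = ((value + 0x8000) >> 16) & 0xFFFF   # +0x8000 offsets the sign extension
    return upper, lower

upper, lower = split_constant(0x12348000)
r1 = daddiu(lui(upper), lower)
print(hex(upper), hex(lower), hex(r1))
```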
        Branches and Jumps
• Plain jump and “jump and link” for saving
  return location
• Register version and PC offset version
• Examples:
  – J name      PC₃₆..₆₃ ← name
  – JALR R2     Regs[R31] ← PC+4; PC ← Regs[R2];
             Branches and Jumps
• Branches are conditional
    – Register test (zero, non-zero, negative, etc)
    – Two register compare
• Branch target address specified by 16-bit signed offset
    – Offset is shifted left two places (multiply by 4 because
      instructions are 4 bytes) and then added to PC.
BNE R3, R4, name   Branch not equal    if (Regs[R3] != Regs[R4]) PC ← name;
                                       ((PC+4) - 2^17) ≤ name < ((PC+4) + 2^17)
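The target calculation can be sketched as follows, assuming the offset is taken relative to PC+4 as the range expression indicates. The name `branch_target` is hypothetical.

```python
def branch_target(pc, offset16):
    """Sign-extend the 16-bit offset, scale by 4 bytes, and add to PC + 4."""
    offset16 &= 0xFFFF
    signed = offset16 - 0x10000 if offset16 & 0x8000 else offset16
    return pc + 4 + signed * 4      # multiplying by 4 is the left shift by two places

print(hex(branch_target(0x1000, 3)))        # forward branch
print(hex(branch_target(0x1000, 0xFFFF)))   # offset -1: back to the branch itself
```

The extreme offsets reach from (PC+4) - 2^17 up to (PC+4) + 2^17 - 4, which is the range stated for `name`.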
             Conditional Move
• Branches make pipelining difficult, so a
  common instruction is made conditional

MOVZ R1, R2, R3   Conditional move if zero   if (Regs[R3]==0) Regs[R1] ← Regs[R2]
     Floating Point Operations
• Single and double-precision operations
  indicated by .S and .D
• MOV.S and MOV.D copy registers of the
  same type
• Special instructions for moving data between
  FPRs and GPRs
• Conversion instructions convert integers to
  floating-point and vice versa
     Floating Point Operations
• Floating point arithmetic operations:
   – add, subtract, multiply, divide
• Floating point compares – compare FPRs
  and set bit in floating-point status register
• Paired-single operations available that do
  two 32-bit operations in the 64-bit ALU.
• Multiply-add instructions available for both
  integer and floating-point. (accumulator is
  NOT extra-wide however)
Most Popular MIPS Instructions

                      Integer benchmarks
Most Popular MIPS Instructions

                Floating-point benchmarks
 Media Processor: Trimedia TM32
• Dedicated to multimedia processing
  –   data communication
  –   audio coding
  –   video coding
  –   video processing
  –   graphics
• Operate on narrower data than PCs
• Operate on data streams
• Typically found in set-top boxes
       Unique Features of TM32
• Lots of registers: 128 (32-bit)
• Registers can be either integer or floating-point
• SIMD instructions available
• Both 2’s complement and saturating arithmetic
• Programmer can specify five independent
  instructions to be issued at the same time!
    – nops placed in slots if 5 are not available
    – VLIW (very long instruction word) coding technique
• Instructions compacted in memory and
  expanded when loaded into cache
• About twice as many instructions as MIPS
           TM32 Instructions
•   Load-store (33 operations)
•   Byte shuffles (11)
•   Bit shifts (10)
•   Multiplies and multimedia (23)
•   Integer arithmetic (62)
•   Floating-point (42)
•   Special operations (20)
•   Branch (6)
TM32 Instruction Mix
 EEMBC Benchmark
 Operation                      Out of box Modified C
 add word                            26.5%       20.5%
 load byte                           10.4%        1.0%
 subtract word                       10.1%        1.1%
 shift left arithmetic                 7.8%       0.2%
 store byte                            7.4%       1.5%
 multiply word                         5.5%       0.4%
 shift right arithmetic                3.6%       0.7%
 and word                              3.6%       6.8%
 load word                             3.5%       7.2%
 load immediate                        3.1%       1.6%
 set greater than, equal               2.9%       1.3%
 store word                            2.0%       5.3%
 jump                                  1.8%       0.8%
 conditional branch                    1.3%       1.0%
 pack/merge bytes                      2.6%      16.8%
 SIMD sum of half-word products        0.0%      10.1%
 SIMD sum of byte products             0.0%       7.7%
 pack/merge half words                 0.0%       6.5%
 SIMD subtract half word               0.0%       2.9%
 SIMD maximum byte                     0.0%       1.9%
 TM32 CPU code size (bytes)        243,968    387,328
 MIPS code size (bytes)            120,729
  Instruction Set Design: Pitfalls
• Designing a “high-level” instruction set feature
  specifically oriented to supporting a high-level
  language structure
   – Often makes instructions too specialized to be useful
• Innovating at the ISA to reduce code size without
  accounting for the compiler
   – Compilers can make much more impact
   – Use optimized code when considering changes
 Instruction Set Design: Fallacies
• There is such a thing as a typical program
   – programs vary widely in how they use instruction sets
• An architecture with flaws cannot be successful
   – 80x86 case in point
• A flawless architecture can be designed
   – All designs contain tradeoffs
   – Technologies change, making previously good decisions bad
         ISA Conclusions:
        Trends in the 1990s
• Address size 32-bit → 64-bit
• Addition of conditional execution
• Optimization of cache performance via prefetch
• Support for multimedia
• Faster floating-point operations
          ISA Conclusions:
         Trends in the 2000s
• Long instruction words
• Increased conditional execution
• Blending of DSP and general purpose architectures
• 80x86 emulation