					       EE (CE) 6304 Computer Architecture

                       Introduction




          Prof. Vojin G. Oklobdzija
    Department of Electrical Engineering
        University of Texas at Dallas
 Professor Emeritus, University of California


*acknowledgment to Prof. Rama Sangiredy for this set of lectures
         and Prof. D. Patterson for his original lectures
                       Outline

•   Computer Architecture at a Crossroads
•   Why Take 6304?
•   Fundamental Abstractions & Concepts
•   Administrivia
•   Understanding & Evaluating Performance
•   Computer Architecture v. Instruction Set Arch.
•   What Computer Architecture brings to table
•   Summary
     Crossroads: Conventional Wisdom in Comp. Arch

• Old Conventional Wisdom: Power is free, Transistors expensive
• New Conventional Wisdom: “Power wall” Power expensive, Xtors free
  (Can put more on chip than can afford to turn on)
• Old CW: Sufficiently increasing Instruction Level Parallelism via
  compilers, innovation (Superscalar, Out-of-order, speculation, VLIW, …)
• New CW: “ILP wall” law of diminishing returns on more HW for ILP
• Old CW: Multiplies are slow, Memory access is fast
• New CW: “Memory wall” Memory slow, multiplies fast
  (200 clock cycles to DRAM memory, 4 clocks for multiply)
• Old CW: Uniprocessor performance 2X / 1.5 yrs
• New CW: Power Wall + ILP Wall + Memory Wall = Brick Wall
   – Uniprocessor performance now 2X / 5(?) yrs
   Sea change in chip design: multiple “cores”
      (2X processors per chip / ~ 2 years)
       » More, simpler processors are more power efficient
                        Crossroads: Uniprocessor Performance

[Figure: uniprocessor performance relative to the VAX-11/780, 1978-2006, on a log scale.
 From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th
 edition, October 2006.]
                                                • VAX       : 25%/year 1978 to 1986
                                                • RISC + x86: 52%/year 1986 to 2002
                                                • RISC + x86: ??%/year 2002 to present
                     Sea Change in Chip Design
• Intel 4004 (1971): 4-bit processor, 2,312 transistors (correction: 2,108),
  0.4 MHz, 10 micron PMOS, 11 mm2 chip
• RISC II (1983): 32-bit, 5-stage pipeline, 40,760 transistors, 3 MHz,
  3 micron NMOS, 60 mm2 chip

• 125 mm2 chip, 0.065 micron CMOS = 2,312 RISC II + FPU + Icache + Dcache
   – RISC II shrinks to ~0.02 mm2 at 65 nm
   – Caches via DRAM or 1-transistor SRAM (www.t-ram.com)?
   – Proximity Communication via capacitive coupling at > 1 TB/s?

   • Processor is the new transistor?
        Déjà vu all over again?

• Multiprocessors imminent in 1970s, ‘80s, ‘90s, …
• “… today’s processors … are nearing an impasse as
  technologies approach the speed of light..”
     David Mitchell, The Transputer: The Time Is Now (1989)
• Transputer was premature
   Custom multiprocessors strove to lead uniprocessors
   Procrastination rewarded: 2X seq. perf. / 1.5 years
• “We are dedicating all of our future product development
  to multicore designs. … This is a sea change in computing”
                           Paul Otellini, President, Intel (2004)
• Difference is that all microprocessor companies have switched to
  multiprocessors (AMD, Intel, IBM, Sun; all new Apple machines use 2
  CPUs)
   Procrastination penalized: 2X sequential perf. / 5 yrs
   Biggest programming challenge: 1 to 2 CPUs
                  Problems with Sea Change

•       Algorithms, Programming Languages, Compilers,
        Operating Systems, Architectures, Libraries, …
        not ready to supply Thread Level Parallelism or
        Data Level Parallelism for 1000 CPUs / chip
•       Architectures not ready for 1000 CPUs / chip
    •     Unlike Instruction Level Parallelism, this cannot be solved by
          computer architects and compiler writers alone, but it also cannot
          be solved without the participation of computer architects
•       This course (and 4th Edition of textbook Computer
        Architecture: A Quantitative Approach) explores
        shift from Instruction Level Parallelism to Thread
        Level Parallelism / Data Level Parallelism
                Why take 6304?
• To design the next great instruction
  set?...well...
   – instruction set architecture has largely converged
   – especially in the desktop / server / laptop space
   – dictated by powerful market forces
• Tremendous organizational innovation relative to
  established ISA abstractions
• Many new instruction sets or equivalents
   – embedded space, controllers, specialized devices, ...
• Design, analysis, implementation concepts vital to
  all aspects of EE & CS
   – systems, PL, theory, circuit design, VLSI, comm.
• Equip you with an intellectual toolbox for dealing
  with a host of systems design challenges
              Example Hot Developments
• Manipulating the instruction set abstraction
   –   Itanium: translate IA-64 -> micro-op sequences
   –   Transmeta: continuous dynamic translation of IA-32
   –   Tensilica: synthesize the ISA from the application
   –   reconfigurable HW
• Virtualization
   – vmware: emulate full virtual machine
   – JIT: compile to abstract virtual machine, dynamically compile to
     host
• Parallelism
   – wide issue, dynamic instruction scheduling, EPIC
   – multithreading (SMT) or Hyperthreading
   – chip multiprocessors (multiple-core processors)
• Communication
   – network processors, network interfaces
• Exotic explorations
   – nanotechnology, quantum computing
     Forces on Computer Architecture

[Diagram: Technology, Programming Languages, Applications, Operating Systems,
 and History all exert forces on Computer Architecture (A = F / M).]
         Moore’s Law: 2X transistors / “year”




•   “Cramming More Components onto Integrated Circuits”
     –   Gordon Moore, Electronics, 1965
•   The number of transistors on a cost-effective integrated circuit doubles every
    N months (12 ≤ N ≤ 24)
                                    A take on Moore’s Law

[Figure: transistors per microprocessor, 1970-2005, on a log scale from 1,000 to
 100,000,000, with points for the i4004, i8008, i8080, i8086, i80286, i80386,
 R2000, R3000, R10000, and Pentium; the eras are labeled bit-level parallelism,
 instruction-level parallelism, and thread-level parallelism (?).]
A take on Moore’s Law

[Figure adapted from http://www.intel.com/research/silicon]
                Technology Trends

•   Clock Rate:           ~30% per year
•   Transistor Density:   ~35%
•   Chip Area:            ~15%
•   Transistors per chip: ~55%
•   Total Performance Capability: ~100%
•   by the time you graduate...
     – 3x clock rate (5-6 GHz)
     – 10x transistor count (more than a billion transistors)


• plus 16x DRAM density, 32x disk density
                        Performance Trends

[Figure: log-scale performance vs. year, 1965-1995, for supercomputers, mainframes,
 minicomputers, and microprocessors; the microprocessor curve climbs steepest and
 overtakes the other classes.]
What is “Computer Architecture”?
               Application
               Operating System
               Compiler       Firmware
               ---- Instruction Set Architecture ----
               Instr. Set Proc.    I/O system
               Datapath & Control
               Digital Design
               Circuit Design
               Layout
  • Coordination of many levels of abstraction
  • Under a rapidly changing set of forces
  • Design, Measurement, and Evaluation
                     Computer Architecture is
                       Design and Analysis

                      Architecture is an iterative process:
                      • Searching the space of possible designs
                      • At all levels of computer systems

[Diagram: a loop in which creativity drives Design, cost/performance Analysis
 filters the results, and good ideas survive while mediocre and bad ideas are
 discarded.]
                      Coping with 6304
• Review:
   – Chapters 1 to 7 of Computer Organization & Design (3rd
     edition), if you never took the prerequisite
   – If you took such a class, make sure COD Chapters 2, 5, 6, and 7 are
     familiar
• Quiz 1: a diagnostic 30-min test next Thursday (08/28)
   – Questions from undergraduate course material
   – Counts toward the grade
   – EE4304 lecture notes and sample questions are on WebCT
• You are a graduate student, so you
   – should get used to reading research papers
       » this helps you think of possible new ideas
   – should do a project
        » that helps you understand architecture better
        » that teaches you how to design and evaluate new architectures
          or enhancements to existing architectures at your job
• This course will help you with both
                        Grading (TBD)
                          from previous years
• 15% Homework
• 10% Quizzes
   – Each quiz based on a research paper
   – 3 or 4 quizzes in the semester
   – List of papers online (will be updated as semester progresses)
• 50% Two Examinations
• 20% Project (work in a team of two)
    –   transition from undergrad to grad student
    –   we want you to succeed, but you need to show initiative
    –   decide on the team (send email by 09/30/2009)
    –   meet at least once a week with faculty/TA to review progress
    –   demonstration in final exam week (absolutely necessary)
    –   a written report is mandatory
    –   opportunity to do “research in the small” to help make the transition
        from good student to research colleague
• 5% Class Participation
         Final grades (TBD)
              from previous years

• Final grades typically will be:
  – At least 88% for an A
  – At least 78% for a B
• Remember, final grade is based on your
  overall performance
  – Exams, homework, quiz, project, and class participation
             Class participation
• Merely attending ~100% of the classes will not
  automatically earn you the 5 points
• Ask questions in class
  – Questions will help you and your classmates understand better
  – Questions will stimulate my teaching
  – Questions will help me pause and reorganize my explanation
  – Questions will initiate good discussion
  – And some questions make me look at things I have not
    thought about before
  – Remember, all questions are equally important
      » No question is silly
• Occasionally meet me and/or TA to have discussion
  on issues that interest you
• Show initiative in knowing how far and how much
  you can do in project
                    Miscellaneous

• Course webpage
  – http://www.utdallas.edu/~vgo071000/classes/EECECS6304
    /Lectures
• Keep checking the webpage for periodic
  updates
• Lecture notes, homework, and solutions will
  be made available ONLY ON the web
  – Course webpage will only reflect given dates, due dates,
    reference material, and other misc. items.
• Office Hours: MW 11:00 a.m.-12:00 p.m.
  – Office: ECS North 4.914
• If you need to meet me outside office
  hours, please send email for an appointment
                          Miscellaneous
• In general, announcements will be made by email
• Email will be sent ONLY to your abc@utdallas.edu
• You can send me an email
   – for general questions or
   – for seeking appointment outside office hours
• Response is guaranteed ONLY if you send email from your
  abc@utdallas.edu address
• If you have a technical question on
  homeworks/projects/exams, please try to meet me or TA in
  person
   – Please do not send email with your question
        » It is hard to explain technical points over email
• Be aware of
   – penalties for late submission
   – academic dishonesty policies
• Feel free to send feedback to improve the quality of the course
Understanding & Quantifying
       Performance
                 Which is faster?

   Plane              DC to Paris   Speed      Passengers   Throughput (pmph)

   Boeing 747         6.5 hours     610 mph    470          286,700

   BAC/Sud Concorde   3 hours       1350 mph   132          178,200

• Time to run the task (ExTime)
  – Execution time, response time, latency
• Tasks per day, hour, week, sec, ns …
  (Performance)
  – Throughput, bandwidth
Definitions
• Performance is in units of things per sec
     – bigger is better
• If we are primarily concerned with response time:

    performance(X) = 1 / execution_time(X)

• “X is n times faster than Y” means:

    n = Performance(X) / Performance(Y) = Execution_time(Y) / Execution_time(X)
Processor performance equation

    CPU time = Seconds / Program
             = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)
             = Instruction Count × CPI × Cycle Time

                        Inst Count      CPI        Clock Rate
      Program                X
      Compiler               X          (X)
      Inst. Set              X           X
      Organization                       X             X
      Technology                                       X
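
A minimal Python sketch of this equation; the instruction count and clock rate below are made-up numbers, not from the slides:

# Processor performance equation with hypothetical numbers.
instruction_count = 2_000_000        # dynamic instructions executed (made up)
cpi = 1.5                            # average clock cycles per instruction
clock_rate_hz = 2e9                  # 2 GHz clock (made up)
cycle_time_s = 1 / clock_rate_hz     # seconds per cycle

cpu_time_s = instruction_count * cpi * cycle_time_s
print(f"CPU time = {cpu_time_s * 1e3:.3f} ms")   # 1.500 ms for these numbers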
               Cycles Per Instruction
                    (Throughput)

“Average Cycles per Instruction”

    CPI = (CPU Time × Clock Rate) / Instruction Count
        = Cycles / Instruction Count

    CPU time = Cycle Time × Σ (from j = 1 to n) CPI_j × I_j

    CPI = Σ (from j = 1 to n) CPI_j × F_j ,   where F_j = I_j / Instruction Count

                                 (F_j is the “Instruction Frequency”)
   Example: Calculating CPI bottom up

Base Machine (Reg / Reg)
Op              Freq     Cycles   CPI(i)   (% Time)
ALU             50%      1         .5      (33%)
Load            20%      2         .4      (27%)
Store           10%      2         .2      (13%)
Branch          20%      2         .4      (27%)
Total CPI                         1.5

(A typical mix of instruction types in a program.)
           Example: Branch Stall Impact


• Assume CPI = 1.0 ignoring branches (ideal)
• Assume branch was stalling for 3 cycles
• If 30% branch, Stall 3 cycles on 30%

• Op         Freq   Cycles CPI(i) (% Time)
• Other      70%    1        .7   (37%)
• Branch     30%    4      1.2    (63%)


• => new CPI = 1.9
• New machine is 1/1.9 = 0.52 times as fast (i.e., slower!)
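
A small sketch that recomputes both CPI tables above from their frequency/cycle mixes (the numbers are exactly the ones on the slides):

# Weighted CPI = sum over instruction classes of (frequency x cycles).
def weighted_cpi(mix):
    return sum(freq * cycles for freq, cycles in mix.values())

base = {"ALU": (0.5, 1), "Load": (0.2, 2), "Store": (0.1, 2), "Branch": (0.2, 2)}
stalled = {"Other": (0.7, 1), "Branch": (0.3, 4)}   # branches now take 1 + 3 stall cycles

print(weighted_cpi(base))             # 1.5
print(weighted_cpi(stalled))          # 1.9
print(1.0 / weighted_cpi(stalled))    # ~0.52: the stalled machine runs at ~52% of ideal speed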
       Speed Up Equation for Pipelining

    CPI_pipelined = Ideal CPI + Average stall cycles per instruction

For a simple RISC pipeline, Ideal CPI = 1:

    Speedup = [ 1 / (1 + Pipeline stall CPI) ] × (Cycle Time_unpipelined / Cycle Time_pipelined)
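
A quick sketch of this formula. The 5x cycle-time ratio is hypothetical; the 0.9 stall CPI is the 30% branches x 3 stall cycles from the example above:

# Speedup = [1 / (1 + pipeline stall CPI)] * (unpipelined cycle time / pipelined cycle time)
def pipeline_speedup(stall_cpi, cycle_time_ratio):
    return (1.0 / (1.0 + stall_cpi)) * cycle_time_ratio

print(pipeline_speedup(0.9, 5.0))   # ~2.6x over the unpipelined machine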
             Making common case fast
• Architects often spend tremendous effort and time optimizing some
  aspect of a system
   – only to realize later that the overall speedup is unrewarding
• So it is better to measure how heavily that aspect of the system is
  used before attempting to optimize it
• In making a design trade-off
   – Favor the frequent case over the infrequent case
• In allocating additional resources
   – Allocate to improve frequent event, rather than a rare event



So, what principle quantifies this scenario?
                         Amdahl’s Law
    ExTime_new = ExTime_old × [ (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]

    Speedup_overall = ExTime_old / ExTime_new
                    = 1 / [ (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]

Best you could ever hope to do:

    Speedup_maximum = 1 / (1 − Fraction_enhanced)
              Amdahl’s Law example
• New CPU 10X faster
• I/O bound server, so 60% time waiting for I/O



    Speedup_overall = 1 / [ (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]
                    = 1 / [ (1 − 0.4) + 0.4 / 10 ]
                    = 1 / 0.64
                    = 1.56

 • Apparently, it is human nature to be attracted by the “10X faster” part,
   instead of keeping in perspective that the system is just 1.56X faster
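
A minimal sketch of Amdahl's Law applied to this example (and to the limiting case of an infinitely fast CPU):

# Amdahl's Law: overall speedup when only part of the execution time is enhanced.
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

print(amdahl_speedup(0.4, 10))     # ~1.56: only 40% of the time benefits from the 10x CPU
print(amdahl_speedup(0.4, 1e9))    # ~1.67: the best you could ever hope for here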
                  Amdahl’s law

• For example:
  – A program takes a certain time to execute on a processor, of which
    60% is consumed by floating-point operations
  – Say the floating-point hardware in the processor is enhanced
  – Now, the floating-point operations in the program consume only 40%
    of the execution time

  – So, by what factor was the floating-point hardware enhanced?
  – What is the overall speedup?
    What Computer Architecture brings to Table
•    Other fields often borrow ideas from
     architecture
•    Quantitative Principles of Design
     1.   Take Advantage of Parallelism
     2.   Principle of Locality
     3.   Focus on the Common Case
     4.   Amdahl’s Law
     5.   The Processor Performance Equation
•    Careful, quantitative comparisons
     –    Define, quantify, and summarize relative performance
     –    Define and quantify relative cost
     –    Define and quantify dependability
     –    Define and quantify power
•    Culture of anticipating and exploiting advances in
     technology
•    Culture of well-defined interfaces that are
     carefully implemented and thoroughly checked
   1) Taking Advantage of Parallelism
• Increasing throughput of server computer via
  multiple processors or multiple disks
• Detailed HW design
   – Carry-lookahead adders use parallelism to speed up computing
     sums from linear to logarithmic in the number of bits per operand
   – Multiple memory banks searched in parallel in set-associative
     caches
• Pipelining: overlap instruction execution to reduce
  the total time to complete an instruction
  sequence (a timing sketch follows the stage list below).
   – Not every instruction depends on its immediate predecessor ⇒
     executing instructions completely/partially in parallel is possible
   – Classic 5-stage pipeline:
     1) Instruction Fetch (Ifetch),
     2) Register Read (Reg),
     3) Execute (ALU),
     4) Data Memory Access (Dmem),
     5) Register Write (Reg)
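
A sketch of the ideal case (hypothetical instruction count, hazards ignored): N instructions on a k-stage pipeline take k + N - 1 cycles instead of k x N:

# Ideal pipelining sketch: k-stage pipeline, N independent instructions, no hazards.
def unpipelined_cycles(n_instr, n_stages):
    return n_instr * n_stages            # each instruction runs start to finish alone

def pipelined_cycles(n_instr, n_stages):
    return n_stages + n_instr - 1        # fill the pipe once, then finish 1 instr/cycle

n, k = 1000, 5
print(unpipelined_cycles(n, k))                            # 5000 cycles
print(pipelined_cycles(n, k))                              # 1004 cycles
print(unpipelined_cycles(n, k) / pipelined_cycles(n, k))   # ~5x, approaching the stage count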
      Pipelined Instruction Execution

[Figure: instructions in program order vs. time (clock cycles 1-7). Each instruction
 passes through Ifetch, Reg, ALU, DMem, Reg; a new instruction starts every cycle, so
 once the pipeline fills, all five stages work on different instructions at once.]
           2) The Principle of Locality
• The Principle of Locality:
   – Programs access a relatively small portion of the address space at
     any instant of time.
• Two Different Types of Locality:
   – Temporal Locality (Locality in Time): If an item is referenced, it
     will tend to be referenced again soon (e.g., loops, reuse)
   – Spatial Locality (Locality in Space): If an item is referenced,
     items whose addresses are close by tend to be referenced soon
     (e.g., straight-line code, array access)
• Last 30 years, HW relied on locality for memory
  perf.

[Diagram: Processor (P) - Cache ($) - Memory (MEM)]
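
A toy sketch of why locality pays off, using a made-up direct-mapped cache (64 lines of 64 bytes); access streams with locality hit far more often than strided ones:

# Toy direct-mapped cache model (the 64-line x 64-byte geometry is hypothetical).
def hit_rate(addresses, num_lines=64, line_bytes=64):
    tags = [None] * num_lines
    hits = 0
    for addr in addresses:
        block = addr // line_bytes          # which memory block the address belongs to
        index = block % num_lines           # direct-mapped placement
        if tags[index] == block:
            hits += 1                       # reuse of a recently touched line
        else:
            tags[index] = block             # miss: fetch the line
    return hits / len(addresses)

sequential = [4 * i for i in range(4096)]              # walk an array of 4-byte words
strided = [(4096 * i) % 16384 for i in range(4096)]    # huge stride: blocks evict each other

print(hit_rate(sequential))   # ~0.94: spatial locality, 15 of every 16 accesses hit
print(hit_rate(strided))      # 0.0: no locality for the cache to exploit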
           Levels of the Memory Hierarchy
           (upper levels are smaller and faster; lower levels are larger)

  CPU Registers: 100s bytes, 300-500 ps (0.3-0.5 ns)
        ↓  Instr. operands, 1-8 bytes, moved by the program/compiler
  L1 and L2 Cache: 10s-100s KBytes, ~1 ns - ~10 ns, ~$1000s/GByte
        ↓  Blocks, 32-64 bytes (L1) and 64-128 bytes (L2), moved by the cache controller
  Main Memory: GBytes, 80 ns - 200 ns, ~$100/GByte
        ↓  Pages, 4K-8K bytes, moved by the OS
  Disk: 10s TBytes, 10 ms (10,000,000 ns), ~$1/GByte
        ↓  Files, MBytes, moved by the user/operator
  Tape: infinite capacity, sec-min access, ~$1/GByte
       3) Focus on the Common Case
• Common sense guides computer design
  – Since it is engineering, common sense is valuable
• In making a design trade-off, favor the frequent
  case over the infrequent case
  – E.g., Instruction fetch and decode unit used more frequently
    than multiplier, so optimize it 1st
  – E.g., If database server has 50 disks / processor, storage
    dependability dominates system dependability, so optimize it
    1st
• Frequent case is often simpler and can be done
  faster than the infrequent case
  – E.g., overflow is rare when adding 2 numbers, so improve
    performance by optimizing more common case of no overflow
  – May slow down overflow, but overall performance improved by
    optimizing for the normal case
• What the frequent case is, and how much performance improves by
  making that case faster, is quantified by Amdahl’s Law
                       4) Amdahl’s Law
    ExTime_new = ExTime_old × [ (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]

    Speedup_overall = ExTime_old / ExTime_new
                    = 1 / [ (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]

Best you could ever hope to do:

    Speedup_maximum = 1 / (1 − Fraction_enhanced)
5) Processor performance equation

    CPU time = Seconds / Program
             = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)
             = Instruction Count × CPI × Cycle Time

                        Inst Count      CPI        Clock Rate
      Program                X
      Compiler               X          (X)
      Inst. Set              X           X
      Organization                       X             X
      Technology                                       X
                          Summary
• Modern Computer Architecture is about
  managing and optimizing across several levels
  of abstraction with respect to dramatically changing
  technology and application load
• Key Abstractions
   – instruction set architecture
   – memory
   – bus
• Key concepts
   –   HW/SW boundary
   –   Compile Time / Run Time
   –   Pipelining
   –   Caching
• Performance “Iron Triangle” relates their combined
  effects
   – Total Time = Inst. Count × CPI × Cycle Time
EE (CE) 6304 Computer Architecture

           Introduction




Additional reference material
              Review from last lecture
• Computer Architecture >> instruction sets
• Computer Architecture skill sets are different
   –   5 Quantitative principles of design
   –   Quantitative approach to design
   –   Solid interfaces that really work
   –   Technology tracking and anticipation
• 6304: to learn new skills, transition to research
• Computer Architecture at the crossroads from
  sequential to parallel computing
   – Salvation requires innovation in many fields, including
     computer architecture
    Review: Computer Architecture brings
•   Other fields often borrow ideas from
    architecture
•   Quantitative Principles of Design
    1.   Take Advantage of Parallelism
    2.   Principle of Locality
    3.   Focus on the Common Case
    4.   Amdahl’s Law
    5.   The Processor Performance Equation
•   Careful, quantitative comparisons
    –    Define, quantify, and summarize relative performance
    –    Define and quantify relative cost
    –    Define and quantify dependability
    –    Define and quantify power
•   Culture of anticipating and exploiting advances in
    technology
•   Culture of well-defined interfaces that are
    carefully implemented and thoroughly checked
                    Outline


•    Review
•    Technology Trends: Culture of tracking,
     anticipating and exploiting advances in
     technology
•    Careful, quantitative comparisons:
1.   Define, quantify, and summarize relative
     performance
2.   Define and quantify relative cost
3.   Define and quantify dependability
4.   Define and quantify power
         Moore’s Law: 2X transistors / “year”




•   “Cramming More Components onto Integrated Circuits”
     –   Gordon Moore, Electronics, 1965
•   The number of transistors on a cost-effective integrated circuit doubles every
    N months (12 ≤ N ≤ 24)
        Tracking Technology Performance Trends
• Drill down into 4 technologies:
   –   Disks,
   –   Memory,
   –   Network,
   –   Processors
• Compare ~1980 Archaic (Nostalgic) vs.
  ~2000 Modern (Newfangled)
   – Performance Milestones in each technology
• Compare for Bandwidth vs. Latency improvements
  in performance over time
• Bandwidth: number of events per unit time
   – E.g., M bits / second over network, M bytes / second from
     disk
• Latency: elapsed time for a single event
   – E.g., one-way network delay in microseconds,
     average disk access time in milliseconds
Disks: Archaic(Nostalgic) v. Modern(Newfangled)

•   CDC Wren I, 1983       • Seagate 373453, 2003
•   3600 RPM               • 15000 RPM             (4X)
•   0.03 GBytes capacity   • 73.4 GBytes        (2500X)
•   Tracks/Inch: 800       • Tracks/Inch: 64000 (80X)
•   Bits/Inch: 9550        • Bits/Inch: 533,000 (60X)
•   Three 5.25” platters   • Four 2.5” platters
                             (in 3.5” form factor)
• Bandwidth:               • Bandwidth:
  0.6 MBytes/sec             86 MBytes/sec       (140X)
• Latency: 48.3 ms         • Latency: 5.7 ms       (8X)
• Cache: none              • Cache: 8 MBytes
       Latency Lags Bandwidth (for last ~20 years)

[Figure: relative bandwidth improvement vs. relative latency improvement (log-log);
 the disk curve lies well above the line where latency improvement = bandwidth
 improvement.]

• Performance Milestones
• Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x latency, 143x bandwidth)
  (latency = simple operation w/o contention; BW = best case)
       Memory: Archaic (Nostalgic) v. Modern (Newfangled)

• 1980 DRAM              • 2000 Double Data Rate Synchr.
  (asynchronous)           (clocked) DRAM
• 0.06 Mbits/chip        • 256.00 Mbits/chip     (4000X)
• 64,000 xtors, 35 mm2   • 256,000,000 xtors, 204 mm2
• 16-bit data bus per    • 64-bit data bus per
  module, 16 pins/chip     DIMM, 66 pins/chip       (4X)
• 13 Mbytes/sec          • 1600 Mbytes/sec        (120X)
• Latency: 225 ns        • Latency: 52 ns           (4X)
• (no block transfer)    • Block transfers (page mode)
      Latency Lags Bandwidth (last ~20 years)

[Figure: same plot, with both the memory and disk curves well above the
 latency-improvement = bandwidth-improvement line.]

• Performance Milestones
• Memory Module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM,
  DDR SDRAM (4x latency, 120x bandwidth)
• Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
  (latency = simple operation w/o contention; BW = best case)
          LANs: Archaic (Nostalgic) v. Modern (Newfangled)

    • Ethernet 802.3                         • Ethernet 802.3ae
    • Year of Standard: 1978                 • Year of Standard: 2003
    • 10 Mbits/s link speed                  • 10,000 Mbits/s link speed (1000X)
    • Latency: 3000 µsec                     • Latency: 190 µsec (15X)
    • Shared media                           • Switched media
    • Coaxial cable                          • Category 5 copper wire
                                               ("Cat 5" is 4 twisted pairs in a bundle)

[Diagram: coaxial cable (plastic covering, braided outer conductor, insulator,
 copper core) vs. twisted pair (copper, 1 mm thick, twisted to avoid antenna effect).]
      Latency Lags Bandwidth (last ~20 years)

[Figure: same plot, with the network, memory, and disk curves all above the
 latency-improvement = bandwidth-improvement line.]

• Performance Milestones
• Ethernet: 10Mb, 100Mb, 1000Mb, 10000 Mb/s (16x latency, 1000x bandwidth)
• Memory Module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM,
  DDR SDRAM (4x, 120x)
• Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
  (latency = simple operation w/o contention; BW = best case)
         CPUs: Archaic (Nostalgic) v. Modern (Newfangled)

• 1982 Intel 80286           • 2001 Intel Pentium 4
• 12.5 MHz                   • 1500 MHz             (120X)
• 2 MIPS (peak)              • 4500 MIPS (peak) (2250X)
• Latency 320 ns             • Latency 15 ns          (20X)
• 134,000 xtors, 47 mm2      • 42,000,000 xtors, 217 mm2
• 16-bit data bus, 68 pins   • 64-bit data bus, 423 pins
• Microcode interpreter,     • 3-way superscalar,
  separate FPU chip            Dynamic translate to RISC,
• (no caches)                  Superpipelined (22 stage),
                               Out-of-Order execution
                             • On-chip 8KB Data caches,
                               96KB Instr. Trace cache,
                               256KB L2 cache
           Latency Lags Bandwidth (last ~20 years)

[Figure: same plot, with processor, network, memory, and disk curves; the processor
 curve is highest and the memory curve lowest (CPU high, memory low: the “Memory
 Wall”).]

• Performance Milestones
• Processor: ’286, ’386, ’486, Pentium, Pentium Pro, Pentium 4 (21x latency,
  2250x bandwidth)
• Ethernet: 10Mb, 100Mb, 1000Mb, 10000 Mb/s (16x, 1000x)
• Memory Module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM,
  DDR SDRAM (4x, 120x)
• Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
  (latency = simple operation w/o contention; BW = best case)
   Rule of Thumb for Latency Lagging BW

• In the time that bandwidth doubles, latency
  improves by no more than a factor of 1.2 to
  1.4
   (and capacity improves faster than bandwidth)
• Stated alternatively:
 Bandwidth improves by more than the square
 of the improvement in Latency
           Computers in the News

• “Intel loses market share in own backyard,”
  By Tom Krazit, CNET News.com, 1/18/2006
• “Intel's share of the U.S. retail PC market fell by
  11 percentage points, from 64.4 percent in the
  fourth quarter of 2004 to 53.3 percent. … Current
  Analysis' market share numbers measure U.S.
  retail sales only, and therefore exclude figures
  from Dell, which uses its Web site to sell directly
  to consumers. …
  AMD chips were found in 52.5 percent of desktop
  PCs sold in U.S. retail stores during that period.”
• Technical advantages of AMD Opteron/Athlon vs.
  Intel Pentium 4 as we’ll see in this course.
            6 Reasons Latency Lags Bandwidth

1.       Moore’s Law helps BW more than latency
     •     Faster transistors, more transistors,
           more pins help Bandwidth
          » MPU Transistors:          0.130 vs. 42 M xtors       (300X)
          » DRAM Transistors: 0.064 vs. 256 M xtors             (4000X)
          » MPU Pins:                 68 vs. 423 pins              (6X)
          » DRAM Pins:                16 vs. 66 pins               (4X)
     •     Smaller, faster transistors but communicate
           over (relatively) longer lines: limits latency
          » Feature size:             1.5 to 3 vs. 0.18 micron (8X, 17X)
          » MPU Die Size:             35 vs. 204 mm2 (ratio of sqrt ⇒ 2X)
          » DRAM Die Size:            47 vs. 217 mm2 (ratio of sqrt ⇒ 2X)
      6 Reasons Latency Lags Bandwidth (cont’d)

2. Distance limits latency
  •    Size of DRAM block ⇒ long bit and word lines
       ⇒ most of DRAM access time
  •    Speed of light and computers on network
  •    1. & 2. explain linear latency vs. square BW?
3. Bandwidth easier to sell (“bigger=better”)
  •    E.g., 10 Gbits/s Ethernet (“10 Gig”) vs.
              10 msec latency Ethernet
  •    4400 MB/s DIMM (“PC4400”) vs. 50 ns latency
  •    Even if just marketing, customers now trained
  •    Since bandwidth sells, more resources thrown at bandwidth,
       which further tips the balance
6 Reasons Latency Lags Bandwidth (cont’d)

 4. Latency helps BW, but not vice versa
    •    Spinning the disk faster improves both bandwidth and
         rotational latency
        » 3600 RPM ⇒ 15000 RPM = 4.2X
        » Average rotational latency: 8.3 ms ⇒ 2.0 ms
        » Other things being equal, this also helps BW by 4.2X
    •    Lower DRAM latency ⇒
         more accesses/second (higher bandwidth)
    •    Higher linear density helps disk BW
          (and capacity), but not disk latency
        » 9,550 BPI ⇒ 533,000 BPI ⇒ 60X in BW
6 Reasons Latency Lags Bandwidth (cont’d)

5. Bandwidth hurts latency
  •   Queues help Bandwidth, hurt Latency (Queuing Theory)
  •   Adding chips to widen a memory module increases
      Bandwidth but higher fan-out on address lines may
      increase Latency
6. Operating System overhead hurts
    Latency more than Bandwidth
  •   Long messages amortize overhead;
      overhead bigger part of short messages
   Summary of Technology Trends

• For disk, LAN, memory, and microprocessor,
  bandwidth improves by square of latency
  improvement
   – In the time that bandwidth doubles, latency improves by no more
     than 1.2X to 1.4X
• Lag probably even larger in real systems, as
  bandwidth gains multiplied by replicated components
   –   Multiple processors in a cluster or even in a chip
   –   Multiple disks in a disk array
   –   Multiple memory modules in a large memory
   –   Simultaneous communication in switched LAN
• HW and SW developers should innovate assuming
  Latency Lags Bandwidth
   – If everything improves at the same rate, then nothing really
     changes
   – When rates vary, require real innovation
                    Outline


•    Review
•    Technology Trends: Culture of tracking,
     anticipating and exploiting advances in
     technology
•    Careful, quantitative comparisons:
1.   Define and quantify power
2.   Define and quantify dependability
3.   Define, quantify, and summarize relative
     performance
4.   Define and quantify relative cost
      Define and quantify power ( 1 / 2)
  • For CMOS chips, traditional dominant energy consumption
    has been in switching transistors, called dynamic power
    Power_dynamic = 1/2 × CapacitiveLoad × Voltage² × FrequencySwitched

  • For mobile devices, energy is the better metric:

    Energy_dynamic = CapacitiveLoad × Voltage²
  • For a fixed task, slowing clock rate (frequency switched)
    reduces power, but not energy
  • Capacitive load a function of number of transistors
    connected to output and technology, which determines
    capacitance of wires and transistors
  • Dropping voltage helps both, so went from 5V to 1V
  • To save energy & dynamic power, most CPUs now turn off
    clock of inactive modules (e.g. Fl. Pt. Unit)
           Example of quantifying power
  • Suppose 15% reduction in voltage results in a
    15% reduction in frequency. What is impact on
    dynamic power?

    Power_dynamic = 1/2 × CapacitiveLoad × Voltage² × FrequencySwitched

    Power_new = 1/2 × CapacitiveLoad × (0.85 × Voltage)² × (0.85 × FrequencySwitched)
              = (0.85)³ × OldPower_dynamic
              ≈ 0.6 × OldPower_dynamic
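
A minimal sketch of the same calculation in relative units (all quantities normalized to 1):

# Relative dynamic power: P_dyn = 1/2 * C * V^2 * f.
def dynamic_power(cap, voltage, freq):
    return 0.5 * cap * voltage ** 2 * freq

old = dynamic_power(1.0, 1.0, 1.0)
new = dynamic_power(1.0, 0.85, 0.85)    # 15% lower voltage and 15% lower frequency
print(new / old)                        # 0.85**3 ~ 0.61, roughly a 40% power reduction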
    Define and quantify power (2 / 2)
• Because leakage current flows even when a
  transistor is off, now static power important too
    Power_static = Current_static × Voltage

• Leakage current increases in processors with
  smaller transistor sizes
• Increasing the number of transistors increases
  power even if they are turned off
• In 2006, goal for leakage is 25% of total power
  consumption; high performance designs at 40%
• Very low power systems even gate voltage to
  inactive modules to control loss due to leakage
                    Outline


•    Review
•    Technology Trends: Culture of tracking,
     anticipating and exploiting advances in
     technology
•    Careful, quantitative comparisons:
1.   Define and quantify power
2.   Define and quantify dependability
3.   Define, quantify, and summarize relative
     performance
4.   Define and quantify relative cost
Define and quantify dependability (1/3)
•  How do we decide when a system is operating properly?
•  Infrastructure providers now offer Service Level
   Agreements (SLA) to guarantee that their
   networking or power service would be dependable
• Systems alternate between 2 states of service
   with respect to an SLA:
1. Service accomplishment, where the service is
   delivered as specified in SLA
2. Service interruption, where the delivered service
   is different from the SLA
• Failure = transition from state 1 to state 2
• Restoration = transition from state 2 to state 1
Define and quantify dependability (2/3)
•   Module reliability = measure of continuous service
    accomplishment (or of time to failure).
    Two metrics:
1. Mean Time To Failure (MTTF) measures Reliability
2. Failures In Time (FIT) = 1/MTTF, the rate of failures
    •   Traditionally reported as failures per billion hours of operation
•   Mean Time To Repair (MTTR) measures Service
    Interruption
    –   Mean Time Between Failures (MTBF) = MTTF+MTTR
•   Module availability measures service as alternate between
    the 2 states of accomplishment and interruption (number
    between 0 and 1, e.g. 0.9)
•   Module availability = MTTF / ( MTTF + MTTR)
      Example calculating reliability
•   If modules have exponentially distributed
    lifetimes (age of module does not affect
    probability of failure), overall failure rate
    is the sum of failure rates of the modules
•   Calculate FIT and MTTF for 10 disks (1M
    hour MTTF per disk), 1 disk controller
    (0.5M hour MTTF), and 1 power supply
    (0.2M hour MTTF):
     FailureRate = ?

     MTTF = ?
      Example calculating reliability
•   If modules have exponentially distributed lifetimes
    (age of module does not affect probability of
    failure), overall failure rate is the sum of failure
    rates of the modules
•   Calculate FIT and MTTF for 10 disks (1M hour
    MTTF per disk), 1 disk controller (0.5M hour
    MTTF), and 1 power supply (0.2M hour MTTF):

FailureRate = 10 × (1 / 1,000,000) + 1 / 500,000 + 1 / 200,000
            = (10 + 2 + 5) / 1,000,000
            = 17 / 1,000,000
            = 17,000 FIT
       MTTF = 1,000,000,000 / 17,000
            ≈ 59,000 hours
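
A small sketch of the same calculation; with exponentially distributed lifetimes, component failure rates simply add:

# Disk-subsystem reliability example from the slide above.
mttf_hours = [1_000_000] * 10 + [500_000, 200_000]   # 10 disks, 1 controller, 1 power supply

failure_rate = sum(1.0 / m for m in mttf_hours)      # failures per hour
fit = failure_rate * 1e9                             # failures per 10^9 hours
system_mttf = 1.0 / failure_rate

print(f"{fit:,.0f} FIT")             # 17,000 FIT
print(f"{system_mttf:,.0f} hours")   # ~58,824 hours, i.e. about 59,000 hours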
                    Outline


•    Review
•    Technology Trends: Culture of tracking,
     anticipating and exploiting advances in
     technology
•    Careful, quantitative comparisons:
1.   Define and quantify power
2.   Define and quantify dependability
3.   Define, quantify, and summarize relative
     performance
4.   Define and quantify relative cost
      Definition: Performance
• Performance is in units of things per sec
     – bigger is better
• If we are primarily concerned with response time:

    performance(X) = 1 / execution_time(X)

• “X is n times faster than Y” means:

    n = Performance(X) / Performance(Y) = Execution_time(Y) / Execution_time(X)
        Performance: What to measure
• Usually rely on benchmarks vs. real workloads
• To increase predictability, collections of benchmark
  applications, called benchmark suites, are popular
• SPECCPU: popular desktop benchmark suite
   –   CPU only, split between integer and floating point programs
   –   SPECint2000 has 12 integer programs, SPECfp2000 has 14 floating-point programs
   –   SPECCPU2006 to be announced Spring 2006
   –   SPECSFS (NFS file server) and SPECWeb (WebServer) added as
       server benchmarks
• Transaction Processing Council measures server performance
  and cost-performance for databases
   –   TPC-C Complex query for Online Transaction Processing
   –   TPC-H models ad hoc decision support
   –   TPC-W a transactional web benchmark
   –   TPC-App application server and web services benchmark
   How Summarize Suite Performance (1/5)
• Arithmetic average of execution time of all pgms?
   – But they vary by 4X in speed, so some would be more
     important than others in arithmetic average
• Could add a weight per program, but how do we pick the
  weights?
   – Different companies want different weights for their products
• SPECRatio: normalize execution times to a
  reference computer, yielding a ratio proportional
  to performance:

    SPECRatio = time on reference computer / time on computer being rated
 How Summarize Suite Performance (2/5)
• If program SPECRatio on Computer A is
  1.25 times bigger than on Computer B, then

    1.25 = SPECRatio_A / SPECRatio_B
         = (ExecutionTime_reference / ExecutionTime_A) /
           (ExecutionTime_reference / ExecutionTime_B)
         = ExecutionTime_B / ExecutionTime_A
         = Performance_A / Performance_B

• Note that when comparing 2 computers as a
  ratio, execution times on the reference
  computer drop out, so choice of reference
  computer is irrelevant
    How Summarize Suite Performance (3/5)
• Since these are ratios, the proper mean is the geometric mean
  (SPECRatio is unitless, so the arithmetic mean is meaningless)

    GeometricMean = ( Π (from i = 1 to n) SPECRatio_i )^(1/n)

   1. Geometric mean of the ratios is the same as
      the ratio of the geometric means
   2. Ratio of geometric means
      = geometric mean of performance ratios
      ⇒ choice of reference computer is irrelevant!
   • These two points make geometric mean of ratios
      attractive to summarize performance
      How Summarize Suite Performance (4/5)
• Does a single mean well summarize performance of programs in
  benchmark suite?
• Can decide if mean a good predictor by characterizing
  variability of distribution using standard deviation
• Like geometric mean, geometric standard deviation is
  multiplicative rather than arithmetic
• Can simply take the logarithm of SPECRatios, compute the
  standard mean and standard deviation, and then take the
  exponent to convert back:



    GeometricMean = exp( (1/n) × Σ (from i = 1 to n) ln(SPECRatio_i) )

    GeometricStDev = exp( StDev( ln(SPECRatio_i) ) )
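
A minimal sketch of this log-based recipe; the SPECRatios listed here are made-up numbers, not measured results:

# Geometric mean and multiplicative standard deviation via logarithms.
import math

def geo_mean(ratios):
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

def geo_stdev(ratios):
    logs = [math.log(r) for r in ratios]
    mean = sum(logs) / len(logs)
    var = sum((x - mean) ** 2 for x in logs) / (len(logs) - 1)   # sample variance of the logs
    return math.exp(math.sqrt(var))

ratios = [1200, 2500, 3100, 1800, 4200, 2900]    # hypothetical SPECRatios
gm, gsd = geo_mean(ratios), geo_stdev(ratios)
print(gm, gsd)
print(gm / gsd, gm * gsd)   # ~68% of ratios expected inside this band if lognormal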
 How Summarize Suite Performance (5/5)
• Standard deviation is more informative if know
  distribution has a standard form
    – bell-shaped normal distribution, whose data are symmetric
      around mean
    – lognormal distribution, where logarithms of data--not data
      itself--are normally distributed (symmetric) on a logarithmic
      scale


• For a lognormal distribution, we expect that
    68% of samples fall in the range [ mean / gstdev ,  mean × gstdev ]
    95% of samples fall in the range [ mean / gstdev² , mean × gstdev² ]
• Note: Excel provides functions EXP(), LN(), and
  STDEV() that make calculating geometric mean and
  multiplicative standard deviation easy
   Example Standard Deviation (1/2)
• GM and multiplicative StDev of SPECfp2000 for Itanium 2

[Figure: SPECfpRatio for the 14 SPECfp2000 programs (wupwise, swim, mgrid, applu,
 mesa, galgel, art, equake, facerec, ammp, lucas, fma3d, sixtrack, apsi) on Itanium 2.
 GM = 2712, GSTDEV = 1.98; the band from GM/gstdev to GM×gstdev spans roughly 1372 to 5362.]
      Example Standard Deviation (2/2)
• GM and multiplicative StDev of SPECfp2000 for AMD Athlon

[Figure: SPECfpRatio for the same 14 programs on the AMD Athlon.
 GM = 2086, GSTDEV = 1.40; the band from GM/gstdev to GM×gstdev spans roughly 1494 to 2911.]
 Comments on Itanium 2 and Athlon

• The standard deviation of 1.98 for Itanium 2 is much
  higher than the Athlon’s 1.40, so Itanium 2 results differ more widely
  from the mean, and are therefore likely less
  predictable
• Falling within one standard deviation:
    – 10 of 14 benchmarks (71%) for Itanium 2
    – 11 of 14 benchmarks (78%) for Athlon
• Thus, the results are quite compatible with a
  lognormal distribution (expect 68%)
                  And in conclusion …
• Tracking and extrapolating technology part of architect’s
  responsibility
• Expect Bandwidth in disks, DRAM, network, and
  processors to improve by at least as much as the square
  of the improvement in Latency
• Quantify dynamic and static power
   – Capacitance × Voltage² × frequency; Energy vs. power
• Quantify dependability
   – Reliability (MTTF, FIT), Availability (99.9…)
• Quantify and summarize performance
   – Ratios, Geometric Mean, Multiplicative Standard Deviation
• Read Appendix A, record bugs online!
EE (CE) 6304 Computer Architecture



            Appendix A



Enhancing Performance with Pipelining
Execution Cycle
   Instruction   Obtain instruction from program storage
     Fetch

   Instruction   Determine required actions and instruction size
    Decode

    Operand      Locate and obtain operand data
     Fetch

    Execute      Compute result value or status

     Result      Deposit results in storage for later use
     Store

      Next
                 Determine successor instruction
   Instruction
             What’s a Clock Cycle?

[Diagram: a latch or register feeding combinational logic, which feeds the next
 latch or register.]




• Old days: 10 levels of gates
• Today: determined by numerous time-of-flight
  issues + gate delays
  – clock propagation, wire lengths, drivers
Fast, Pipelined Instruction Interpretation

[Figure: stages Next Instruction (NI), Instruction Fetch (IF), Decode & Operand
 Fetch (D), Execute (E), and Store Results (W), separated by pipeline registers
 (Instruction Address, Instruction Register, Operand Registers, Result Registers,
 Registers or Mem); over time, successive instructions occupy successive stages,
 so several instructions are in flight at once.]
                   Sequential Laundry

[Figure: four loads A-D run one after another from 6 PM to midnight; each load
 takes 30 + 40 + 20 minutes (wash, dry, fold).]

• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?
                     Pipelined Laundry
                     Start work ASAP
[Figure: the same four loads overlapped — each load starts washing as soon as the washer is free, so the 40-minute dryer stage sets the pace.]

• Pipelined laundry takes 3.5 hours for 4 loads
                   Pipelining Lessons

• Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to "fill" the pipeline and time to "drain" it reduce speedup
            Instruction Pipelining

• Execute billions of instructions, so throughput is
  what matters
• What is desirable in instruction sets for
  pipelining?
   – Variable length instructions vs.
     all instructions same length?
   – Memory operands part of any operation vs.
     memory operands only in loads or stores?
   – Register operand many places in instruction
     format vs. registers located in same place?
Example: MIPS (note the fixed register locations)

Register-Register:
   bits 31-26: Op | 25-21: Rs1 | 20-16: Rs2 | 15-11: Rd | 10-6: — | 5-0: Opx

Register-Immediate:
   bits 31-26: Op | 25-21: Rs1 | 20-16: Rd | 15-0: immediate

Branch:
   bits 31-26: Op | 25-21: Rs1 | 20-16: Rs2/Opx | 15-0: immediate

Jump / Call:
   bits 31-26: Op | 25-0: target
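Because every format keeps the opcode and register specifiers in fixed bit positions, the decode hardware (or a simulator) can extract all fields with simple shifts and masks. A minimal C sketch of that extraction; the example word and the variable names are illustrative:

    #include <stdint.h>
    #include <stdio.h>

    /* Extract the fixed-position fields of a 32-bit MIPS instruction word. */
    int main(void) {
        uint32_t inst  = 0x012A4020;               /* encodes add $t0, $t1, $t2  */
        uint32_t op    = (inst >> 26) & 0x3F;      /* bits 31..26                */
        uint32_t rs    = (inst >> 21) & 0x1F;      /* bits 25..21 (Rs1)          */
        uint32_t rt    = (inst >> 16) & 0x1F;      /* bits 20..16 (Rs2 or Rd)    */
        uint32_t rd    = (inst >> 11) & 0x1F;      /* bits 15..11 (Rd, R-format) */
        uint32_t funct =  inst        & 0x3F;      /* bits 5..0   (Opx)          */
        int32_t  imm   = (int16_t)(inst & 0xFFFF); /* sign-extended immediate    */

        printf("op=%u rs=%u rt=%u rd=%u funct=%u imm=%d\n",
               op, rs, rt, rd, funct, imm);
        return 0;
    }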
5 Steps of MIPS Datapath
[Figure: the datapath laid out as five steps — Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, Write Back — with the PC and its adder, instruction memory, register file (RS1, RS2, RD), sign-extended immediate, ALU with Zero test, data memory, and the MUXes that select the next PC and the write-back data.]
5 Steps of MIPS Datapath (pipelined)
[Figure: the same five-step datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB inserted between the stages; the destination register specifier (RD) is carried along through the pipeline registers to the write-back stage.]
• Data stationary control
    – local decode for each instruction phase / pipeline stage
           Unpipelined datapath with control
[Figure: the single-cycle datapath together with its control — the main control unit decodes Instruction[31-26] into RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, and RegWrite, while the ALU control unit decodes Instruction[5-0]; the branch target is formed by shifting the sign-extended immediate left 2 and adding it to the incremented PC.]
Instruction   RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0
R-format         1       0        0         1         0        0        0       1       0
lw               0       1        1         1         1        0        0       0       0
sw               X       1        X         0         0        1        0       0       0
beq              X       0        X         0         0        0        1       0       1
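Since these signals are a pure function of the 6-bit opcode, the main control unit is just a small decoder. A hedged C sketch of that idea (the struct, the function name, and the bit-field layout are illustrative; the opcode values are the standard MIPS encodings):

    #include <stdint.h>

    /* Control outputs of the single-cycle main control unit
       (one bit each, except the 2-bit ALUOp field).          */
    struct control {
        unsigned reg_dst:1, alu_src:1, mem_to_reg:1, reg_write:1;
        unsigned mem_read:1, mem_write:1, branch:1, alu_op:2;
    };

    /* Decode the opcode (Instruction[31-26]) into the signals of the table above. */
    struct control main_control(uint32_t opcode) {
        struct control c = {0};
        switch (opcode) {
        case 0x00:  /* R-format */
            c.reg_dst = 1; c.reg_write = 1; c.alu_op = 2;   /* ALUOp = 10 */
            break;
        case 0x23:  /* lw */
            c.alu_src = 1; c.mem_to_reg = 1; c.reg_write = 1; c.mem_read = 1;
            break;
        case 0x2B:  /* sw */
            c.alu_src = 1; c.mem_write = 1;   /* RegDst and MemtoReg are don't-cares */
            break;
        case 0x04:  /* beq */
            c.branch = 1; c.alu_op = 1;       /* ALUOp = 01 */
            break;
        }
        return c;
    }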
Pipelined Datapath with Control
[Figure: the pipelined datapath with the control unit placed in the ID stage; its outputs are grouped into WB, M, and EX bundles that travel through the ID/EX, EX/MEM, and MEM/WB pipeline registers along with the data, and PCSrc selects between PC+4 and the branch target.]
                                  Pipeline Control
•   Pass control signals along just like the data

               Execution/Address Calculation       Memory access          Write-back
               stage control lines                 stage control lines    stage control lines
Instruction    RegDst  ALUOp1  ALUOp0  ALUSrc      Branch MemRead MemWrite    RegWrite MemtoReg
R-format          1       1       0       0           0      0       0            1        0
lw                0       0       0       1           0      1       0            1        1
sw                X       0       0       1           0      0       1            0        X
beq               X       0       1       0           1      0       0            0        X

[Figure: the control unit's outputs are split into EX, M, and WB groups; each group advances through ID/EX, EX/MEM, and MEM/WB one register per cycle, exactly like the data.]
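"Pass control signals along just like the data" can be pictured as each pipeline register carrying only the control bits that later stages still need. A hedged C sketch of that idea (the struct and field names are illustrative, not taken from the figure):

    /* Control bits grouped by the stage that consumes them. */
    struct wb_ctrl  { unsigned reg_write:1, mem_to_reg:1; };
    struct mem_ctrl { unsigned branch:1, mem_read:1, mem_write:1; };
    struct ex_ctrl  { unsigned reg_dst:1, alu_src:1, alu_op:2; };

    /* Each pipeline register keeps only the groups its downstream stages need. */
    struct id_ex_reg  { struct ex_ctrl ex; struct mem_ctrl m; struct wb_ctrl wb; /* + data */ };
    struct ex_mem_reg { struct mem_ctrl m; struct wb_ctrl wb;                    /* + data */ };
    struct mem_wb_reg { struct wb_ctrl wb;                                       /* + data */ };

    /* On each clock edge the surviving control bits advance one register,
       exactly like the data fields they travel with.                       */
    void advance_control(struct id_ex_reg *id_ex, struct ex_mem_reg *ex_mem,
                         struct mem_wb_reg *mem_wb) {
        mem_wb->wb = ex_mem->wb;   /* MEM/WB keeps only write-back control    */
        ex_mem->m  = id_ex->m;     /* EX/MEM keeps memory and write-back bits */
        ex_mem->wb = id_ex->wb;
    }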
               Visualizing Pipelining

[Figure: successive instructions move through Ifetch, Reg, ALU, DMem, Reg across clock cycles 1-7; each cycle a new instruction enters the pipeline while the earlier ones advance one stage.]
  It's Not That Easy for Computers


• Limits to pipelining: Hazards prevent next
  instruction from executing during its designated
  clock cycle
   – Structural hazards: HW cannot support this combination of
     instructions (single person to fold and put clothes away)
   – Data hazards: Instruction depends on result of prior
     instruction still in the pipeline (missing sock)
   – Control hazards: Caused by delay between the fetching of
     instructions and decisions about changes in control flow
     (branches and jumps).
    Example: One Memory Port/Structural
                  Hazard
[Figure: a Load followed by Instr 1-4, one per cycle; in cycle 4 the Load's data-memory access (DMem) and Instr 3's instruction fetch (Ifetch) both need the single memory port at the same time — a structural hazard.]
                Structural Hazard
     Resolving structural hazards

• Definition: an attempt to use the same hardware for two different things at the same time
• Solution 1: Wait
   – must detect the hazard
   – must have a mechanism to stall
• Solution 2: Throw more hardware at the problem
    Detecting and Resolving Structural Hazard

[Figure: the same sequence with the hazard handled — Instr 3's fetch is delayed one cycle and a bubble flows down the pipeline, so the Load's data-memory access and Instr 3's instruction fetch no longer use the memory port in the same cycle.]
   Eliminating Structural Hazards at Design Time



[Figure: the pipelined datapath redesigned with separate instruction and data caches — an Instr Cache in the fetch stage and a Data Cache in the memory stage — so instruction fetch and data access never compete for a single memory port; the datapath and the control path are shown separately.]
         Resolving structural hazard in
                    memory
  • In this case, why provide separate instruction and
    data caches?
     – Why not provide two ports to one cache?
  • It is a possible solution
     – But an expensive one
  • For example (from the CACTI 3.2 tool, at 0.18 micron):

    Cache size   Ports     Access time (ns)   Power (nJ)   Area (sq. cm)
    16KB         1r, 1w         0.8238          0.5067        0.0127
    16KB         2r, 1w         0.8577          1.0150        0.0212
    32KB         1r, 1w         0.8864          0.5969        0.0244
    32KB         2r, 1w         1.1718          1.1559        0.0401

  • For separate instruction and data caches
     – There are other important reasons too
     – We will study those in next chapter
  Role of Instruction Set Design in
    Structural Hazard Resolution

• Simple to determine the sequence of
  resources used by an instruction
   – opcode tells it all
• Uniformity in the resource usage
• Compare MIPS to IA32?
• MIPS approach => all instructions flow
  through same 5-stage pipeline
                         Data Hazards
[Figure: stages IF, ID/RF, EX, MEM, WB; add r1,r2,r3 is followed by sub r4,r1,r3, and r6,r1,r7, or r8,r1,r9, and xor r10,r1,r11 — each of the following instructions wants to read r1 before the add has written it back.]
        Three Generic Data Hazards

• Read After Write (RAW)
  InstrJ tries to read operand before InstrI writes it


           I: add r1,r2,r3
           J: sub r4,r1,r3

• Caused by a “Data Dependence” (in compiler
  nomenclature). This hazard results from an actual
  need for communication.
       Three Generic Data Hazards

• Write After Read (WAR)
  InstrJ writes operand before InstrI reads it
            I: sub r4,r1,r3
            J: add r1,r2,r3
            K: mul r6,r1,r7
• Called an “anti-dependence” by compiler writers.
  This results from reuse of the name “r1”.

• Can’t happen in MIPS 5 stage pipeline because:
   – All instructions take 5 stages, and
   – Reads are always in stage 2, and
   – Writes are always in stage 5
          Three Generic Data Hazards
• Write After Write (WAW)
  InstrJ writes operand before InstrI writes it.

               I: sub r1,r4,r3
               J: add r1,r2,r3
               K: mul r6,r1,r7
• Called an "output dependence" by compiler writers.
  This also results from the reuse of the name "r1".
• Can’t happen in MIPS 5 stage pipeline because:
   – All instructions take 5 stages, and
   – Writes are always in stage 5
• WAR and WAW are hazards in more complicated
  pipes like wide-issue processors
          Forwarding to Avoid Data Hazard
[Figure: the same add / sub / and / or / xor sequence, but the add's ALU result is forwarded from the EX/MEM and MEM/WB pipeline registers straight to the ALU inputs of the following instructions, so they execute without stalling.]
                   HW Change for Forwarding


[Figure: multiplexers are added in front of each ALU input so an operand can come from the register file (via ID/EX), from the EX/MEM register, or from the MEM/WB register; the immediate path, NextPC, and data memory are otherwise unchanged.]
Forwarding
[Figure: the pipelined datapath with a forwarding unit that compares EX/MEM.RegisterRd and MEM/WB.RegisterRd against the source register fields (Rs, Rt) of the instruction in EX and drives the ALU-input multiplexers accordingly.]
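The forwarding decision reduces to comparing the destination registers held in EX/MEM and MEM/WB against the source registers of the instruction now in EX. A hedged C sketch of the usual conditions for one ALU input (the same logic with the Rt field gives the other input; signal names follow the figure):

    /* Forwarding select for one ALU input:
       0 = value read from the register file, 2 = forward from EX/MEM, 1 = forward from MEM/WB. */
    unsigned forward_a(unsigned ex_mem_reg_write, unsigned ex_mem_rd,
                       unsigned mem_wb_reg_write, unsigned mem_wb_rd,
                       unsigned id_ex_rs) {
        if (ex_mem_reg_write && ex_mem_rd != 0 && ex_mem_rd == id_ex_rs)
            return 2;   /* most recent result, sitting in EX/MEM        */
        if (mem_wb_reg_write && mem_wb_rd != 0 && mem_wb_rd == id_ex_rs)
            return 1;   /* older result (or loaded value) in MEM/WB     */
        return 0;       /* no hazard on this operand                    */
    }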
          Data Hazard Even with Forwarding

[Figure: lw r1, 0(r2) is immediately followed by sub r4,r1,r6, and r6,r1,r7, and or r8,r1,r9; the loaded value is not available until the end of MEM, so even forwarding cannot deliver it to the sub's EX stage in the very next cycle.]
       Resolving this load hazard


• Adding hardware?
• Detection?
• Compilation techniques?

• What is the cost of load delays?
           Resolving the Load Data Hazard

[Figure: the same sequence with a one-cycle bubble inserted after the lw; sub, and, and or each slip one cycle, after which the loaded value can be forwarded to the sub's ALU input.]
Hazard Detection Unit
[Figure: a hazard detection unit in the ID stage examines ID/EX.MemRead and compares ID/EX.RegisterRt with IF/ID.RegisterRs and IF/ID.RegisterRt; on a load-use hazard it deasserts PCWrite and IF/IDWrite and forces the control signals entering ID/EX to 0, inserting a bubble.]
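The stall condition itself is one line of logic: the instruction in EX is a load and its destination matches a source of the instruction in ID. A minimal C sketch (signal names follow the figure):

    /* Return 1 when the pipeline must stall: freeze PC and IF/ID and
       zero the control bits entering ID/EX to insert a bubble.        */
    int load_use_stall(unsigned id_ex_mem_read, unsigned id_ex_rt,
                       unsigned if_id_rs, unsigned if_id_rt) {
        return id_ex_mem_read &&
               (id_ex_rt == if_id_rs || id_ex_rt == if_id_rt);
    }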
Software Scheduling to Avoid Load
            Hazards

Try producing fast code for
      a = b + c;
      d = e - f;
assuming a, b, c, d, e, and f are in memory.
Slow code:              Fast code:
       LW    Rb,b               LW    Rb,b
       LW    Rc,c               LW    Rc,c
       ADD   Ra,Rb,Rc           LW    Re,e
       SW    a,Ra               ADD   Ra,Rb,Rc
       LW    Re,e               LW    Rf,f
       LW    Rf,f               SW    a,Ra
       SUB   Rd,Re,Rf           SUB   Rd,Re,Rf
       SW    d,Rd               SW    d,Rd
            False data dependencies
• Write After Read (WAR)
  InstrJ writes operand before InstrI reads it

              I: sub r4,r1,r3
              J: add r1,r2,r3
              K: mul r6,r1,r7
• Write After Write (WAW)
  InstrJ writes operand before InstrI writes it.


               I: sub r1,r4,r3
               J: add r1,r2,r3
               K: mul r6,r1,r7

• Can’t happen in MIPS 5 stage pipeline
• WAR and WAW are hazards in more complicated pipes like
  wide-issue processors
                     So far
• Pipeline performance is degraded by
   – structural, data, and control hazards
• Resolving structural hazards
   – Stall, if the resource must be shared
   – Add resources, if the conflict is time critical
• Resolving RAW data hazards
   – Handle by forwarding data from the pipeline registers
      » Forwarding unit
   – Handle the load data hazard by stalling one cycle
      » Hazard detection unit
• WAW and WAR hazards
   – Cannot occur in the MIPS 5-stage pipeline; they reappear in more complex pipelines
           Control Hazard on Branches
              => Three Stage Stall


[Figure: 10: beq r1,r3,36 is followed into the pipeline by 14: and r2,r3,r5, 18: or r6,r1,r7, and 22: add r8,r1,r9 before the branch outcome is known; if the branch is taken, those three instructions must be squashed and fetch redirected to 36: xor r10,r1,r11 — a three-cycle stall.]
[Figure: the pipelined datapath with control shown again — the branch-target adder and the ALU's Zero test sit deep in the pipeline, so both the taken/not-taken decision and the target address arrive several cycles after the branch is fetched.]
                 Branch Stall Impact


• If 30% of instructions are branches, a 3-cycle stall has a significant impact
• Two-part solution:
   – Determine whether the branch is taken or not sooner, AND
   – Compute the taken-branch target address earlier
• A MIPS branch tests whether a register = 0 or ≠ 0
• MIPS solution:
   – Move the Zero test to the ID/RF stage
   – Add an adder to calculate the new PC in the ID/RF stage
   – 1 clock cycle penalty for a branch versus 3
                         Pipelined MIPS Datapath
          Instruction         Instr. Decode                        Execute              Memory          Write
            Fetch              Reg. Fetch                         Addr. Calc            Access          Back
[Figure: the revised pipelined datapath — an extra adder and the Zero test are moved into the Instr. Decode / Reg. Fetch stage, so both the branch target and the branch outcome are known at the end of ID and the branch penalty falls to one cycle.]
               Handling Branch Hazard

#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
   –   Execute successor instructions in sequence
   –   “Squash” instructions in pipeline if branch actually taken
   –   47% MIPS branches not taken on average
   –   PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
   – 53% MIPS branches taken on average
   – But haven’t calculated branch target address in MIPS
       » MIPS still incurs 1 cycle branch penalty
       » Other machines: branch target known before outcome
                      Branch Prediction




• Techniques for branch prediction:
   – a whole subject of study in itself
• A great deal of research is still going on in this area
  Branch Hazard: Fourth Alternative

#4: Delayed Branch
  – Define branch to take place AFTER a following instruction

    branch instruction
      sequential successor1
      sequential successor2
      ........                         Branch delay of length n
      sequential successorn
   ........
    branch target if taken

  – 1 slot delay allows proper decision and branch target
    address in 5 stage pipeline
  – MIPS uses this
                    Delayed Branch
• Where to get instructions to fill branch delay
  slot?
   – Before branch instruction
   – From the target address: only valuable when branch taken
   – From fall through: only valuable when branch not taken
• Compiler effectiveness for single branch delay
  slot:
   – Fills about 60% of branch delay slots
   – About 80% of instructions executed in branch delay slots
     useful in computation
   – About 50% (60% x 80%) of slots usefully filled
• Delayed branch downside: losing popularity with deeper pipelines, wide-issue processors, etc.
                       Delayed Branch
a. From before:
      add $s1, $s2, $s3
      if $s2 = 0 then
           (delay slot)
   Becomes:
      if $s2 = 0 then
           add $s1, $s2, $s3

b. From target:
      sub $t4, $t5, $t6
      …
      add $s1, $s2, $s3
      if $s1 = 0 then
           (delay slot)
   Becomes:
      add $s1, $s2, $s3
      if $s1 = 0 then
           sub $t4, $t5, $t6

c. From fall through:
      add $s1, $s2, $s3
      if $s1 = 0 then
           (delay slot)
      sub $t4, $t5, $t6
   Becomes:
      add $s1, $s2, $s3
      if $s1 = 0 then
           sub $t4, $t5, $t6
           Delayed Branch: Example
• Actual code sequence       • Modified code
  –   36 sub $10, $4, $8        –   36 beqd $1, $3, 8
  –   40 beq $1, $3, 7          –   40 sub $10, $4, $8
                                –   44 and $12, $2, $5
  –   44 and $12, $2, $5
                                –   48 or $13, $2, $6
  –   48 or $13, $2, $6
                                –   52 add $14, $4, $2
  –   52 add $14, $4, $2
                                –   56 slt $15, $6, $7
  –   56 slt $15, $6, $7        –   ………….
  –   ………….                     –   72 lw $4, 50($7)
  –   72 lw $4, 50($7)


• How do we modify using beqd instruction?
Pipeline with multicycle operations
      Pipeline w/ multicycle operations
• Divide unit is not fully pipelined
   – Structural hazards between instructions seeking same unit
       » Need instruction stalling
• Instructions have varying running times
   – Contention for multiple register writes in same cycle
• Instructions no longer reach WB stage in order
   – WAW hazards are now a problem
• Instructions complete in different order
   – Problems with exceptions
• Longer latency operations
   – Stalls due to RAW hazards are more frequent
     Pipeline w/ multicycle operations
• Many of these problems can be handled using
  – Dynamic scheduling techniques
      » Like scoreboard in CDC 6600
  – More detailed study in Chapter 2
ILP and WAW, WAR hazards

  A: lr1 ← lr2 + lr3
  B: lr2 ← lr4 + lr5
  C: lr6 ← lr1 + lr3
  D: lr6 ← lr1 + lr2

• For wide-issue processors, the more instructions executed each cycle the better, for higher throughput (IPC)
• But ILP is limited by WAR and WAW hazards

  RAR: lr3    RAW: lr1    WAR: lr2    WAW: lr6
  Issue groups: A ; BC ; D
                      Renaming
• A rename stage is added typically in a
  modern pipeline
  – After decode stage and before execution
• Number of physical registers larger than
  logical registers
  – Rename stage maintains a list of free physical registers
• For every instruction fetched and decoded,
  – Logical destination register is renamed with a new
    physical register
• Source operands in following instructions
  that refer same logical register are
  renamed accordingly
Renaming

  A: lr1 ← lr2 + lr3      pr7  ← pr2 + pr3
  B: lr2 ← lr4 + lr5      pr8  ← pr4 + pr5
  C: lr6 ← lr1 + lr3      pr9  ← pr7 + pr3
  D: lr6 ← lr1 + lr2      pr10 ← pr7 + pr8

  RAR: lr3 → pr3    RAW: lr1 → pr7    WAR: eliminated    WAW: eliminated
  Issue groups: A ; BC ; D  →  AB ; CD
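A minimal C sketch of the rename step with a map table and a free list; the register-file sizes, helper names, and the particular physical register numbers it prints are illustrative (they will not match the pr7-pr10 numbering above), but the dependences it removes are the same:

    #include <stdio.h>

    #define NLOG  32   /* logical (architectural) registers      */
    #define NPHYS 64   /* physical registers, more than logical  */

    static int map[NLOG];                /* current logical -> physical mapping */
    static int freelist[NPHYS], nfree;   /* stack of free physical registers    */

    /* Rename one two-source instruction: sources use the current mapping,
       the destination always gets a fresh physical register.               */
    static void rename_inst(int dst, int src1, int src2) {
        int p1 = map[src1], p2 = map[src2];
        int pd = freelist[--nfree];       /* assume a free register exists */
        map[dst] = pd;
        printf("pr%d <- pr%d + pr%d\n", pd, p1, p2);
    }

    int main(void) {
        for (int i = 0; i < NLOG; i++) map[i] = i;                 /* lr_i starts in pr_i */
        for (int i = NLOG; i < NPHYS; i++) freelist[nfree++] = i;  /* the rest are free   */

        rename_inst(1, 2, 3);   /* A: lr1 <- lr2 + lr3                               */
        rename_inst(2, 4, 5);   /* B: lr2 <- lr4 + lr5                               */
        rename_inst(6, 1, 3);   /* C: lr6 <- lr1 + lr3 (reads A's new physical reg)  */
        rename_inst(6, 1, 2);   /* D: lr6 <- lr1 + lr2 (WAR and WAW disappear)       */
        return 0;
    }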
Superscalar Pipeline
[Figure: the I-Cache and PC feed an instruction fetch queue (IFQ) under the guidance of the branch predictor (BPred) and BTB; instructions pass through the rename table (with checkpoints) into the issue queue, execute on multiple functional units (FUs) reading the register file, access memory through the LSQ and D-Cache, and retire through the ROB.]
Alpha 21264 Processor Pipeline
Rename hardware logic
                        Renaming
• For every instruction, a logical register is
  mapped to a physical register
  – What happens when no free physical register is available?
• Why we run out of free physical registers
  – Limited size of physical register file
      » Around 64 to 128 physical registers in current
        processors
  – Why not have a very large physical register file?
• So, a physical register Pi is recycled
  – First, allocated to a logical destination register Lk of
    instruction Ij
  – Next, value written into Pi when Ij is executed
  – Pi is read when following dependent instructions are executed
  – Pi is freed after a while and sent to the free list
       » When exactly?
                       Renaming
• So, a physical register Pi is recycled
  – First, allocated to a logical destination register Lk of
    instruction Ij
  – Next, value written into Pi when Ij is executed
  – Pi is read when following dependent instructions are executed
  – Pi is freed after a while and sent to the free list
       » When exactly?
  – In general Pi is freed when an instruction Ij+c is retired
    (i.e., completed and is out of pipeline)
       » Ij+c has a same logical destination register as Ij
       » That guarantees that the mapping of Pi  Lk is no
         longer required
Superscalar Pipeline (shown again)
[Figure: the same superscalar pipeline as above, repeated ahead of the discussion of the rename hardware.]
Rename hardware logic
    Recall: Speedup Equation for Pipelining

  CPI pipelined = Ideal CPI + Average stall cycles per instruction

  Speedup = [Ideal CPI / (Ideal CPI + Pipeline stall CPI)] × [Cycle Time unpipelined / Cycle Time pipelined]

For the simple RISC pipeline, Ideal CPI = 1:

  Speedup = [1 / (1 + Pipeline stall CPI)] × [Cycle Time unpipelined / Cycle Time pipelined]
              Example: Evaluating Branch Alternatives

  Pipeline speedup = Pipeline depth / (1 + Branch frequency × Branch penalty)

Assume:
  Conditional & unconditional branches = 14% of instructions, of which 65% change the PC (are taken)

  Scheduling scheme      Branch penalty    CPI    Speedup vs. stall
  Stall pipeline               3           1.42         1.0
  Predict taken                1           1.14         1.25
  Predict not taken            1           1.09         1.30
  Delayed branch              0.5          1.07         1.33
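The CPI column follows from CPI = 1 + branch frequency × effective penalty, where for "predict not taken" only the 65% of branches that are taken pay the one-cycle penalty. A small C check of the arithmetic (a sketch of the calculation, not code from the lecture):

    #include <stdio.h>

    int main(void) {
        double f     = 0.14;   /* branch frequency           */
        double taken = 0.65;   /* fraction of branches taken */

        double cpi_stall   = 1.0 + f * 3.0;           /* 1.42 */
        double cpi_taken   = 1.0 + f * 1.0;           /* 1.14 */
        double cpi_nottkn  = 1.0 + f * taken * 1.0;   /* 1.09 */
        double cpi_delayed = 1.0 + f * 0.5;           /* 1.07 */

        printf("stall %.2f | taken %.2f (x%.2f) | not taken %.2f (x%.2f) | delayed %.2f (x%.2f)\n",
               cpi_stall,
               cpi_taken,   cpi_stall / cpi_taken,
               cpi_nottkn,  cpi_stall / cpi_nottkn,
               cpi_delayed, cpi_stall / cpi_delayed);
        return 0;
    }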
                         Summary:
                Pipelining & Performance
• Just overlap tasks; easy if tasks are independent
• Speedup ≤ Pipeline depth; if ideal CPI is 1, then:

  Speedup = Pipeline depth / (1 + Pipeline stall CPI),   where Pipeline depth = Cycle Time unpipelined / Cycle Time pipelined

• Hazards limit performance on computers:
   – Structural: need more HW resources
   – Data (RAW, WAR, WAW): need forwarding, compiler scheduling
   – Control: delayed branch, prediction
• Time is the measure of performance: latency or throughput
• CPI Law:
  CPU time = Seconds / Program = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)
                                    Design and performance of pipelining

• Pipelined processors are not easy to design
• Technology affects the implementation
• Instruction set design affects performance (e.g., beq, bne)
• More stages do not necessarily lead to higher performance

[Figure: relative performance (0 to 3.0) plotted against pipeline depth of 1, 2, 4, 8, and 16 stages, illustrating that ever-deeper pipelines stop improving performance.]
      Appendix B

Instruction Set Principles
The Instruction Set: a Critical Interface


 software



                instruction set



 hardware
Instruction Set Architecture
... the attributes of a [computing] system as seen by the
programmer, i.e., the conceptual structure and functional
behavior, as distinct from the organization of the data flows
and controls, the logic design, and the physical implementation.
      – Amdahl, Blaauw, and Brooks, 1964
-- Organization of Programmable
   Storage
-- Data Types & Data Structures:
     Encodings & Representations
-- Instruction Formats
-- Instruction (or Operation Code) Set
-- Modes of Addressing and Accessing Data Items and Instructions
-- Exceptional Conditions
Organization                               (Logic Designer's View: ISA Level → FUs & Interconnect)
• Capabilities & performance characteristics of the principal functional units
   – (e.g., Registers, ALU, Shifters, Logic Units, ...)
• Ways in which these components are interconnected
• Information flows between components
• Logic and means by which such information flow is controlled
• Choreography of the FUs to realize the ISA
• Register Transfer Level (RTL) description
        Levels of Representation

High Level Language Program:      temp = v[k];
                                  v[k] = v[k+1];
                                  v[k+1] = temp;
        (Compiler)
Assembly Language Program:        lw  $15, 0($2)
                                  lw  $16, 4($2)
                                  sw  $16, 0($2)
                                  sw  $15, 4($2)
        (Assembler)
Machine Language Program:         0000 1001 1100 0110 1010 1111 0101 1000
                                  1010 1111 0101 1000 0000 1001 1100 0110
                                  1100 0110 1010 1111 0101 1000 0000 1001
                                  0101 1000 0000 1001 1100 0110 1010 1111
        (Machine Interpretation)
Control Signal Specification
       Stored program concept & Instructions
• Instructions are bits
• Programs are stored in memory
   – to be read or written just like data
   [Figure: a processor attached to a memory that holds an accounting program (machine code), an editor program (machine code), a C compiler (machine code), payroll data, book text, and the C source code for the editor program.]
• Fetch & Execute
   – Instructions are fetched and put into a special register
   – Bits in the register "control" the subsequent actions
   – Fetch the next instruction and continue
• The language of the machine
• More primitive than higher-level languages
   – e.g., no sophisticated control flow
• Very restrictive
   – e.g., MIPS arithmetic instructions
• First, we'll be looking at the MIPS instruction set architecture
   – similar to other architectures developed since the 1980's
   – used by NEC, Nintendo, Silicon Graphics, Sony
• Design goals: maximize performance and minimize cost, reduce design time
Instruction Set Classes
           Characteristics of Instruction Set

• Complete
   – Can be used for a variety of applications
• Efficient
   – Useful in code generation
• Compatible
   – Programs written for previous versions of the machine should still run
• Primitive
   – Basic operations
• Simple
   – Easy to implement
• Small
   – Small implementation
           Architecture Specification
• Data types:
   – Bit, byte, bit field, signed/unsigned integers, logical,
     floating point, character
• Operations:
   – Data movement, arithmetic, logical, shift/rotate,
     conversion, input/output, control, and system calls
• # of operands:
   – 3, 2, 1, or 0 operands
• Registers:
   – Integer, floating point, control
• Instruction representation as bit strings
            Example of multiple operands
• Instructions may have 3, 2, 1, or 0 operands
• Number of operands may affect instruction length
• Operand order is fixed (destination first by convention, though it need
  not be that way)

  add $s0, $s1, $s2   : Add $s2 and $s1 and store result in $s0
  add $s0, $s1        : Add $s1 and $s0 and store result in $s0
  add $s0              : Add contents of a fixed location to $s0
  add                  : Add two fixed locations and store result
                       MIPS arithmetic
• All instructions have 3 operands
• Operand order is fixed (destination first). Example:
       C code:     A = B + C
      MIPS code: add $s0, $s1, $s2
  (associated with variables by compiler)
• Design Principle: simplicity favors regularity. Why?
• Of course this complicates some things...
      C code:      A = B + C + D;
                   E = F - A;
      MIPS code: add $t0, $s1, $s2
                   add $s0, $t0, $s3
                   sub $s4, $s5, $s0
• Operands must be registers
   – only 32 registers provided
• Design Principle: smaller is faster. Why?
                  Registers vs. Memory
• Arithmetic instructions operands must be registers,
      — only 32 registers provided
• Compiler associates variables with registers
• What about programs with lots of variables?




            Control             Input
                       Memory
           Datapath             Output

           Processor            I/O
              Memory Organization
• Viewed as a large, single-dimension array, with an address
• A memory address is an index into the array
• "Byte addressing" means that the index points to a byte of memory
• Bytes are nice, but most data items use larger "words"
• For MIPS, a word is 32 bits or 4 bytes
• 2^32 bytes with byte addresses from 0 to 2^32 - 1
• 2^30 words with byte addresses 0, 4, 8, ..., 2^32 - 4
• Words are aligned
      i.e., what are the 2 least significant bits of a word address?
[Figure: byte view (one 8-bit byte at each address 0, 1, 2, ...) vs. word view
 (one 32-bit word at each address 0, 4, 8, 12, ...)]
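A small C sketch of the byte-vs-word arithmetic above (illustrative only): word i of a MIPS-style memory starts at byte address 4*i, and an aligned word address has its two least-significant bits equal to 0.

    #include <stdio.h>

    int main(void) {
        for (unsigned word = 0; word < 4; word++) {
            unsigned byte_addr = word * 4;            /* word i starts at byte 4*i */
            int aligned = (byte_addr & 0x3) == 0;     /* low 2 bits must be 0      */
            printf("word %u -> byte address %u (aligned: %d)\n",
                   word, byte_addr, aligned);
        }
        return 0;
    }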
              Addressing within a word
• Each word has four bytes
• Which byte is first and which is last?
• Two choices
  – Least significant byte is byte “0” => Little Endian
  – Most significant byte is byte “0” => Big Endian

[Figure: the four bytes of each word at addresses 0, 4, 8, 12, ... are numbered
 3-2-1-0 under little endian and 0-1-2-3 under big endian]
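A common way to see which convention a host machine uses is to inspect the first byte of a known 32-bit value; a minimal C sketch (the output depends on the machine it runs on, which is the point):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t word = 0x01020304;
        uint8_t *bytes = (uint8_t *)&word;   /* view the word as 4 bytes */

        /* Little endian: byte 0 holds the least significant byte (0x04).
           Big endian:    byte 0 holds the most significant byte  (0x01). */
        if (bytes[0] == 0x04)
            printf("little endian\n");
        else
            printf("big endian\n");
        return 0;
    }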
               Instructions

• Load and store instructions
• Example:
     C code:         A[8] = h + A[8];

    MIPS code:     lw $t0, 32($s3)
                   add $t0, $s2, $t0
                   sw $t0, 32($s3)

• Store word has destination last
• Remember arithmetic operands are registers,
  not memory!
                     Addressing modes
• Memory address for load and store has two parts
   – A register whose content is known
   – An offset stored in 16 bits
• The offset can be positive or negative
   – It is written in terms of number of bytes
   – It is stored in the instruction as a 16-bit signed byte offset
   – e.g., the 32-byte offset in lw $t0, 32($s3) is stored as 32
• Address is content of register + offset
• All addresses have both these components
• If no register needs to be used, then register 0 is
  used
   – Register 0 always stores value 0
• If no offset, then offset is 0
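The base + offset calculation for lw/sw can be summarized in a short C sketch; the register contents and offset below are made-up example values.

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t base   = 0x10008000;  /* contents of the base register, e.g. $s3        */
        int16_t  offset = 32;          /* 16-bit signed byte offset from the instruction */

        /* effective address = register contents + sign-extended offset */
        uint32_t ea = base + (int32_t)offset;
        printf("effective address = 0x%08x\n", (unsigned)ea);
        return 0;
    }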
             So far we’ve learned:
• MIPS
     — loading words, but addressing bytes
     — arithmetic on registers only
• Instruction             Meaning

 add $s1, $s2, $s3   $s1 = $s2 + $s3
 sub $s1, $s2, $s3   $s1 = $s2 – $s3
 lw $s1, 100($s2) $s1=Memory[$s2+100]
 sw $s1, 100($s2) Memory[$s2+100]=$s1
                  Machine Language

• Instructions, like registers and data,
  are also 32 bits long
  – Example: add $t0, $s1, $s2
  – registers have numbers: $t0=8, $s1=17, $s2=18
• Instruction Format:
     000000 10001 10010   01000   00000   100000

       op    rs     rt      rd    shamt   funct



• Can you guess what the field names stand for?
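A minimal C sketch of packing the R-type fields into one 32-bit word for the add $t0, $s1, $s2 example, using the 6/5/5/5/5/6 field widths shown above and the register numbers from the policy-of-use table later in these slides ($t0 = 8, $s1 = 17, $s2 = 18); it is an illustration, not an assembler.

    #include <stdio.h>
    #include <stdint.h>

    /* Pack an R-type MIPS instruction: op | rs | rt | rd | shamt | funct */
    static uint32_t encode_rtype(unsigned op, unsigned rs, unsigned rt,
                                 unsigned rd, unsigned shamt, unsigned funct) {
        return (op << 26) | (rs << 21) | (rt << 16) |
               (rd << 11) | (shamt << 6) | funct;
    }

    int main(void) {
        /* add $t0, $s1, $s2 -> op=0, rs=17, rt=18, rd=8, shamt=0, funct=32 */
        uint32_t word = encode_rtype(0, 17, 18, 8, 0, 32);
        printf("0x%08x\n", (unsigned)word);   /* prints 0x02324020 */
        return 0;
    }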
                       Machine Language
• Consider the load-word and store-word instructions,
   – What would the regularity principle have us do?
   – New principle: Good design demands a compromise
• Introduce a new type of instruction format
   – I-type for data transfer instructions
   – other format was R-type for register
• Example: lw $t0, 32($s2)


          35      18       8           32
          op      rs       rt      16 bit number

• Where's the compromise?
                                 Control
• Decision making instructions
   – alter the control flow,
   – i.e., change the "next" instruction to be executed
• MIPS conditional branch instructions:
      bne $t0, $t1, Label
      beq $t0, $t1, Label
• Example: if (i==j) h = i + j;
            bne $s0, $s1, Label
            add $s3, $s0, $s1
      Label:        ....
• MIPS unconditional branch instructions:
      j label
• Example:
  f, g, and h are in registers $s3, $s4, and $s5
  if (g!=h)                 beq $s4, $s5, Lab1
      f=g-h;                sub $s3, $s4, $s5
  else                      j Lab2
      f=g+h;                Lab1:add $s3, $s4,$s5
                            Lab2:...
• Can you build a simple for-loop?
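Regarding the question above, here is a hedged C sketch of the for-loop pattern, with comments indicating one way a compiler might lower it to the MIPS-style branches just introduced (the register assignments in the comments are assumptions for illustration):

    #include <stdio.h>

    int main(void) {
        int sum = 0;                    /* say, kept in $s0                   */
        for (int i = 0; i < 10; i++) {  /* i in $t0, limit 10 in $t2          */
            /* Loop: beq  $t0, $t2, Exit   # leave the loop when i == 10
                     ...  loop body ...
                     addi $t0, $t0, 1
                     j    Loop
               Exit:                                                          */
            sum += i;
        }
        printf("%d\n", sum);            /* 45 */
        return 0;
    }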
                        So far:


•   Instruction        Meaning

    add $s1,$s2,$s3    $s1 = $s2 + $s3
    sub $s1,$s2,$s3    $s1 = $s2 – $s3
    lw $s1,100($s2)    $s1 = Memory[$s2+100]
    sw $s1,100($s2)    Memory[$s2+100] = $s1
    bne $s4,$s5,L      Next inst is at Label if $s4 != $s5
    beq $s4,$s5,L      Next inst is at Label if $s4 = $s5
    j Label              Next instr. is at Label
•   Formats:


    R      op     rs       rt     rd    shamt   funct
    I      op     rs       rt     16 bit address
    J      op              26 bit address
                          Control Flow

• We have: beq, bne, what about Branch-if-less-than?
  – if   $s1 < $s2 then
         $t0 = 1
    else
        $t0 = 0
• New instruction:
  – slt $t0, $s1, $s2
• Can use this instruction to build "blt $s1,$s2,Label"
     — can now build general control structures
• Note that the assembler needs a register to do this,
     — there are policy of use conventions for registers
Policy of Use Conventions

 Name Register number                       Usage
$zero         0         the constant value 0
$v0-$v1      2-3        values for results and expression evaluation
$a0-$a3      4-7        arguments
$t0-$t7     8-15        temporaries
$s0-$s7    16-23        saved
$t8-$t9    24-25        more temporaries
$gp          28         global pointer
$sp          29         stack pointer
$fp          30         frame pointer
$ra          31         return address
                     Constants
• Small constants are used quite frequently
  (50% of operands)
      e.g., A = A + 5;
            B = B + 1;
            C = C - 18;
• Solutions? Why not?
   – put 'typical constants' in memory and load them.
   – create hard-wired registers (like $zero) for constants
     like one.
• MIPS Instructions:
     addi $29, $29, 4
     slti $8, $18, 10
     andi $29, $29, 6
     ori $29, $29, 4
• How do we make this work?
                 How about larger constants?


• We'd like to be able to load a 32-bit constant into a register
• Must use two instructions; new "load upper immediate" instruction

        lui $t0, 1010101010101010          (lower 16 bits filled with zeros)

        1010101010101010 0000000000000000

• Then must get the lower order bits right, i.e.,

        ori $t0, $t0, 1010101010101010

        1010101010101010 0000000000000000
   ori  0000000000000000 1010101010101010
        ---------------------------------
        1010101010101010 1010101010101010
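The same two-step construction can be checked with a short C sketch; the 16-bit halves are the example bit pattern above, and the code only illustrates the shift-and-OR arithmetic, not MIPS semantics.

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint16_t upper = 0xAAAA;   /* 1010101010101010 */
        uint16_t lower = 0xAAAA;   /* 1010101010101010 */

        /* lui: place 'upper' in the top 16 bits, low 16 bits filled with zeros */
        uint32_t reg = (uint32_t)upper << 16;
        /* ori: OR in the low-order 16 bits */
        reg |= lower;

        printf("0x%08x\n", (unsigned)reg);   /* 0xaaaaaaaa */
        return 0;
    }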
                       Other Issues

• Some other issues
  –   support for procedures
  –   linkers, loaders, memory layout
  –   stacks, frames, recursion
  –   manipulating strings and pointers
  –   Interrupts, exceptions, system calls and conventions
• More details on these can be found in reference
  book
• We've focused on architectural issues
  – basics of MIPS assembly language and machine code
                 Overview of MIPS
• simple instructions all 32 bits wide
• very structured, no unnecessary baggage
• only three instruction formats


    R     op       rs     rt      rd        shamt   funct
    I     op      rs      rt      16 bit address

    J     op              26 bit address

• rely on compiler to achieve performance
       — what are the compiler's goals?
• help compiler where we can
      Addresses in Branches and Jumps


• Instructions:
      bne $t4,$t5,Label             Next instruction is at Label if $t4 != $t5
      beq $t4,$t5,Label             Next instruction is at Label if $t4 = $t5
      j Label                       Next instruction is at Label

• Formats:

I      op         rs        rt         16 bit address

J      op                   26 bit address


• Addresses are not 32 bits
    – How do we handle this?
    – Similarly to load and store instructions?
Various addressing modes
1. Immediate addressing:    op | rs | rt | Immediate
      – the operand is a constant held in the instruction itself
2. Register addressing:     op | rs | rt | rd | ... | funct
      – the operands are registers
3. Base addressing:         op | rs | rt | Address
      – the operand is in memory at (contents of register rs) + Address
        (byte, halfword, or word)
4. PC-relative addressing:  op | rs | rt | Address
      – the branch target word is at PC + Address
5. Pseudodirect addressing: op | Address
      – the jump target is the 26-bit Address combined with the upper bits of the PC
  To summarize:                     MIPS assembly language

Category        Instruction            Example              Meaning                          Comments
Arithmetic      add                    add $s1, $s2, $s3    $s1 = $s2 + $s3                  Three operands; data in registers
                subtract               sub $s1, $s2, $s3    $s1 = $s2 - $s3                  Three operands; data in registers
                add immediate          addi $s1, $s2, 100   $s1 = $s2 + 100                  Used to add constants
Data transfer   load word              lw $s1, 100($s2)     $s1 = Memory[$s2 + 100]          Word from memory to register
                store word             sw $s1, 100($s2)     Memory[$s2 + 100] = $s1          Word from register to memory
                load byte              lb $s1, 100($s2)     $s1 = Memory[$s2 + 100]          Byte from memory to register
                store byte             sb $s1, 100($s2)     Memory[$s2 + 100] = $s1          Byte from register to memory
                load upper immediate   lui $s1, 100         $s1 = 100 * 2^16                 Loads constant in upper 16 bits
Conditional     branch on equal        beq $s1, $s2, 25     if ($s1 == $s2) go to PC+4+100   Equal test; PC-relative branch
branch          branch on not equal    bne $s1, $s2, 25     if ($s1 != $s2) go to PC+4+100   Not equal test; PC-relative
                set on less than       slt $s1, $s2, $s3    if ($s2 < $s3) $s1 = 1; else 0   Compare less than; for beq, bne
                set less than imm.     slti $s1, $s2, 100   if ($s2 < 100) $s1 = 1; else 0   Compare less than constant
Uncondi-        jump                   j 2500               go to 10000                      Jump to target address
tional jump     jump register          jr $ra               go to $ra                        For switch, procedure return
                jump and link          jal 2500             $ra = PC + 4; go to 10000        For procedure call
                 Summary so far
• Instruction complexity is only one variable
   – lower instruction count vs. higher CPI / lower clock
     rate
• Design Principles:
   –   simplicity favors regularity
   –   smaller is faster
   –   good design demands compromise
   –   make the common case fast
• Instruction set architecture
   – a very important abstraction indeed!
Review: Basic ISA Classes
 Accumulator:
  1 address        add A          acc  acc + mem[A]
  1+x address      addx A         acc  acc + mem[A + x]
 Stack:
  0 address         add           tos  tos + next
 General Purpose   Register:
  2 address         add A B       EA(A)  EA(A) + EA(B)
  3 address         add A B C     EA(A)  EA(B) + EA(C)
 Load/Store:
  3 address        add Ra Rb Rc   Ra  Rb + Rc
                   load Ra Rb     Ra  mem[Rb]
                   store Ra Rb    mem[Rb]  Ra
Instruction Formats
  Variable: opcode followed by a variable number of operand specifiers
  Fixed:    a single instruction length and format for all instructions
  Hybrid:   a small number of fixed formats of different lengths
• Addressing modes
   – if each operand requires an address specifier => variable format
• Code size => favors variable-length instructions
• Performance => favors fixed-length instructions
   – simple decoding, predictable operations
• With a load/store instruction architecture, only one memory address
  and few addressing modes are needed
• => simple format, address mode given by the opcode
Cray-1: the original RISC

   Register-Register

       15        9   8        6   5         3 2        0

            Op           Rd           Rs1         R2


   Load, Store and Branch
      15         9   8        6   5         3 2        0   15          0

           Op            Rd           Rs1                  Immediate
           VAX-11: the canonical CISC
Variable format, 2 and 3 address instruction
  Byte 0      1         n             m

    OpCode   A/M       A/M          A/M




• Rich set of orthogonal address modes
    – immediate, offset, indexed, autoinc/dec, indirect,
      indirect+offset
    – applied to any operand
• Simple and complex instructions
    – synchronization instructions
    – data structure operations (queues)
    – polynomial evaluation
Review: Load/Store Architectures
° 3-address GPR
° Register-to-register arithmetic
° Load and store with simple addressing modes (reg + immediate)
° Simple conditionals:
    compare ops + branch z
    compare & branch
    condition code + branch on condition
° Simple fixed-format encoding:   op r r r   |   op r r immed   |   op offset

° Substantial increase in instructions
° Decrease in data BW (due to many registers)
° Even more significant decrease in CPI (pipelining)
° Cycle time, real estate, design time, design complexity
Appendix C and Chapter 5

   Memory Hierarchy
        The Memory Abstraction
• Association of <name, value> pairs
   – typically named as byte addresses
   – often values aligned on multiples of
     size
• Sequence of Reads and Writes
• Write binds a value to an address
• Read of addr returns most recently written
  value bound to that address
        [Memory interface: command (R/W), address (name), data in (W), data out (R), done]
                      What is a cache?
• Small, fast storage used to improve average access time to slow memory
• Exploits spatial and temporal locality
• In computer architecture, almost everything is a cache!
   –   Registers “a cache” on variables – software managed
   –   First-level cache a cache on second-level cache
   –   Second-level cache a cache on memory
   –   Memory a cache on disk (virtual memory)
   –   TLB a cache on page table
   –   Branch-prediction a cache on prediction information?


                             Proc/Regs
                             L1-Cache
                             L2-Cache          (faster toward the top,
                             Memory             bigger toward the bottom)
                             Disk, Tape, etc.
          Relationship of Caches and Pipeline
[Figure: the 5-stage pipeline datapath with the instruction cache (I-$) feeding
 instruction fetch and the data cache (D-$) accessed in the MEM stage, both backed
 by memory. The figure shows Next PC / SEQ PC selection, the PC adders, the register
 file read of RS1/RS2, the sign-extended immediate, the ALU and Zero? test, the
 pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB, and the write-back mux to RD.]
             Dual-port vs. Single-port
• Machine A: Dual ported memory
• Machine B: Single ported memory, but its pipelined implementation
  has a 1.05 times faster clock rate
• Ideal CPI = 1 for both
• Loads are 40% of instructions executed; on B each load stalls one cycle
  behind the structural hazard, so CPI(B) = 1 + 0.4 = 1.4
   – Speedup(enhancement) = Time w/o enhancement / Time w/ enhancement
   – Speedup(B) = Time(A) / Time(B)
                = (CPI(A) x CT(A)) / (CPI(B) x CT(B))
                = 1 / (1.4 x 1/1.05) = 0.75

Machine A is 1.33 times faster
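The arithmetic on this slide can be reproduced with a small C sketch; the 1.4 CPI for machine B follows from assuming every load stalls one cycle behind the structural hazard.

    #include <stdio.h>

    int main(void) {
        double cpi_a = 1.0;               /* dual-ported memory: no structural stalls */
        double cpi_b = 1.0 + 0.4 * 1.0;   /* 40% loads, 1 stall cycle each -> 1.4     */
        double ct_a  = 1.0;               /* normalized clock period of A             */
        double ct_b  = 1.0 / 1.05;        /* B's clock is 1.05x faster                */

        double speedup_b = (cpi_a * ct_a) / (cpi_b * ct_b);
        printf("speedup of B over A = %.2f\n", speedup_b);   /* ~0.75 */
        printf("A is %.2fx faster\n", 1.0 / speedup_b);      /* ~1.33 */
        return 0;
    }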
    Since 1980, CPU has outpaced DRAM ...

            Q. How do architects address this gap?
            A. Put smaller, faster "cache" memories between CPU and DRAM;
               create a "memory hierarchy".

[Figure: performance (1/latency) vs. year. CPU performance improved ~60% per year
 (2X in 1.5 yrs), DRAM ~9% per year (2X in 10 yrs); the processor-DRAM latency gap
 grew ~50% per year.]
Processor-DRAM Memory Gap (latency)
             Generations of Microprocessors
• Time of a full cache miss in instructions executed:
1st Alpha:           340 ns/5.0 ns = 68 clks x 2 or   136
2nd Alpha:           266 ns/3.3 ns = 80 clks x 4 or   320
3rd Alpha:           180 ns/1.7 ns =108 clks x 6 or   648
• (1/2)X latency x 3X clock rate x 3X instr/clock => ~5X more instructions lost per full miss
                Levels of the Memory Hierarchy

Level           Capacity           Access time / Cost                   Staging xfer unit             Managed by
CPU registers   100s of bytes      < 1 ns                               instr. operands (1-8 bytes)   program/compiler
Cache           10s-100s of KB     1-10 ns, $10/MB                      blocks (8-128 bytes)          cache controller
Main memory     MBs                100-300 ns, $1/MB                    pages (512 B - 4 KB)          OS
Disk            10s of GB          10 ms (10,000,000 ns), $0.0031/MB    files (MBs)                   user/operator
Tape            infinite           sec-min, $0.0014/MB                  -                             -

(Upper levels are smaller and faster; lower levels are larger and slower.)
            A Modern Memory Hierarchy
• By taking advantage of the principle of locality:
    – Present the user with as much memory as is available in
      the cheapest technology.
    – Provide access at the speed offered by the fastest
      technology.
• Requires servicing faults on the processor
[Figure: processor (control, datapath, registers, on-chip cache) -> second-level
 cache (SRAM) -> main memory (DRAM) -> secondary storage (disk) -> tertiary storage
 (disk/tape).
 Speed (ns): 1s -> 10s -> 100s -> 10,000,000s (10s ms) -> 10,000,000,000s (10s sec).
 Size (bytes): 100s -> Ks -> Ms -> Gs -> Ts.]
1977: DRAM faster than microprocessors

                            Apple ][ (1977)
                            CPU: 1000 ns
                            DRAM: 400 ns




[Photo: Steve Jobs and Steve Wozniak with the Apple ][]
          Memory Hierarchy: Apple iMac G5
 Managed by compiler          Managed by hardware        Managed by OS, hardware, application

              Reg      L1 Inst    L1 Data    L2       DRAM      Disk
 Size         1K       64K        32K        512K     256M      80G
 Latency      1        3          3          11       88        ~10^7     (cycles)
              0.6 ns   1.9 ns     1.9 ns     6.9 ns   55 ns     12 ms     (time)

 (iMac G5, 1.6 GHz)
Goal: Illusion of large, fast, cheap memory
Let programs address a memory space that
 scales to the disk size, at a speed that is
     usually as fast as register access
       iMac’s PowerPC 970: All caches on-chip
[Die photo: the registers (~1K) sit next to the 64 KB L1 instruction cache and
 32 KB L1 data cache, with a 512 KB L2 cache occupying a large part of the die]
        Processor-Memory Performance
                  Gap “Tax”
   Processor             % Area       %Transistors
                         (-cost)        (-power)
• Alpha 21164             37%             77%
• StrongArm SA110         61%             94%
• Pentium Pro             64%             88%
  – 2 dies per package: Proc/I$/D$ + L2$
• Caches have no “inherent value”,
  only try to close performance gap
                The Principle of Locality
• The Principle of Locality:
    – Programs access a relatively small portion of the address
      space at any instant of time.
• Two Different Types of Locality:
    – Temporal Locality (Locality in Time): If an item is
      referenced, it will tend to be referenced again soon (e.g.,
      loops, reuse)
    – Spatial Locality (Locality in Space): If an item is
      referenced, items whose addresses are close by tend to
      be referenced soon
      (e.g., straightline code, array access)
• For the last 15 years, HW has relied on locality for speed


        Locality is a property of programs which is exploited in machine design.
         Memory Hierarchy: Terminology
• Hit: data appears in some block in the upper level (example: Block X)
    – Hit Rate: the fraction of memory access found in the
      upper level
    – Hit Time: Time to access the upper level which consists of
        RAM access time + Time to determine hit/miss
• Miss: data needs to be retrieved from a block in the lower level
  (Block Y)
    – Miss Rate = 1 - (Hit Rate)
    – Miss Penalty: Time to replace a block in the upper level +
       Time to deliver the block to the processor
• Hit Time << Miss Penalty (500 instructions on 21264!)

[Figure: processor <-> upper-level memory (holding Blk X) <-> lower-level memory (holding Blk Y)]
                  Cache Measures

• Hit rate: fraction of accesses found in that level
    – So high that we usually talk about the miss rate instead
    – Miss-rate fallacy: miss rate can be as misleading a proxy for memory
      performance as MIPS is for CPU performance; the real measure is
      average memory access time
• Average memory-access time
        = Hit time + Miss rate x Miss penalty
                (ns or clocks)
• Miss penalty: time to replace a block from lower level,
  including time to replace in CPU
    – access time: time to lower level
      = f(latency to lower level)
    – transfer time: time to transfer block
      =f(BW between upper & lower levels)
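The average-memory-access-time formula is easy to sanity-check in a few lines of C; the hit time, miss rate, and miss penalty below are made-up values, not data from the text.

    #include <stdio.h>

    /* AMAT = hit time + miss rate x miss penalty (all in cycles here) */
    static double amat(double hit_time, double miss_rate, double miss_penalty) {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void) {
        /* Example: 1-cycle hit, 5% miss rate, 50-cycle miss penalty */
        printf("AMAT = %.2f cycles\n", amat(1.0, 0.05, 50.0));   /* 3.50 */
        return 0;
    }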
    4 Questions for Memory Hierarchy


• Q1: Where can a block be placed in the upper level?
      (Block placement)
• Q2: How is a block found if it is in the upper level?
      (Block identification)
• Q3: Which block should be replaced on a miss?
      (Block replacement)
• Q4: What happens on a write?
      (Write strategy)
     Q1: Where can a block be placed in
             the upper level?
 •   Block 12 placed in 8 block cache:
      – Fully associative, direct mapped, 2-way set associative
      – S.A. Mapping = Block Number Modulo Number Sets


[Figure: placing memory block 12 in an 8-block cache. Fully associative: any of the
 8 blocks. Direct mapped: block (12 mod 8) = 4. 2-way set associative: set
 (12 mod 4) = 0. Memory blocks 0-31 shown below the cache.]
         Simplest Cache: Direct Mapped
[Figure: a memory of 16 locations (addresses 0-F) mapped onto a 4-byte direct-mapped
 cache with cache indexes 0-3]
• Location 0 can be occupied by data from:
    – Memory location 0, 4, 8, ... etc.
    – In general: any memory location whose 2 LSBs of the address are 0s
    – Address<1:0> => cache index
• Which one should we place in the cache?
• How can we tell which one is in the cache?
   1 KB Direct Mapped Cache, 32B blocks
• For a 2 ** N byte cache:
   – The uppermost (32 - N) bits are always the Cache Tag
   – The lowest M bits are the Byte Select (Block Size = 2 **
     M)
[Figure: 32-bit address split into Cache Tag (bits 31-10, e.g. 0x50), Cache Index
 (bits 9-5, e.g. 0x01), and Byte Select (bits 4-0, e.g. 0x00). Each cache entry
 holds a valid bit, the tag (stored as part of the cache "state"), and a 32-byte
 data block: Byte 0 ... Byte 31 in line 0, Byte 32 ... Byte 63 in line 1, ...,
 up to Byte 1023 in line 31.]
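A hedged C sketch of the address breakdown for this 1 KB, 32-byte-block direct-mapped cache: 5 byte-select bits, 5 index bits (32 lines), and the remaining 22 bits of tag; the example address is arbitrary.

    #include <stdio.h>

    int main(void) {
        /* 1 KB cache with 32 B blocks: 1024/32 = 32 lines */
        const unsigned block_bits = 5;           /* 32-byte blocks            */
        const unsigned index_bits = 5;           /* 32 cache lines            */

        unsigned addr   = 0x0000A021;            /* arbitrary example address */
        unsigned offset = addr & ((1u << block_bits) - 1);
        unsigned index  = (addr >> block_bits) & ((1u << index_bits) - 1);
        unsigned tag    = addr >> (block_bits + index_bits);

        printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
        return 0;
    }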
                      Two-way Set Associative Cache
      • N-way set associative: N entries for each Cache Index
         – N direct mapped caches operates in parallel (N typically 2 to 4)
      • Example: Two-way set associative cache
           – Cache Index selects a “set” from the cache
           – The two tags in the set are compared in parallel
           – Data is selected based on the tag result

[Figure: the Cache Index selects one set; each of the two ways holds a valid bit,
 a cache tag, and cache data (Cache Block 0). The address tag is compared with both
 stored tags in parallel, Sel1/Sel0 drive a mux that picks the data of the hitting
 way, and the OR of the two compares produces Hit.]
           Disadvantage of Set Associative Cache
 • N-way Set Associative Cache v. Direct Mapped Cache:
         – N comparators vs. 1
         – Extra MUX delay for the data
         – Data comes AFTER Hit/Miss
 • In a direct mapped cache, Cache Block is available BEFORE Hit/Miss:
         – Possible to assume a hit and continue. Recover later if miss.

[Same two-way set-associative figure as above: parallel tag compares in both ways,
 then the mux select, then Hit.]
               The Cache Design Space
• Several interacting dimensions
    – cache size
    – block size
    – associativity
    – replacement policy
    – write-through vs write-back
• The optimal choice is a compromise
    – depends on access characteristics
         » workload
         » use (I-cache, D-cache, TLB)
    – depends on technology / cost
• Simplicity often wins
[Figure: the design space drawn as a cube with axes cache size, associativity, and
 block size, plus a sketch of a good/bad tradeoff curve between two factors]
 Q2: How is a block found if it is in
         the upper level?

• Tag on each block
   – No need to check index or block
     offset
• Increasing associativity shrinks index,
  expands tag
               Block Address            Block
               Tag             Index   Offset
Q3: Which block should be replaced on a
                miss?

• Easy for Direct Mapped
• Set Associative or Fully Associative:
   –Random
   –LRU (Least Recently Used)

Assoc:      2-way            4-way            8-way
Size        LRU     Random   LRU     Random   LRU     Random
16 KB       5.2%    5.7%     4.7%    5.3%     4.4%    5.0%
64 KB       1.9%    2.0%     1.5%    1.7%     1.4%    1.5%
256 KB      1.15%   1.17%    1.13%   1.13%    1.12%   1.12%
    4 Questions for Memory Hierarchy


• Q1: Where can a block be placed in the upper level?
      (Block placement)
• Q2: How is a block found if it is in the upper level?
      (Block identification)
• Q3: Which block should be replaced on a miss?
      (Block replacement)
• Q4: What happens on a write?
      (Write strategy)
     Q4: What happens on a write?

• Write through—The information is written to both
  the block in the cache and to the block in the lower-
  level memory.
• Write back—The information is written only to the
  block in the cache. The modified cache block is
  written to main memory only when it is replaced.
   – is block clean or dirty?
• Pros and Cons of each?
   – WT: read misses do not result in writes
     to lower level
   – WB: no repeated writes to same location
     in lower level
• WT is always combined with write buffers so that the processor doesn't
  wait for the lower-level memory
                     Write Policy:
              Write-Through vs Write-Back

• Write-through: all writes update cache and underlying memory/cache
   – Can always discard cached data - most up-to-date data is in
     memory
   – Cache control bit: only a valid bit
• Write-back: all writes simply update cache
   – Can’t just discard cached data - may have to write it back to
     memory
   – Cache control bits: both valid and dirty bits
• Other Advantages:
   – Write-through:
       » memory (or other processors) always have latest data
       » Simpler management of cache
   – Write-back:
       » much lower bandwidth, since data often overwritten multiple times
       » Better tolerance to long-latency memory?
       Q4: What happens on a write?

                         Write-Through                    Write-Back
 Policy                  Data written to the cache        Write data only to the cache
                         block is also written to         block; update the lower level
                         lower-level memory               when the block falls out of
                                                          the cache
 Debug                   Easy                             Hard
 Do read misses
 produce writes?         No                               Yes
 Do repeated writes
 reach the lower level?  Yes                              No

     Additional option -- let writes to an un-cached address
          allocate a new cache line ("write-allocate").
        Write Buffer for Write Through

                                   Cache
             Processor                           DRAM


                                 Write Buffer
• A Write Buffer is needed between the Cache and Memory
   – Processor: writes data into the cache and the write buffer
   – Memory controller: write contents of the buffer to
     memory
• Write buffer is just a FIFO:
   – Typical number of entries: 4
   – Works fine if: Store frequency (w.r.t. time) << 1 /
     DRAM write cycle
• Memory system design:
    – if the store frequency (w.r.t. time) approaches 1 / DRAM write cycle,
      the write buffer saturates
Write Buffers for Write-Through Caches

                            Cache        Lower
        Processor                         Level
                                         Memory
                          Write Buffer


  Holds data awaiting write-through to
          lower level memory
Q. Why a write buffer ?       A. So CPU doesn’t stall

Q. Why a buffer, why          A. Bursts of writes are
not just one register ?       common.
Q. Are Read After Write A. Yes! Drain buffer before
(RAW) hazards an issue next read, or send read 1st
for write buffer?       after check write buffers.
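A very small C sketch of the write-buffer idea: a fixed-depth FIFO of pending stores plus the read-after-write check described above. The structure and names are assumptions for illustration, not a model of any particular machine.

    #include <stdint.h>
    #include <stdbool.h>

    #define WB_ENTRIES 4                          /* typical depth from the slide */

    struct wb_entry { uint32_t addr; uint32_t data; };
    struct write_buffer { struct wb_entry e[WB_ENTRIES]; int count; };

    /* Processor side: enqueue a store; caller must stall if the buffer is full. */
    static bool wb_push(struct write_buffer *wb, uint32_t addr, uint32_t data) {
        if (wb->count == WB_ENTRIES) return false;        /* buffer saturated */
        wb->e[wb->count++] = (struct wb_entry){ addr, data };
        return true;
    }

    /* Read path: check buffered stores for a RAW hazard before going to memory
       (the newest matching entry wins). */
    static bool wb_forward(const struct write_buffer *wb, uint32_t addr, uint32_t *data) {
        for (int i = wb->count - 1; i >= 0; i--)
            if (wb->e[i].addr == addr) { *data = wb->e[i].data; return true; }
        return false;                                     /* no match: read memory */
    }

    int main(void) {
        struct write_buffer wb = { .count = 0 };
        wb_push(&wb, 0x1000, 42);
        uint32_t v;
        return wb_forward(&wb, 0x1000, &v) ? 0 : 1;   /* RAW check finds the buffered store */
    }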
    4 Questions for Memory Hierarchy


• Q1: Where can a block be placed in the upper level?
      (Block placement)
• Q2: How is a block found if it is in the upper level?
      (Block identification)
• Q3: Which block should be replaced on a miss?
      (Block replacement)
• Q4: What happens on a write?
      (Write strategy)
               Write Policy 2:
        Write Allocate vs Non-Allocate
        (What happens on write-miss)

• Write allocate: allocate new cache line in cache

   – Usually means that you have to do a
     “read miss” to fill in rest of the
     cache-line!
• Write non-allocate (or “write-around”):

   – Simply send write data through to
     underlying memory/cache - don’t
     allocate new cache line!
          Write Policy: Combination
• Usually it is either:
• A Write-back cache with Write allocate policy:

  –why?
• Or a Write-through cache with Write non-allocate
  policy:

  –why?
What are all the aspects of cache
    organization that impact
         performance?
                  Review: Cache performance
• Miss-oriented approach to memory access:

  CPUtime = IC x (CPI_Execution + MemAccess/Inst x MissRate x MissPenalty) x CycleTime

  CPUtime = IC x (CPI_Execution + MemMisses/Inst x MissPenalty) x CycleTime

      – CPI_Execution includes ALU and memory instructions

• Separating out the memory component entirely
      – AMAT = Average Memory Access Time
      – CPI_AluOps does not include memory instructions

  CPUtime = IC x (AluOps/Inst x CPI_AluOps + MemAccess/Inst x AMAT) x CycleTime

  AMAT = HitTime + MissRate x MissPenalty
       = (HitTime_Inst + MissRate_Inst x MissPenalty_Inst)
       + (HitTime_Data + MissRate_Data x MissPenalty_Data)
              Impact on Performance
• Suppose a processor executes at
   –Clock Rate = 200 MHz (5 ns per cycle), Ideal
    (no misses) CPI = 1.1
   –50% arith/logic, 30% ld/st, 20% control
• Suppose that 10% of data memory operations get 50
  cycle miss penalty
• Suppose that 1% of instructions get same miss penalty
• CPI = ideal CPI + average stalls per instruction
      1.1(cycles/ins) +
      [ 0.30 (DataMops/ins)
             x 0.10 (miss/DataMop) x 50 (cycle/miss)] +
      [ 1 (InstMop/ins)
             x 0.01 (miss/InstMop) x 50 (cycle/miss)]
       = (1.1 + 1.5 + .5) cycle/ins = 3.1
• 64.5% of the time the processor is stalled waiting for
  memory!
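The same stall accounting can be scripted; this C sketch simply reproduces the arithmetic above.

    #include <stdio.h>

    int main(void) {
        double ideal_cpi     = 1.1;
        double data_ops      = 0.30;   /* data memory ops per instruction */
        double data_missrate = 0.10;
        double inst_missrate = 0.01;
        double miss_penalty  = 50.0;   /* cycles */

        double stalls = data_ops * data_missrate * miss_penalty   /* 1.5 */
                      + 1.0 * inst_missrate * miss_penalty;       /* 0.5 */
        double cpi = ideal_cpi + stalls;

        printf("CPI = %.1f, fraction of time stalled = %.1f%%\n",
               cpi, 100.0 * stalls / cpi);   /* 3.1 and ~64.5% */
        return 0;
    }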
                       Unified vs Split Caches
• Unified vs Separate I&D

[Figure: (a) a processor with a unified L1 cache backed by a unified L2 cache
 vs. (b) a processor with split L1 I- and D-caches backed by a unified L2 cache]
• Example:
   – 16KB I&D: Inst miss rate=0.64%, Data miss rate=6.47%
   – 32KB unified: Aggregate miss rate=1.99%
• Which is better (ignore L2 cache)?
   – 75% accesses from instructions
   – hit time=1, miss time=50
   – Note that data hit has 1 stall for unified cache (only one port)

AMATSplit=75%x(1+0.64%x50)+25%x(1+6.47%x50) = 2.05
AMATUnified=75%x(1+1.99%x50)+25%x(1+1+1.99%x50)= 2.24
               Where do misses come from?
•   Classifying Misses: 3 Cs
     – Compulsory—The        first access to a block is not in the cache,
        so the block must be brought into the cache. Also called cold
        start misses or first reference misses.
        (Misses in even an Infinite Cache)
     – Capacity—If      the cache cannot contain all the blocks needed
        during execution of a program, capacity misses will occur due to
        blocks being discarded and later retrieved.
        (Misses in Fully Associative Size X Cache)
     – Conflict—If      block-placement strategy is set associative or
        direct mapped, conflict misses (in addition to compulsory &
        capacity misses) will occur because a block can be discarded and
        later retrieved if too many blocks map to its set. Also called
        collision misses or interference misses.
        (Misses in N-way Associative, Size X Cache)
•   4th “C”:
     – Coherence        - Misses caused by cache coherence.
               3Cs Absolute Miss Rate (SPEC92)
[Figure: miss rate per type vs. cache size (1 KB to 128 KB) for 1-, 2-, 4-, and
 8-way associativity, with the total split into conflict, capacity, and compulsory
 components]
                                           Cache Size
[Same 3Cs miss-rate figure as above, read along the cache-size axis (1 KB to 128 KB).]
• Old rule of thumb: 2x size => 25% cut in miss rate
• What does it reduce?
• What is the down side?
               Cache Organization?

•   Assume total cache size not changed:
•   What happens if:

1) Change Block Size:

2) Change Associativity:

3) Change Compiler:

    Which of 3Cs is obviously affected?
                        Larger Block Size (fixed size & assoc)
[Figure: miss rate vs. block size (16 to 256 bytes) for cache sizes 1K, 4K, 16K,
 64K, and 256K. Larger blocks reduce compulsory misses, but in the small caches the
 largest blocks increase conflict misses.]


   What else is the down side of a larger block size?
                                    Associativity
[Same 3Cs miss-rate figure as above: increasing associativity from 1-way to 8-way
 removes most of the conflict misses.]


What is the down side of higher associativity?
                         Look at 3 Cs again
•   Classifying Misses: 3 Cs

     – Compulsory—The        first access to a block is not in the cache,
        so the block must be brought into the cache. Also called cold
        start misses or first reference misses.
        (Misses in even an Infinite Cache)
     – Capacity—If      the cache cannot contain all the blocks needed
        during execution of a program, capacity misses will occur due to
        blocks being discarded and later retrieved.
        (Misses in Fully Associative Size X Cache)
     – Conflict—If      block-placement strategy is set associative or
        direct mapped, conflict misses (in addition to compulsory &
        capacity misses) will occur because a block can be discarded and
        later retrieved if too many blocks map to its set. Also called
        collision misses or interference misses.
        (Misses in N-way Associative, Size X Cache)
                                 3Cs Relative Miss Rate
[Figure: the same SPEC92 data normalized to 100%, showing the relative share of
 conflict, capacity, and compulsory misses vs. cache size (1 KB to 128 KB) for
 1- to 8-way associativity]

Flaws: for fixed block size
Good: insight => invention
       Associativity vs Cycle Time
• Beware: Execution time is the only final measure!
• Why is cycle time tied to hit time?


• Will Clock Cycle time increase?
   – Evaluating associativity in CPU caches
     Hill, M.D.; Smith, A.J.;
     IEEE Transactions on Computers, Volume:
     38, Issue:12, Dec. 1989
• Effective cycle time of assoc:
   – Performance tradeoffs in cache design
    S. Przybylski, M. Horowitz, and J. Hennessy
    ISCA 1988
   Example: Avg. Memory Access Time
             vs. Miss Rate

• Example: assume CCT = 1.10 for 2-way, 1.12 for
  4-way, 1.14 for 8-way vs. CCT direct mapped
      Cache Size     Associativity
      (KB)   1-way   2-way 4-way     8-way
      1      2.33    2.15    2.07    2.01
      2      1.98    1.86    1.76    1.68
      4      1.72    1.67    1.61    1.53
      8      1.46    1.48    1.47    1.43
      16     1.29    1.32    1.32    1.32
      32     1.20    1.24    1.25    1.27
      64     1.14    1.20    1.21    1.23
      128    1.10    1.17    1.18    1.20


 (Red means A.M.A.T. not improved by more associativity)
           Fast Hit Time + Low Conflict => Victim Cache

• How to combine the fast hit time of direct mapped yet still avoid conflict misses?
• Add a small buffer to hold data discarded from the cache
• Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a
  4 KB direct-mapped data cache
• Used in Alpha, HP machines
[Figure: a fully associative victim cache of four entries (each a tag-and-comparator
 plus one cache line of data) sitting between the direct-mapped cache (TAGS/DATA)
 and the next lower level in the hierarchy]
                Add a second-level cache
• L2 Equations
 AMAT = Hit TimeL1 + Miss RateL1 x Miss PenaltyL1

 Miss PenaltyL1 = Hit TimeL2 + Miss RateL2 x Miss PenaltyL2

 AMAT = Hit TimeL1 +
        Miss RateL1 x (Hit TimeL2 + Miss RateL2 x Miss PenaltyL2)


• Definitions:
   – Local miss rate— misses         in this cache divided by the
     total number of memory          accesses to this cache (Miss
     rateL2)
   – Global miss rate—misses         in this cache divided by the
     total number of memory          accesses generated by the
    CPU
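The two-level AMAT and the local/global miss-rate distinction fit in a short C sketch; the hit times, miss rates, and memory penalty below are placeholders, not figures from the text.

    #include <stdio.h>

    int main(void) {
        double hit_l1 = 1.0,  miss_l1 = 0.05;   /* local L1 miss rate      */
        double hit_l2 = 10.0, miss_l2 = 0.40;   /* local L2 miss rate      */
        double mem_penalty = 100.0;             /* cycles to main memory   */

        /* The miss penalty seen by L1 is itself an AMAT into L2 */
        double miss_penalty_l1 = hit_l2 + miss_l2 * mem_penalty;
        double amat = hit_l1 + miss_l1 * miss_penalty_l1;

        /* Global L2 miss rate: fraction of all CPU accesses missing in both levels */
        double global_l2 = miss_l1 * miss_l2;

        printf("AMAT = %.2f cycles, global L2 miss rate = %.1f%%\n",
               amat, 100.0 * global_l2);   /* 3.50 cycles, 2.0% */
        return 0;
    }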
      Comparing Local and Global Miss Rates

•   32 KByte 1st level cache;
    Increasing 2nd level cache
•   Global miss rate close to
    single level cache rate
    provided L2 >> L1
•   Don’t use local miss rate
•   L2 not tied to CPU clock
    cycle!
•   Cost & A.M.A.T.
•   Generally Fast Hit Times
    and fewer misses
•   Since hits are few, target
    miss reduction
 EE/CE6304 Computer Architecture

            Lecture 13




         Rama Sangireddy
        Assistant Professor
Department of Electrical Engineering
   University of Texas at Dallas
                    Announcements
• Homework-2
   – Due today

• Exam-1 on 10/21/2008 (Tuesday)
  – Covers until memory hierarchy


• ARM seminar
  – TI Auditorium
  – 10/14/2008 (Tuesday), 5.30pm


• Project: Get started immediately
     Review: 6 Basic Cache Optimizations
• Reducing hit time
1. Giving Reads Priority over Writes
     •   E.g., Read complete before earlier writes in write buffer
2. Avoiding Address Translation during Cache
   Indexing

• Reducing Miss Penalty
3. Multilevel Caches

•    Reducing Miss Rate
4.   Larger Block size (Compulsory misses)
5.   Larger Cache size (Capacity misses)
6.   Higher Associativity (Conflict misses)
  11 Advanced Cache Optimizations
• Reducing hit time
  1. Small and simple caches
  2. Way prediction
  3. Trace caches
• Increasing cache bandwidth
  4. Pipelined caches
  5. Multibanked caches
  6. Nonblocking caches
• Reducing miss penalty
  7. Critical word first
  8. Merging write buffers
• Reducing miss rate
  9. Compiler optimizations
• Reducing miss penalty or miss rate via parallelism
  10. Hardware prefetching
  11. Compiler prefetching
                    1. Fast Hit Times via Small and Simple Caches
•  Indexing the tag memory and then comparing takes time
•  A small cache can help hit time since a smaller memory takes less time to index
     – E.g., L1 caches have stayed the same size for 3 generations of AMD
       microprocessors: K6, Athlon, and Opteron
     – Also L2 cache small enough to fit on chip, but large enough to hit many
       accesses, avoiding the time penalty of going off chip
•  Simple => direct mapping
     – Can overlap tag check with data transmission since there is no choice of way
•  Access time estimate for 90 nm using the CACTI 4.0 model
     – Median ratios of access time relative to the direct-mapped caches are
       1.32, 1.39, and 1.43 for 2-way, 4-way, and 8-way caches
[Figure: access time (ns) vs. cache size (16 KB to 1 MB) for 1-, 2-, 4-, and
 8-way associativity]
           2. Fast Hit times via Way Prediction
• How to combine fast hit time of Direct Mapped and have the lower
  conflict misses of 2-way SA cache?
• Way prediction: keep extra bits in cache to predict the “way,” or
  block within the set, of next cache access.
   – Multiplexor is set early to select desired block, only
     1 tag comparison performed that clock cycle in
     parallel with reading the cache data
   – Miss  check other blocks for matches in next clock
     cycle
        Hit Time
        Way-Miss Hit Time            Miss Penalty

• Accuracy ≈ 85%
• Drawback: CPU pipeline design is hard if a hit takes 1 or 2 cycles
    – Used for instruction caches rather than data caches
           3. Fast Hit times via Trace Cache
•    Find more instruction level parallelism?
     How avoid translation from x86 to microops?
•    Trace cache in Pentium 4
1.   Dynamic traces of the executed instructions vs. static sequences of instructions
     as determined by layout in memory
     –   Built-in branch predictor
2.   Cache the micro-ops vs. x86 instructions
     – Decode/translate from x86 to micro-ops on trace cache miss
+    Better utilization of long blocks (don’t exit in the middle of a block, don’t
     enter at a label in the middle of a block)
-    Complicated address mapping since addresses are no longer aligned to
     power-of-2 multiples of the word size
-    Instructions may appear multiple times in multiple dynamic traces due to
     different branch outcomes

For more information, refer to "Trace Cache: A Low Latency Approach to High-
    Bandwidth Instruction Fetching", E. Rotenberg, S. Bennett, J.E. Smith,
    Proceedings of MICRO-29, December 1996
-   Paper available for download in references section of course webpage
      4. Increasing Cache Bandwidth by
                 Pipelining
• Pipeline cache accesses to maintain bandwidth, at the cost of
  higher latency
• Instruction cache access pipeline stages:
  1: Pentium
  2: Pentium Pro through Pentium III
  4: Pentium 4
-  Greater penalty on mispredicted branches
-  More clock cycles between the issue of a
   load and the use of the data
       5. Increasing Cache Bandwidth:
            Non-Blocking Caches
• A non-blocking cache or lockup-free cache allows the data cache to
  continue to supply cache hits during a miss
• "Hit under miss" reduces the effective miss penalty by working
  during the miss instead of ignoring CPU requests
• "Hit under multiple miss" or "miss under miss" may further
  lower the effective miss penalty by overlapping multiple misses
   – Significantly increases the complexity of the
     cache controller, as there can be multiple
     outstanding memory accesses
   – Requires multiple memory banks (otherwise multiple
     outstanding misses cannot be supported)
   – The Pentium Pro allows 4 outstanding memory misses
                      Value of Hit Under Miss for SPEC
[Chart: average memory access time, normalized, per SPEC92 benchmark (integer programs on the left, floating point on the right) for "hit under 1 miss", "hit under 2 misses", "hit under 64 misses", and the blocking base case]
• FP programs on average: AMAT= 0.68 -> 0.52 -> 0.34 -> 0.26
• Int programs on average: AMAT= 0.24 -> 0.20 -> 0.19 -> 0.19
• 8 KB Data Cache, Direct Mapped, 32B block, 16 cycle miss, SPEC 92
      6. Increasing Cache Bandwidth via
                Multiple Banks
• Rather than treat the cache as a single monolithic block,
  divide it into independent banks that can support simultaneous
  accesses
   – E.g., the T1 ("Niagara") L2 has 4 banks
• Banking works best when accesses naturally spread themselves
  across the banks ⇒ the mapping of addresses to banks affects the
  behavior of the memory system
• A simple mapping that works well is "sequential interleaving"
  (see the sketch below)
   – Spread block addresses sequentially across banks
   – E.g., with 4 banks, bank 0 has all blocks
     whose address modulo 4 is 0; bank 1 has all
     blocks whose address modulo 4 is 1; …
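
A minimal sketch of the sequential-interleaving mapping, assuming 64-byte blocks and 4 banks (both values are illustrative):

#include <stdint.h>

#define BLOCK_BYTES 64u    /* assumed block size                     */
#define NUM_BANKS    4u    /* e.g., 4 banks as in the T1 L2          */

/* Sequential interleaving: consecutive block addresses go to
 * consecutive banks, so a unit-stride stream spreads across all banks. */
static inline unsigned bank_of(uint64_t byte_addr)
{
    uint64_t block_addr = byte_addr / BLOCK_BYTES;
    return (unsigned)(block_addr % NUM_BANKS);
}

With this mapping, block addresses 0, 1, 2, 3, 4, ... land in banks 0, 1, 2, 3, 0, ..., so a unit-stride access stream keeps all four banks busy.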
            7. Reduce Miss Penalty:
     Early Restart and Critical Word First
• Don't wait for the full block before restarting the CPU
• Early restart—As soon as the requested word of the block
  arrives, send it to the CPU and let the CPU continue execution
   – Spatial locality ⇒ the CPU tends to want the next sequential word,
     so the benefit of early restart alone is unclear
• Critical word first—Request the missed word first from
  memory and send it to the CPU as soon as it arrives; let the
  CPU continue execution while filling the rest of the words in
  the block
   – Long blocks are more popular today ⇒ critical word
     first is widely used
  11 Advanced Cache Optimizations
• Reducing hit time
  1. Small and simple caches
  2. Way prediction
  3. Trace caches
• Increasing cache bandwidth
  4. Pipelined caches
  5. Multibanked caches
  6. Nonblocking caches
• Reducing miss penalty
  7. Critical word first
  8. Merging write buffers
• Reducing miss rate
  9. Compiler optimizations
• Reducing miss penalty or miss rate via parallelism
  10. Hardware prefetching
  11. Compiler prefetching
         8. Merging Write Buffer to
            Reduce Miss Penalty
•   A write buffer allows the processor to continue
    while waiting for the write to reach memory
•   If the buffer already contains modified blocks, the
    addresses can be checked to see if the address
    of the new data matches the address of a valid
    write buffer entry
•   If so, the new data are combined with that
    entry ("write merging"); a sketch follows below
•   Increases the effective block size of the write buffer for
    write-through caches when writes go to sequential
    words/bytes, since multiword writes are more
    efficient to memory
•   The Sun T1 (Niagara) processor, among
    many others, uses write merging
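
The address check and combining step can be sketched in software as follows; the entry width, buffer depth, and field names are assumptions made for illustration, not the layout of any particular machine.

#include <stdint.h>
#include <stdbool.h>

#define ENTRY_BYTES     32                      /* assumed entry width       */
#define WORD_BYTES       8                      /* assumed write granularity */
#define WORDS_PER_ENTRY (ENTRY_BYTES / WORD_BYTES)
#define NUM_ENTRIES      4                      /* assumed buffer depth      */

struct wb_entry {
    bool     valid;
    uint64_t block_addr;                        /* which 32-byte region      */
    bool     word_valid[WORDS_PER_ENTRY];
    uint64_t data[WORDS_PER_ENTRY];
};

static struct wb_entry write_buffer[NUM_ENTRIES];

/* Try to merge an aligned 8-byte store into an existing valid entry
 * whose block address matches; return false so the caller allocates a
 * new entry (or stalls if the buffer is full). */
bool write_merge(uint64_t addr, uint64_t value)
{
    uint64_t block = addr / ENTRY_BYTES;
    unsigned word  = (unsigned)((addr % ENTRY_BYTES) / WORD_BYTES);

    for (int i = 0; i < NUM_ENTRIES; i++) {
        if (write_buffer[i].valid && write_buffer[i].block_addr == block) {
            write_buffer[i].data[word]       = value;   /* combine           */
            write_buffer[i].word_valid[word] = true;
            return true;
        }
    }
    return false;
}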
         9. Reducing Misses by Compiler
                 Optimizations
• McFarling [1989] reduced cache misses by 75%
  on an 8 KB direct-mapped cache with 4-byte blocks, in software
• Instructions
   – Reorder procedures in memory so as to reduce conflict misses
   – Profiling to look at conflicts (using tools they developed)
• Data
   – Merging Arrays: improve spatial locality by using a single array of
     compound elements vs. 2 separate arrays
   – Loop Interchange: change the nesting of loops to access data in the
     order it is stored in memory
   – Loop Fusion: combine 2 independent loops that have the same
     looping and some variables in common
   – Blocking: improve temporal locality by accessing "blocks" of
     data repeatedly vs. going down whole columns or rows
    Merging Arrays Example
/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
   int val;
   int key;
};
struct merge merged_array[SIZE];



Reduces conflicts between val & key and
  improves spatial locality
   Loop Interchange Example

/* Before */
for (k = 0; k < 100; k = k+1)
  for (j = 0; j < 100; j = j+1)
       for (i = 0; i < 5000; i = i+1)
              x[i][j] = 2 * x[i][j];
/* After */
for (k = 0; k < 100; k = k+1)
  for (i = 0; i < 5000; i = i+1)
       for (j = 0; j < 100; j = j+1)
              x[i][j] = 2 * x[i][j];


Sequential accesses instead of striding
 through memory every 100 words;
 improved spatial locality
           Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
       a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
       d[i][j] = a[i][j] + c[i][j];
/* After */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
  {    a[i][j] = 1/b[i][j] * c[i][j];
       d[i][j] = a[i][j] + c[i][j];}


Before fusion: 2 misses per access to a & c;
  after: 1 miss per access (better temporal locality)
                     Blocking Example
/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
      {r = 0;
       for (k = 0; k < N; k = k+1){
         r = r + y[i][k]*z[k][j];};
       x[i][j] = r;
      };
• Two Inner Loops:
   – Read all N×N elements of z[]
   – Read N elements of 1 row of y[] repeatedly
   – Write N elements of 1 row of x[]
• Capacity Misses are a function of N & Cache Size:
   – 2N³ + N² words accessed (assuming no conflict misses;
     otherwise more)
• Idea: compute on a B×B submatrix that fits in the cache
              Blocking Example

/* After */
for (jj = 0; jj < N; jj = jj+B)
for (kk = 0; kk < N; kk = kk+B)
for (i = 0; i < N; i = i+1)
   for (j = jj; j < min(jj+B-1,N); j = j+1)
     {r = 0;
      for (k = kk; k < min(kk+B-1,N); k = k+1) {
        r = r + y[i][k]*z[k][j];};
      x[i][j] = x[i][j] + r;
     };

• B is called the Blocking Factor
• Capacity misses drop from 2N³ + N² to 2N³/B + N²
• Conflict misses too?
               Reducing Conflict Misses by Blocking
[Chart: miss rate (0 to 0.1) vs. blocking factor (0 to 150) for a direct-mapped cache and a fully associative cache]
• Conflict misses in caches that are not fully associative vs. blocking size
      – Lam et al. [1991]: a blocking factor of 24 had one
        fifth the misses of a factor of 48, even though both
        fit in the cache
     Summary of Compiler Optimizations to
        Reduce Cache Misses (by hand)
[Chart: performance improvement (1x to 3x) from merged arrays, loop interchange, loop fusion, and blocking, applied by hand to vpenta (nasa7), gmty (nasa7), tomcatv, btrix (nasa7), mxm (nasa7), spice, cholesky (nasa7), and compress]
                                   10. Reducing Misses by Hardware
                                  Prefetching of Instructions & Data
•   Prefetching relies on having extra memory bandwidth that can be used
    without penalty
•   Instruction Prefetching
     – Typically, the CPU fetches 2 blocks on a miss: the requested block
       and the next consecutive block
     – The requested block is placed in the instruction cache when it returns,
       and the prefetched block is placed in the instruction stream buffer
•   Data Prefetching
     – The Pentium 4 can prefetch data into the L2 cache from up to 8
       streams from 8 different 4 KB pages
     – Prefetching is invoked if there are 2 successive L2 cache misses to a
       page and the distance between those cache blocks is < 256 bytes
       (sketched below)
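
As an illustration, that trigger rule might be modeled in software as follows; the tracking structure, its names, and the per-stream bookkeeping are assumptions for this sketch, not details of the actual Pentium 4 prefetcher.

#include <stdint.h>
#include <stdbool.h>

#define PAGE_BYTES 4096u

struct stream_tracker {
    bool     seen_miss;   /* have we recorded a previous L2 miss?        */
    uint64_t page;        /* 4 KB page of the previous L2 miss           */
    uint64_t last_miss;   /* byte address of the previous L2 miss        */
};

/* Called on every L2 miss.  Returns true when a prefetch stream should
 * start: this miss is to the same 4 KB page as the previous one and the
 * two miss addresses are less than 256 bytes apart. */
bool should_start_prefetch(struct stream_tracker *t, uint64_t miss_addr)
{
    uint64_t page  = miss_addr / PAGE_BYTES;
    uint64_t delta = miss_addr > t->last_miss ? miss_addr - t->last_miss
                                              : t->last_miss - miss_addr;
    bool start = t->seen_miss && t->page == page && delta < 256;

    t->seen_miss = true;          /* remember this miss for next time    */
    t->page      = page;
    t->last_miss = miss_addr;
    return start;
}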
         Performance Improvement
[Chart: speedup from hardware prefetching on the Pentium 4 across SPECint2000 and SPECfp2000 benchmarks; improvements range from 1.16 to 1.97, with intermediate values of 1.18, 1.20, 1.21, 1.26, 1.29, 1.32, 1.40, 1.45, and 1.49]
         11. Reducing Misses by
        Software Prefetching Data
• Data prefetch
   – Load data into a register (HP PA-RISC loads)
   – Cache prefetch: load into the cache
     (MIPS IV, PowerPC, SPARC v.9)
   – Special prefetching instructions must not cause
     faults; a form of speculative execution
• Issuing prefetch instructions takes time
   – Is the cost of issuing prefetches < the savings in reduced
     misses?
   – Wider superscalar issue reduces the difficulty of finding the
     issue bandwidth for prefetches
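
For example, with GCC or Clang a non-faulting cache prefetch can be requested through the __builtin_prefetch compiler builtin; the loop, the prefetch distance, and the function name below are illustrative choices, not material from the lecture.

/* Prefetch b[] a fixed distance ahead of its use.  __builtin_prefetch
 * compiles to a non-faulting prefetch instruction on targets that have
 * one.  PREFETCH_AHEAD is a tuning knob chosen here for illustration. */
#define PREFETCH_AHEAD 16

void scale(double *a, const double *b, int n)
{
    for (int i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&b[i + PREFETCH_AHEAD], 0, 1); /* read, low temporal locality */
        a[i] = 2.0 * b[i];
    }
}

Whether this pays off depends on whether the prefetch issue cost is smaller than the savings from the reduced misses, exactly as the slide cautions.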
         Compiler Optimization vs. Memory
                 Hierarchy Search
• The compiler tries to figure out memory hierarchy
  optimizations
• New approach: "auto-tuners" first run variations of the
  program on the computer to find the best combinations of
  optimizations (blocking, padding, …) and algorithms,
  then produce C code to be compiled for that
  computer
• "Auto-tuners" targeted to numerical methods
   – E.g., PHiPAC (BLAS), Atlas (BLAS),
     Sparsity (sparse linear algebra), Spiral
     (DSP), FFTW
                 Reducing Misses:
             Which apply to L2 Cache?
• Reducing Miss Rate
   1. Reduce Misses via Larger Block Size
   2. Reduce Conflict Misses via Higher Associativity
   3. Reducing Conflict Misses via Victim Cache
   4. Reducing Conflict Misses via Pseudo-Associativity
   5. Reducing Misses by HW Prefetching Instr, Data
   6. Reducing Misses by SW Prefetching Data
   7. Reducing Capacity/Conf. Misses by Compiler
    Optimizations
   L2 cache block size & A.M.A.T.

  Relative CPU time vs. L2 block size:

  Block size (bytes):   16    32    64    128   256   512
  Relative CPU time:    1.36  1.28  1.27  1.34  1.54  1.95

• 32 KB L1, 8-byte path to memory
             Reducing Miss Penalty Summary
  CPU time = IC × (CPI_Execution + Memory accesses/Instruction × Miss rate × Miss penalty) × Clock cycle time
        • Four techniques
            – Read priority over write on miss
            – Early Restart and Critical Word First
              on miss
            – Non-blocking Caches (Hit under Miss,
              Miss under Miss)
            – Second Level Cache
        • Can be applied recursively to Multilevel Caches
            – Danger is that time to DRAM will grow
              with multiple levels in between
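
As a quick worked example of the formula above (every input value here is invented purely for illustration):

#include <stdio.h>

int main(void)
{
    /* All numbers are made-up inputs for illustration.            */
    double ic             = 1e9;   /* instruction count            */
    double cpi_execution  = 1.0;   /* base CPI without mem. stalls */
    double accesses_per_i = 1.5;   /* memory accesses/instruction  */
    double miss_rate      = 0.02;
    double miss_penalty   = 100.0; /* clock cycles                 */
    double cycle_time_ns  = 0.5;

    double cpi = cpi_execution + accesses_per_i * miss_rate * miss_penalty;
    double cpu_time_s = ic * cpi * cycle_time_ns * 1e-9;

    printf("CPI = %.2f, CPU time = %.2f s\n", cpi, cpu_time_s); /* 4.00, 2.00 s */
    return 0;
}

Here memory stalls triple the base CPI, which is why the miss-penalty reductions listed above matter so much.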
                                   R4000 Performance
•    Not ideal CPI of 1:
       –   Load stalls (1 or 2 clock cycles)
       –   Branch stalls (2 cycles + unfilled slots)
       –   FP result stalls: RAW data hazard (latency)
       –   FP structural stalls: Not enough FP hardware (parallelism)
[Chart: R4000 CPI (0 to 4.5) for eqntott, espresso, gcc, li, doduc, nasa7, ora, spice2g6, su2cor, and tomcatv, broken into base CPI, load stalls, branch stalls, FP result stalls, and FP structural stalls]
           What is the Impact of What You've
                Learned About Caches?

[Chart: processor vs. DRAM performance, 1980–2000, log scale (1 to 1000); the CPU curve pulls far ahead of the DRAM curve]

•   1960-1985: Speed = ƒ(no. operations)
•   1990
     – Pipelined execution & fast clock rate
     – Out-of-order execution
     – Superscalar instruction issue
•   1998: Speed = ƒ(non-cached memory accesses)
•   What does this mean for
     – Compilers? Operating systems? Algorithms?
       Data structures?
   Alpha 21064
• Separate Instr & Data
  TLBs & caches
• TLBs fully associative
• TLB updates in SW
  ("Priv Arch Libr")
• Caches 8 KB direct
  mapped, write through
• Critical 8 bytes first
• Prefetch instruction
  stream buffer
• 2 MB L2 cache, direct
  mapped, WB (off-chip)
• 256-bit path to main
  memory, 4 x 64-bit
  modules
• Victim buffer: to give
  read priority over
  write
• 4-entry write buffer
  between D$ & L2$

[Block diagram: instruction and data caches with the stream buffer, write buffer, and victim buffer between the CPU and the off-chip L2]
                        Alpha CPI Components
• Instruction stall: branch mispredict; data cache; instruction cache; L2$
• Other: compute + register conflicts, structural conflicts

[Chart: CPI (0 to 5) for AlphaSort, Li, Compress, Ear, and Tomcatv, broken into Other, I Stall, D$, I$, and L2 components]
   Pitfall: Predicting Cache Performance from
       Different Programs (ISA, compiler, ...)
• 4 KB data cache miss rate: 8%, 12%, or 28%?
• 1 KB instruction cache miss rate: 0%, 3%, or 10%?
• Alpha vs. MIPS for an 8 KB data cache:
  17% vs. 10%
• Why 2X Alpha vs. MIPS?

[Chart: miss rate (0% to 35%) vs. cache size (1 KB to 128 KB) for the data and instruction caches of tomcatv, gcc, and espresso]
                        Cache Optimization Summary

  Technique                          MR   MP   HT   Complexity
  --- miss rate ---
  Larger Block Size                  +    –         0
  Higher Associativity               +         –    1
  Victim Caches                      +              2
  Pseudo-Associative Caches          +              2
  HW Prefetching of Instr/Data       +              2
  Compiler Controlled Prefetching    +              3
  Compiler Reduce Misses             +              0
  --- miss penalty ---
  Priority to Read Misses                 +         1
  Early Restart & Critical Word 1st       +         2
  Non-Blocking Caches                     +         3
  Second Level Caches                     +         2
  Better memory system                    +         3
  --- hit time ---
  Small & Simple Caches              –         +    0
  Avoiding Address Translation                 +    2
  Pipelining Caches                            +    2

  (MR = miss rate, MP = miss penalty, HT = hit time; + helps, – hurts)
  Technique                              HT  BW  MP  MR  HW cost/complexity  Comment
  Small and simple caches                +           –   0                   Trivial; widely used
  Way-predicting caches                  +               1                   Used in Pentium 4
  Trace caches                           +               3                   Used in Pentium 4
  Pipelined cache access                 –   +           1                   Widely used
  Nonblocking caches                         +   +       3                   Widely used
  Banked caches                              +           1                   Used in L2 of Opteron and Niagara
  Critical word first and early restart          +       2                   Widely used
  Merging write buffer                           +       1                   Widely used with write through
  Compiler techniques to reduce misses               +   0                   Software is a challenge; some computers have a compiler option
  HW prefetching of instructions & data          +   +   2 instr., 3 data    Many prefetch instructions; AMD Opteron prefetches data
  Compiler-controlled prefetching                +   +   3                   Needs nonblocking cache; in many CPUs

  (HT = hit time, BW = bandwidth, MP = miss penalty, MR = miss rate; + helps, – hurts)
            A Modern Memory Hierarchy
• By taking advantage of the principle of locality:
    – Present the user with as much memory as is available in
      the cheapest technology.
    – Provide access at the speed offered by the fastest
      technology.

[Diagram: processor (control + datapath with registers) backed by an on-chip cache (SRAM), a second-level cache (SRAM), main memory (DRAM), secondary storage (disk), and tertiary storage (disk/tape)]

  Level                      Speed              Size
  Registers                  ~1 ns              100s of bytes
  On-chip cache (SRAM)       ~10s of ns         KBs
  Main memory (DRAM)         ~100s of ns        MBs
  Secondary storage (disk)   ~10s of ms         GBs
  Tertiary storage (tape)    ~10s of seconds    TBs
           The Limits of Physical Addressing
            "Physical addresses" of memory locations

[Diagram: the CPU drives address lines A0-A31 and exchanges data on D0-D31 directly with memory]

         All programs share one address space:
               the physical address space
          Machine language programs must be
           aware of the machine organization
           No way to prevent a program from
           accessing any machine resource
Solution: Add a Layer of Indirection
     "Virtual Addresses"                      "Physical Addresses"

[Diagram: the CPU issues virtual addresses (A0-A31, D0-D31) to an address translation unit, which presents physical addresses to memory]

         User programs run in a standardized
                 virtual address space
         Address translation hardware,
      managed by the operating system (OS),
     maps virtual addresses to physical memory
    Hardware supports "modern" OS features:
        Protection, Translation, Sharing
                        Virtual Memory
• Divides physical memory into blocks
  and allocates them to different
  processes
• Actual reason for design of virtual
  memory?
   – relieves burden on
     programmer
• Relocation?
   – changes in mapping done by
     software
• Cache vs. Virtual Memory (VM)
   – cache block ~ page or
     segment
   – Cache miss ~ page fault
• Software (OS) controlled VM
  replacement on miss
   – Hardware controlled cache
     miss replacement
• VM size decided by width of CPU
  generated address
   – Cache size is independent
       Three Advantages of Virtual Memory
• Translation:
   – Program can be given consistent view of memory, even though
     physical memory is scrambled
   – Makes multithreading reasonable (now used a lot!)
   – Only the most important part of program (“Working Set”) must
     be in physical memory.
   – Contiguous structures (like stacks) use only as much physical
     memory as necessary yet still grow later.
• Protection:
   – Different threads (or processes) protected from each other.
   – Different pages can be given special behavior
       » (Read Only, Invisible to user programs, etc).
   – Kernel data protected from User programs
   – Very important for protection from malicious programs
• Sharing:
   – Can map same physical page to multiple users
     (“Shared memory”)
                      Virtual Memory
• Page vs. Segment?
• Paged VM and Segmented VM have design implications on CPU
   – Addressing complexities
• For paged VM, block replacement is trivial
   – Both replacing and replaced block are of same size
• For segmented VM, it is hard
• Different memory use inefficiencies
• Different disk traffic efficiencies
        Page tables encode virtual address spaces

  A virtual address space is divided into blocks of memory
  called pages.  A machine usually supports pages of a few
  sizes (MIPS R4000).

  A valid page table entry codes the physical memory "frame"
  address for the page.

[Diagram: pages of the virtual address space mapped onto frames of the physical address space]
   Virtual Memory: 4 memory hierarchy Qs
• Where can a block be placed in main memory?
   – Fully associative
• How is a block found if it is in main memory?
   – Via the page table (indexed by the virtual page number)
• Which block should be replaced on a VM miss?
   – LRU
• What happens on a write?
   – Write-back or write-through?
             Page tables encode virtual address spaces

  A virtual address space is divided into blocks of memory
  called pages.  A machine usually supports pages of a few
  sizes (MIPS R4000).

  A page table is indexed by a virtual address.
  The OS manages the page table for each ASID.
  A valid page table entry codes the physical memory "frame"
  address for the page.

[Diagram: a virtual address indexes the page table, whose entries point to frames in the physical memory space]
                             Details of Page Table

[Diagram: the virtual address (virtual page number + 12-bit offset) indexes the page table, which is located in physical memory and reached through the page table base register; each entry holds a valid bit, access rights, and a physical page number, which is combined with the offset to form the physical address]

   • The page table maps virtual page numbers to physical
     frames ("PTE" = Page Table Entry)
   • Virtual memory ⇒ treat main memory as a cache for disk
           Page tables may not fit in memory!

               A table for 4 KB pages for a 32-bit address
                          space has 1M entries
         Each process needs its own address space!

  Two-level Page Tables

          32-bit virtual address
    31          22 21        12 11        0
         P1 index   P2 index    Page Offset

  The top-level table is wired in main memory.
  A subset of the 1024 second-level tables is in
    main memory; the rest are on disk.
  (A sketch of the field extraction follows below.)
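
A minimal sketch of extracting the two indices and the offset from a 32-bit virtual address, assuming the 10/10/12 split shown above (the struct and function names are mine):

#include <stdint.h>

/* Field split from the slide: bits 31-22 = P1 index, 21-12 = P2 index,
 * 11-0 = page offset (4 KB pages). */
struct va_fields { uint32_t p1, p2, offset; };

static inline struct va_fields split_va(uint32_t va)
{
    struct va_fields f;
    f.p1     = (va >> 22) & 0x3FF;   /* index into the top-level table    */
    f.p2     = (va >> 12) & 0x3FF;   /* index into a second-level table   */
    f.offset =  va        & 0xFFF;   /* offset within the 4 KB page       */
    return f;
}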
          VM and Disk: Page replacement policy

  The page table keeps a dirty bit (set when the page is written)
  and a used bit (set to 1 on any reference) for each page.

  A clock scheme sweeps the set of all pages in memory: the tail
  pointer clears the used bit in the page table; the head pointer
  places pages on the free list if the used bit is still clear,
  and schedules pages with the dirty bit set to be written to
  disk first.

  Architect's role: support setting the dirty and used bits.
  (A minimal sketch of the sweep follows below.)
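
A minimal software sketch of that clock sweep; the frame array, its size, and the function name are invented for illustration, and in a real system the used and dirty bits live in the page table entries and are set by hardware.

#include <stdbool.h>
#include <stddef.h>

#define NUM_FRAMES 1024          /* assumed number of physical page frames */

struct frame {
    bool used;   /* set by hardware on any reference to the page */
    bool dirty;  /* set by hardware on any write to the page     */
};

static struct frame frames[NUM_FRAMES];
static size_t hand;              /* the sweeping pointer ("clock hand")    */

/* Sweep until a frame whose used bit is clear is found; frames passed
 * over get their used bit cleared (a "second chance").  The caller must
 * schedule a write-back first if the victim's dirty bit is set. */
size_t pick_victim(void)
{
    for (;;) {
        size_t idx = hand;
        hand = (hand + 1) % NUM_FRAMES;
        if (!frames[idx].used)
            return idx;          /* candidate to place on the free list */
        frames[idx].used = false;
    }
}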
             Address Map
  V = {0, 1, . . . , n - 1}  virtual address space
  M = {0, 1, . . . , m - 1}  physical address space     (n > m)

  MAP: V --> M U {0}   address mapping function

      MAP(a) = a'  if data at virtual address a is present at physical
                   address a' and a' is in M
             = 0   if data at virtual address a is not present in M

[Diagram: a reference to virtual address a in name space V goes through the address translation mechanism; a missing-item fault invokes the fault handler, the OS transfers the page from secondary memory to main memory, and translation then yields physical address a']
 Implications of Virtual Memory for
            Pipeline design

• Fault?
• Address translation?
             Paging Organization

[Diagram: physical memory frames (frame 0 at P.A. 0, frame 1 at 1024, ..., frame 7 at 7168) and virtual memory pages (page 0 at V.A. 0, page 1 at 1024, ..., page 31 at 31744), each 1 KB, related by the address translation MAP]

  The 1 KB page is the unit of mapping, and also the unit of
  transfer from virtual to physical memory.
        Address Mapping

[Diagram: the virtual address splits into a page number and a 10-bit offset; the page number indexes the page table (located in physical memory) via the page table base register; the entry supplies a valid bit, access rights, and a physical address, which is combined with the offset (actually, concatenation is more likely) to form the physical memory address]
                Address Translation
[Diagram: the CPU sends the VA to translation, and the resulting PA to the cache; a hit returns data, a miss goes on to main memory]
• Page table is a large data structure in memory
• Two memory accesses for every load, store, or instruction
  fetch!!!
• Virtually addressed cache?
   – synonym problem
• Cache the address translations?
TLB Design Concepts
  MIPS Address Translation: How does it work?

   "Virtual Addresses"                     "Physical Addresses"

[Diagram: the CPU's virtual addresses (A0-A31, D0-D31) pass through a Translation Look-Aside Buffer (TLB) that supplies physical addresses to memory]

    Translation Look-Aside Buffer (TLB):
     a small fully-associative cache of
 mappings from virtual to physical addresses.
     What is the table of mappings that it caches?

                 The TLB also contains
         protection bits for the virtual address.
 Fast common case: the virtual address is in the TLB and the
   process has permission to read/write it.
                      TLBs
  A way to speed up translation is to use a special cache of recently
      used page table entries -- this has many names, but the most
      frequently used is Translation Lookaside Buffer or TLB

      TLB entry:  Virtual Address | Physical Address | Dirty | Ref | Valid | Access

  Really just a cache of the page table mappings
  (a lookup sketch follows below)

  TLB access time is comparable to cache access time
      (much less than main memory access time)
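
A minimal sketch of a fully associative TLB lookup; the entry layout, the 64-entry size, and the function name are assumptions made for illustration.

#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 64     /* small, fully associative (size assumed) */
#define PAGE_SHIFT  12     /* 4 KB pages                              */

struct tlb_entry {
    bool     valid;
    uint64_t vpn;          /* virtual page number (the tag)           */
    uint64_t pfn;          /* physical frame number                   */
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Fully associative lookup: compare the VPN against every entry (done
 * in parallel in hardware, as a loop here).  On a hit, form the physical
 * address; on a miss, the page-table walk runs (in software on MIPS). */
bool tlb_lookup(uint64_t va, uint64_t *pa)
{
    uint64_t vpn    = va >> PAGE_SHIFT;
    uint64_t offset = va & ((1ull << PAGE_SHIFT) - 1);

    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pa = (tlb[i].pfn << PAGE_SHIFT) | offset;
            return true;
        }
    }
    return false;
}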
         Translation Look-Aside Buffers
   Just like any other cache, the TLB can be organized as fully associative,
        set associative, or direct mapped

   TLBs are usually small, typically not more than 128 - 256 entries even on
        high end machines. This permits fully associative
        lookup on these machines. Most mid-range machines use small
        n-way set associative organizations.


[Diagram: translation with a TLB — the CPU sends the VA to the TLB lookup (about 1/2 t); on a TLB hit the PA goes to the cache (about t) and, on a cache miss, to main memory (about 100 t); a TLB miss falls back to the full translation]
        Reducing Translation Time

Machines with TLBs go one step further to reduce #
 cycles/cache access
They overlap the cache access with the TLB access:
    high order bits of the VA are used to look in the TLB while
  low order bits are used as index into cache
    The TLB caches page table entries
[Diagram: the TLB caches page table entries (tagged per ASID); the virtual page number is looked up in the TLB while the page offset passes through unchanged, yielding the physical frame address. Physical and virtual pages must be the same size. V=0 pages either reside on disk or have not yet been allocated; the OS handles the resulting page fault. MIPS handles TLB misses in software (random replacement); other machines use hardware.]
    Can TLB and caching be overlapped?
[Diagram: the page offset supplies the cache index and byte select directly, while the virtual page number is translated by the TLB; the resulting physical tag is compared against the tags read from the cache to detect a hit]
  This works, but ...
Q. What is the downside?
   A. Inflexibility. The size of the cache is
      limited by the page size.
           Overlapped Cache & TLB Access

[Diagram: the 32-bit VA feeds an associative TLB lookup (producing the 20-bit page number and TLB hit/miss) in parallel with a 1 KB, 4-byte-block cache indexed by 10 bits taken from the 12-bit page offset; the PA from the TLB is then compared with the cache tag]
     IF cache hit AND (cache tag = PA) then deliver data to CPU
     ELSE IF [cache miss and TLB hit] THEN
                  access memory with the PA from the TLB
     ELSE do standard VA translation
    Problems With Overlapped TLB Access
Overlapped access only works as long as the address bits used to
     index into the cache do not change as the result of VA translation

This usually limits things to small caches, large page sizes, or high
     n-way set associative caches if you want a large cache

Example: suppose everything the same except that the cache is
    increased to 8 K bytes instead of 4 K:

[Diagram: an 8 KB direct-mapped cache needs an 11-bit index plus a 2-bit byte offset, so the top index bit lies above the 12-bit page offset; that bit is changed by VA translation but is needed for the cache lookup]
   Solutions:
        go to 8K byte page sizes;
        go to 2 way set associative cache; or
        SW guarantee VA[13]=PA[13]


[Diagram: an 8 KB, 2-way set-associative cache needs only a 10-bit index plus byte select, which again fits entirely within the page offset]
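
The constraint behind these solutions can be captured in a one-line check; the function and the example sizes below are an illustrative sketch, not something from the slides.

#include <stdbool.h>
#include <stdio.h>

/* Overlapping TLB and cache access is safe when every cache index bit
 * (plus the block offset) lies within the page offset, i.e. when
 * cache_bytes / associativity <= page_bytes. */
static bool overlap_ok(unsigned cache_bytes, unsigned assoc, unsigned page_bytes)
{
    return cache_bytes / assoc <= page_bytes;
}

int main(void)
{
    printf("%d %d %d\n",
           overlap_ok(4096, 1, 4096),   /* 1: 4 KB direct-mapped fits       */
           overlap_ok(8192, 1, 4096),   /* 0: one index bit is translated   */
           overlap_ok(8192, 2, 4096));  /* 1: 8 KB 2-way fits again         */
    return 0;
}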
          Use virtual addresses for the cache?
    "Virtual Addresses"                   "Physical Addresses"

[Diagram: the CPU accesses a virtual cache directly with virtual addresses; only on a cache miss does the TLB translate the address for main memory]


         Only use TLB on a cache miss !

 Downside: a subtle, fatal problem. What is it?

 A. Synonym problem. If two address spaces
 share a physical frame, data may be in cache
twice. Maintaining consistency is a nightmare.
                   Summary #1/3:
               The Cache Design Space
• Several interacting dimensions
    –   cache size
    –   block size
    –   associativity
    –   replacement policy
    –   write-through vs. write-back
    –   write allocation
• The optimal choice is a compromise
    – depends on access characteristics
         » workload
         » use (I-cache, D-cache, TLB)
    – depends on technology / cost
• Simplicity often wins

[Diagram: the cache design space as a cube of cache size, associativity, and block size, plus a generic good/bad trade-off curve between two factors]
        Summary #2/3: Caches
• The Principle of Locality:
   – Programs access a relatively small portion of the
     address space at any instant of time.
      » Temporal Locality: Locality in Time
      » Spatial Locality: Locality in Space
• Three Major Categories of Cache Misses:
   – Compulsory Misses: sad facts of life. Example:
     cold start misses.
   – Capacity Misses: increase cache size
   – Conflict Misses: increase cache size and/or
     associativity.

• Write Policy: Write Through vs. Write Back
• Today CPU time is a function of (ops, cache misses) vs. just
  f(ops): affects Compilers, Data structures, and Algorithms
    Summary #3/3: TLB, Virtual Memory
• Page tables map virtual address to physical address
• TLBs are important for fast translation
• TLB misses are significant in processor performance
   – funny times, as most systems can’t access all of
     2nd level cache without TLB misses!
• Caches, TLBs, Virtual Memory all understood by examining
  how they deal with 4 questions:
  1) Where can block be placed?
  2) How is block found?
  3) What block is replaced on miss?
  4) How are writes handled?
• Today VM allows many processes to share a single memory
  without having to swap all processes to disk; today VM
  protection is more important than the memory hierarchy benefits,
  but computers remain insecure
     AMD Opteron Memory Hierarchy
• 12-stage integer pipeline yields a maximum clock rate of 2.8 GHz
  and fastest memory PC3200 DDR SDRAM
• 48-bit virtual and 40-bit physical addresses
• I and D cache: 64 KB, 2-way set associative, 64-B block, LRU
• L2 cache: 1 MB, 16-way, 64-B block, pseudo LRU
• Data and L2 caches use write back, write allocate
• L1 caches are virtually indexed and physically tagged
• L1 I TLB and L1 D TLB: fully associative, 40 entries
   – 32 entries for 4 KB pages and 8 for 2 MB or 4 MB pages
• L2 I TLB and L2 D TLB: 4-way, 512 entries for 4 KB pages
• Memory controller allows up to 10 cache misses
   – 8 from D cache and 2 from I cache
   Opteron Memory Hierarchy Performance
• For SPEC2000
   – I cache misses per instruction is 0.01% to 0.09%
   – D cache misses per instruction are 1.34% to 1.43%
   – L2 cache misses per instruction are 0.23% to 0.36%
• Commercial benchmark (“TPC-C-like”)
   – I cache misses per instruction is 1.83% (100X!)
   – D cache misses per instruction are 1.39% ( same)
   – L2 cache misses per instruction are 0.62% (2X to 3X)
• How does this compare to the ideal CPI of 0.33?
      CPI breakdown for Integer Programs

[Chart: CPI (0 to 3) for perlbmk, crafty, eon, gzip, gap, vortex, bzip2, gcc, parser, vpr, twolf, and TPC-C, split into Base CPI, Min Pipeline Stall, and Max Memory CPI]
• CPI above base attributable to memory ≈ 50%
• L2 cache misses ≈ 25% overall (50% of the memory CPI)
      – Assumes misses are not overlapped with the execution
        pipeline or with each other, so the pipeline stall portion
        is a lower bound
            CPI breakdown for Floating Pt. Programs

[Chart: CPI (0 to 3) for the SPECfp2000 programs, split into Base CPI, Min Pipeline Stall, and Max Memory CPI]
       • CPI above base attributable to memory ≈ 60%
       • L2 cache misses ≈ 40% overall (70% of the memory CPI)
            – Assumes misses are not overlapped with the execution
              pipeline or with each other, so the pipeline stall portion
              is a lower bound
 Pentium 4 vs. Opteron Memory Hierarchy

 CPU                 Pentium 4 (3.2 GHz*)             Opteron (2.8 GHz*)
 Instruction cache   Trace cache (8K micro-ops)       2-way associative, 64 KB, 64B block
 Data cache          8-way associative, 16 KB,        2-way associative, 64 KB,
                     64B block, inclusive in L2       64B block, exclusive to L2
 L2 cache            8-way associative,               16-way associative,
                     2 MB, 128B block                 1 MB, 64B block
 Prefetch            8 streams to L2                  1 stream to L2
 Memory              200 MHz x 64 bits                200 MHz x 128 bits

            *Clock rate for this comparison in 2005; faster versions existed
                   Misses Per Instruction: Pentium 4 vs. Opteron

[Chart: ratio of misses per instruction (Pentium 4 / Opteron) for the D cache and the L2 cache across gzip, vpr, gcc, mcf, crafty (SPECint2000) and wupwise, swim, mgrid, applu, mesa (SPECfp2000); ratios range from about 0.5X to 3.4X, with values above 1 meaning Opteron is better]
                   • D cache miss: P4 is 2.3X to 3.4X vs. Opteron
                   • L2 cache miss: P4 is 0.5X to 1.5X vs. Opteron
                   • Note: Same ISA, but not same instruction count
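The metric plotted above is misses per instruction (MPI) rather than miss rate, which matters because the two machines run the same ISA but execute different compiled code. A small sketch of the relationship, using purely illustrative numbers rather than the measured P4/Opteron data:

/* Minimal sketch of the metric behind the chart: misses per instruction.
 * Values are illustrative placeholders, not measured P4/Opteron data. */
#include <stdio.h>

int main(void) {
    double miss_rate          = 0.03;  /* misses per memory reference    */
    double mem_refs_per_instr = 1.4;   /* loads+stores+fetches per instr */
    double instructions       = 1.0e9; /* retired instruction count      */

    double mpi    = miss_rate * mem_refs_per_instr;  /* misses per instr */
    double misses = mpi * instructions;              /* total misses     */

    /* With the same ISA but different compilers, 'instructions' differs,
     * so a ratio of MPIs also reflects instruction-count differences.   */
    printf("MPI = %.4f, total misses = %.3g\n", mpi, misses);
    return 0;
}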
                       Fallacies and Pitfalls
•   Not delivering high memory bandwidth in a cache-based system

      – The 10 fastest computers on the Stream benchmark [McCalpin 2005]
      – Only 4 of the 10 rely on data caches, and their memory
        BW per processor is 7X to 25X lower than the NEC SX-7


[Chart, log scale (1,000 to 1,000,000 MB/s): System Memory BW and
 Per-Processor Memory BW for the top 10 Stream results; systems include
 NEC SX-7, SX-5, and SX-4 vector machines, SGI Altix 3000, and HP
 AlphaServer configurations with various processor counts.]
Questions?
Additional reference material
            Main Memory Background

• Performance of Main Memory:
  – Latency: Cache Miss Penalty
      » Access Time: time from when a request is issued
        until the word arrives
      » Cycle Time: minimum time between successive requests
  – Bandwidth: I/O & Large Block Miss Penalty
    (L2)
• Main Memory is DRAM: Dynamic Random Access Memory
  – Dynamic since it needs to be refreshed
    periodically (every 8 ms; ≈ 1% of time)
  – Addresses divided into 2 halves (memory as a
    2D matrix):
      » RAS or Row Access Strobe
      » CAS or Column Access Strobe
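A quick sanity check on the "≈ 1% of time" refresh figure: the sketch below multiplies an assumed per-row refresh time by the number of rows and divides by the 8 ms refresh interval. The row count and per-row time are illustrative assumptions, chosen only to land near the quoted 1%.

/* Minimal sketch: fraction of time a DRAM spends refreshing.
 * Parameters are assumptions for illustration, not a datasheet. */
#include <stdio.h>

int main(void) {
    double rows             = 2048.0;  /* rows to refresh per interval      */
    double row_refresh_time = 40e-9;   /* seconds per row refresh (assumed) */
    double refresh_interval = 8e-3;    /* every row refreshed each 8 ms     */

    double overhead = rows * row_refresh_time / refresh_interval;
    printf("refresh overhead = %.2f%% of time\n", 100.0 * overhead);
    return 0;
}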
          Main Memory Deep Background

•   “Out-of-Core”, “In-Core,” “Core Dump”?
•   “Core memory”?
•   Non-volatile, magnetic
•   Lost to 4 Kbit DRAM (today using 512Mbit DRAM)
•   Access time 750 ns, cycle time 1500-3000 ns
         DRAM logical organization (4 Mbit)

[Figure: 4-Mbit DRAM organization. The 11 address lines A0…A10 are decoded
 into a 2,048 x 2,048 memory array of storage cells; a word line selects a
 row, and the sense amps & I/O block plus the column decoder select the
 column, driving the data pins D and Q.]
  • Square root of bits per RAS/CAS
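The multiplexed addressing implied by the figure can be made concrete: for the 2,048 x 2,048 array, a 22-bit cell address is split into an 11-bit row half (sent with RAS) and an 11-bit column half (sent with CAS) over the same pins A0…A10. A minimal sketch of that split (the bit assignment is illustrative, not a specific part's pinout):

/* Minimal sketch: splitting a cell address for the 2,048 x 2,048 array above.
 * The 22-bit cell address is time-multiplexed over the 11 pins A0..A10. */
#include <stdio.h>

int main(void) {
    unsigned int cell = 0x12345 & 0x3FFFFF;   /* some 22-bit cell address */

    unsigned int row = (cell >> 11) & 0x7FF;  /* upper 11 bits -> RAS     */
    unsigned int col =  cell        & 0x7FF;  /* lower 11 bits -> CAS     */

    printf("row = %u, col = %u\n", row, col);
    return 0;
}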
     Quest for DRAM Performance
1. Fast Page mode
  – Add timing signals that allow repeated
    accesses to row buffer without
    another row access time
  – Such a buffer comes naturally, as
    each array will buffer 1024 to 2048
    bits for each access
2. Synchronous DRAM (SDRAM)
  – Add a clock signal to DRAM interface,
    so that the repeated transfers would
    not bear overhead to synchronize with
    DRAM controller
3. Double Data Rate (DDR SDRAM)
  – Transfer data on both the rising edge and
    the falling edge of the DRAM clock signal,
    doubling the peak data rate
         DRAM name based on Peak Chip Transfers / Sec
         DIMM name based on Peak DIMM MBytes / Sec

Standard   Clock Rate   M transfers/     DRAM Name    Mbytes/s       DIMM Name
           (MHz)        second (x2)                   per DIMM (x8)
DDR         133          266             DDR266        2128          PC2100
DDR         150          300             DDR300        2400          PC2400
DDR         200          400             DDR400        3200          PC3200
DDR2        266          533             DDR2-533      4264          PC4300
DDR2        333          667             DDR2-667      5336          PC5300
DDR2        400          800             DDR2-800      6400          PC6400
DDR3        533         1066             DDR3-1066     8528          PC8500
DDR3        666         1333             DDR3-1333    10664          PC10700
DDR3        800         1600             DDR3-1600    12800          PC12800

(Fastest for sale 4/06: $125/GB)
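The naming rules in the table reduce to two multiplications: a DDR DRAM transfers twice per clock, and an 8-byte-wide (64-bit) DIMM moves 8 bytes per transfer. The sketch below reproduces one row of the table from its clock rate; it is a back-of-the-envelope check, not a vendor formula (marketing names also round, e.g., 2128 MB/s is sold as PC2100).

/* Minimal sketch of the x2 and x8 factors in the table above. */
#include <stdio.h>

int main(void) {
    double clock_mhz = 200.0;                           /* DDR400 base clock */

    double mtransfers_per_s  = 2.0 * clock_mhz;         /* -> "DDR400"       */
    double dimm_mbytes_per_s = 8.0 * mtransfers_per_s;  /* -> "PC3200"       */

    printf("DDR%.0f / PC%.0f\n", mtransfers_per_s, dimm_mbytes_per_s);
    return 0;
}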
        Need for Error Correction!
• Motivation:
   – Failures/time proportional to number of
     bits!
   – As DRAM cells shrink, more vulnerable
• Went through period in which failure rate was
  low enough without error correction that people
  didn’t do correction
   – DRAM banks too large now
   – Servers always corrected memory
     systems
• Basic idea: add redundancy through parity bits
   – Common configuration: Random error
     correction
      » SEC-DED (single error correct, double
        error detect)
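To make the parity-bit idea concrete, here is a minimal single-error-correcting Hamming(7,4) sketch in C: three parity bits protect four data bits, and the recomputed parity (the "syndrome") directly names the position of a flipped bit. Memory SEC-DED codes use the same idea over 64 data bits plus 8 check bits, with one extra parity bit for double-error detection; this toy version is for illustration only.

/* Minimal sketch of single-error correction with a Hamming(7,4) code.
 * Not a production ECC implementation. */
#include <stdio.h>

/* Encode 4 data bits d3..d0 into 7 bits: parity bits at positions 1, 2, 4. */
static unsigned encode(unsigned d) {
    unsigned d0 = d & 1, d1 = (d >> 1) & 1, d2 = (d >> 2) & 1, d3 = (d >> 3) & 1;
    unsigned p1 = d0 ^ d1 ^ d3;       /* covers positions 3, 5, 7 */
    unsigned p2 = d0 ^ d2 ^ d3;       /* covers positions 3, 6, 7 */
    unsigned p4 = d1 ^ d2 ^ d3;       /* covers positions 5, 6, 7 */
    /* code word bit at position i is stored at bit index i-1 */
    return p1 | (p2 << 1) | (d0 << 2) | (p4 << 3) | (d1 << 4) | (d2 << 5) | (d3 << 6);
}

/* XOR of the positions of all set bits: 0 if valid, else the error position. */
static unsigned syndrome(unsigned c) {
    unsigned s = 0;
    for (unsigned pos = 1; pos <= 7; pos++)
        if ((c >> (pos - 1)) & 1)
            s ^= pos;
    return s;
}

int main(void) {
    unsigned code = encode(0xB);       /* protect the data nibble 1011   */
    code ^= 1u << 4;                   /* inject an error at position 5  */

    unsigned s = syndrome(code);
    if (s) code ^= 1u << (s - 1);      /* correct the flipped bit        */
    printf("syndrome = %u, corrected code word = 0x%02X\n", s, code);
    return 0;
}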
   Introduction to Virtual Machines


• VMs developed in late 1960s
   – Remained important in mainframe
     computing over the years
   – Largely ignored in single user
     computers of 1980s and 1990s
• Recently regained popularity due to
   – increasing importance of isolation
     and security in modern systems,
   – failures in security and reliability
     of standard operating systems,
   – sharing of a single computer among
     many unrelated users (e.g., in a datacenter)
    What is a Virtual Machine (VM)?
• Broadest definition includes all emulation
  methods that provide a standard software
  interface, such as the Java VM
• “(Operating) System Virtual Machines” provide
  a complete system-level environment at the
  binary ISA level
  – Here assume ISAs always match the
    native hardware ISA
  – E.g., IBM VM/370, VMware ESX
    Server, and Xen
• Present illusion that VM users have entire
  computer to themselves, including a copy of OS
• A single computer runs multiple VMs and can
  support multiple, different OSes
  – On a conventional platform, a single OS
    “owns” all HW resources; with VMs, multiple
    OSes share the HW resources
  Virtual Machine Monitors (VMMs)


• Virtual machine monitor (VMM) or hypervisor
  is software that supports VMs
• VMM determines how to map virtual
  resources to physical resources
• Physical resource may be time-shared,
  partitioned, or emulated in software
• VMM is much smaller than a traditional OS;
  – isolation portion of a VMM is ≈
    10,000 lines of code
             VMM Overhead?


• Depends on the workload
• User-level processor-bound programs (e.g.,
  SPEC) have zero-virtualization overhead
   – Runs at native speeds since OS
     rarely invoked
• I/O-intensive workloads are OS-intensive: they
  execute many system calls and privileged
  instructions, which can result in high
  virtualization overhead
   – For System VMs, goal of
     architecture and VMM is to run
     almost all instructions directly on
     native hardware
           Other Uses of VMs
• Focus here on protection
• 2 Other commercially important uses of VMs
1. Managing Software
  – VMs provide an abstraction that can
    run the complete SW stack, even
    including old OSes like DOS
  – Typical deployment: some VMs running
    legacy OSes, many running current
    stable OS release, few testing next
    OS release
2. Managing Hardware
  – VMs allow separate SW stacks to run
    independently yet share HW, thereby
    consolidating the number of servers
 Requirements of a Virtual Machine Monitor
• A VM Monitor
   – Presents a SW interface to guest
     software,
   – Isolates state of guests from each
     other, and
   – Protects itself from guest software
     (including guest OSes)
• Guest software should behave on a VM exactly
  as if running on the native HW
   – Except for performance-related
     behavior or limitations of fixed
     resources shared by multiple VMs
• Guest software should not be able to change
  allocation of real system resources directly
    Requirements of a Virtual Machine Monitor
•  VMM must be at a higher privilege level than the
   guest VMs, which generally run in user mode
    ⇒ Execution of privileged instructions is
      handled by the VMM
•  E.g., timer interrupt: VMM suspends the
   currently running guest VM, saves its state,
   handles the interrupt, determines which guest VM
   to run next, and then loads its state
   (sketched in code after this slide)
    – Guest VMs that rely on a timer
      interrupt are provided with a virtual timer
      and an emulated timer interrupt by
      the VMM
•  Requirements of system virtual machines are the
   same as for paged virtual memory:
1. At least 2 processor modes, system and user
2. A privileged subset of instructions, available only
   in system mode and trapping if executed in user
   mode; all system resources controllable only via
   these instructions
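The timer-interrupt sequence above can be sketched as code. Everything in the sketch (the structure, function names, and two-guest setup) is invented for illustration and does not correspond to any real VMM; it only mirrors the suspend / save / handle / schedule / load sequence described in the bullet.

/* Schematic sketch of the timer-interrupt steps described above.
 * Every type and function here is invented for illustration only. */
#include <stdio.h>

struct vm_state { int id; };              /* stand-in for saved guest state */

static struct vm_state guests[2] = { {0}, {1} };
static struct vm_state *current_vm = &guests[0];

static void save_state(struct vm_state *vm)  { printf("save guest %d\n", vm->id); }
static void load_state(struct vm_state *vm)  { printf("load guest %d\n", vm->id); }
static void post_virtual_timer_interrupt(struct vm_state *vm) {
    printf("deliver virtual timer interrupt to guest %d\n", vm->id);
}
static struct vm_state *schedule_next_vm(void) { return &guests[1]; }

/* Runs in the VMM at the higher privilege level; guests run in user mode. */
static void vmm_timer_interrupt(void) {
    save_state(current_vm);                    /* suspend the running guest   */
    post_virtual_timer_interrupt(current_vm);  /* emulate its timer interrupt */
    current_vm = schedule_next_vm();           /* pick which guest runs next  */
    load_state(current_vm);                    /* resume it on real hardware  */
}

int main(void) { vmm_timer_interrupt(); return 0; }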
   ISA Support for Virtual Machines
• If VMs are planned for during the design of an ISA,
  it is easy to reduce both the number of instructions
  that must be executed by a VMM and the time it takes
  to emulate them
   – Since VMs have been considered for
     desktop/PC server apps only recently,
     most ISAs were created without
     virtualization in mind, including 80x86
     and most RISC architectures
• VMM must ensure that the guest system only
  interacts with virtual resources ⇒ a conventional
  guest OS runs as a user-mode program on top of
  the VMM
   – If the guest OS attempts to access or
     modify information related to HW resources
     via a privileged instruction (e.g., reading
     or writing the page table pointer), it
     traps to the VMM
   Impact of VMs on Virtual Memory
• How is virtual memory virtualized if each guest OS
  in every VM manages its own set of page tables?
• VMM separates real and physical memory
   – Makes real memory a separate,
     intermediate level between virtual
     memory and physical memory
   – Some use the terms virtual memory,
     physical memory, and machine memory to
     name the 3 levels
   – Guest OS maps virtual memory to real
     memory via its page tables, and VMM
     page tables map real memory to physical
     memory
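A minimal sketch of that two-level mapping, with invented page-table contents: the guest's table gives guest-virtual to real, the VMM's table gives real to physical, and a guest memory access conceptually composes the two. (Real VMMs typically avoid doing two lookups per access by keeping a combined "shadow" page table, unless the hardware provides an extra level of indirection, as the next slide describes for the IBM 370.)

/* Minimal sketch of the two-level mapping described above.
 * Page numbers and table sizes are illustrative only. */
#include <stdio.h>

#define NPAGES 16
static int guest_pt[NPAGES];   /* guest virtual page -> real page     (guest OS) */
static int vmm_pt[NPAGES];     /* real page          -> physical page (VMM)      */

static int translate(int guest_vpage) {
    int real_page = guest_pt[guest_vpage];   /* guest OS's mapping                */
    return vmm_pt[real_page];                /* VMM's mapping, invisible to guest */
}

int main(void) {
    for (int i = 0; i < NPAGES; i++) {       /* fill in some made-up mappings */
        guest_pt[i] = (i + 3) % NPAGES;
        vmm_pt[i]   = (i * 5) % NPAGES;
    }
    printf("guest virtual page 2 -> physical page %d\n", translate(2));
    return 0;
}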
   ISA Support for VMs & Virtual
             Memory

• IBM 370 architecture added additional level
  of indirection that is managed by the VMM
  – Guest OS keeps its page tables as
    before, so the shadow pages are
    unnecessary
• To virtualize software TLB, VMM manages
  the real TLB and has a copy of the contents
  of the TLB of each guest VM
  – Any instruction that accesses the
    TLB must trap
  – TLBs with Process ID tags support
    a mix of entries from different VMs
    and the VMM
     Impact of I/O on Virtual Machines
•   Most difficult part of virtualization
    – Increasing number of I/O devices
      attached to the computer
    – Increasing diversity of I/O device
      types
    – Sharing of a real device among multiple
      VMs,
    – Supporting the myriad of device drivers
      that are required, especially if
      different guest OSes are supported on
      the same VM system
•   Give each VM generic versions of each type of
    I/O device driver, and let the VMM handle real
    I/O
                 Example: Xen VM
•   Xen: Open-source System VMM for 80x86 ISA
    – Project started at University of Cambridge,
      GNU license model
•   Original vision of VM is running unmodified OS
    – Significant wasted effort just to keep guest
      OS happy
•  “paravirtualization” - small modifications to guest OS to
   simplify virtualization
3 Examples of paravirtualization in Xen:
1. To avoid flushing the TLB when invoking the VMM, Xen is
   mapped into the upper 64 MB of the address space of each VM
2. Guest OS is allowed to allocate pages; Xen just checks that
   they don't violate protection restrictions
3. To protect the guest OS from user programs in the VM,
   Xen takes advantage of the 4 protection levels available in
   the 80x86
    – Most OSes for the 80x86 keep everything at
      privilege level 0 or 3
        Xen changes for paravirtualization
     • Port of Linux to Xen changed ≈ 3000 lines,
       or ≈ 1% of 80x86-specific code
          – Does not affect application-binary
            interfaces of guest OS
     • OSes supported in Xen 2.0:

       OS           Runs as host OS    Runs as guest OS
       Linux 2.4    Yes                Yes
       Linux 2.6    Yes                Yes
       NetBSD 2.0   No                 Yes
       NetBSD 3.0   Yes                Yes
       Plan 9       No                 Yes
       FreeBSD 5    No                 Yes

 http://wiki.xensource.com/xenwiki/OSCompatibility
               Xen and I/O

• To simplify I/O, privileged VMs assigned to
  each hardware I/O device: “driver domains”
   – Xen Jargon: “domains” = Virtual
     Machines
• Driver domains run physical device drivers,
  although interrupts still handled by VMM
  before being sent to appropriate driver
  domain
• Regular VMs (“guest domains”) run simple
  virtual device drivers that communicate with
  the physical device drivers in the driver
  domains over a channel to access the physical
  I/O hardware
• Data is sent between guest and driver domains
  by page remapping (rather than copying)
                                                Xen Performance
                          • Performance relative to native Linux for Xen for
                            6 benchmarks from Xen developers
                          Performance relative to native Linux:
                             SPEC INT2000                 100%
                             Linux build time              97%
                             PostgreSQL Inf. Retrieval     92%
                             PostgreSQL OLTP               95%
                             dbench                        96%
                             SPEC WEB99                    99%

                          • Slide 40: Which of these are user-level
                            processor-bound programs? Which are
                            I/O-intensive? Which are I/O-bound?
                                                  Xen Performance, Part II
                                  • A subsequent study noticed the Xen experiments were
                                    based on 1 Ethernet network interface card (NIC),
                                    and the single NIC was a performance bottleneck
[Chart: receive throughput (0 to 2,500 Mbits/sec) vs. number of network
 interface cards (1 to 4), for native Linux, the Xen-privileged driver VM
 ("driver domain"), and a Xen guest VM + driver VM.]
                                        Xen Performance, Part III
[Chart: event counts relative to the Xen-privileged driver domain (0 to 4.5)
 for native Linux, Xen-privileged driver VM only, and Xen guest VM + driver VM,
 measured for instructions, L2 misses, I-TLB misses, and D-TLB misses.]
1. > 2X instructions for guest VM + driver VM
2. > 4X L2 cache misses
3. 12X – 24X Data TLB misses
          Xen Performance, Part IV
1. > 2X instructions: page remapping and page transfer
   between driver and guest VMs and due to communication
   between the 2 VMs over a channel
2. 4X L2 cache misses: Linux uses zero-copy network
   interface that depends on ability of NIC to do DMA from
   different locations in memory
    – Since Xen does not support “gather DMA” in
      its virtual network interface, it can’t do true
      zero-copy in the guest VM
3. 12X – 24X Data TLB misses: 2 Linux optimizations
    – Superpages for part of Linux kernel space: one
      4 MB page lowers TLB misses versus using
      1,024 4 KB pages. Not in Xen
    – PTEs marked global are not flushed on a
      context switch, and Linux uses them for its
      kernel space. Not in Xen
•   Future Xen may address 2. and 3., but 1. inherent?
         And in Conclusion [1/2] …
• Memory wall inspires optimizations since so much
  performance lost there
   – Reducing hit time: Small and simple
     caches, Way prediction, Trace caches
   – Increasing cache bandwidth: Pipelined
     caches, Multibanked caches,
     Nonblocking caches
   – Reducing Miss Penalty: Critical word
     first, Merging write buffers
   – Reducing Miss Rate: Compiler
     optimizations
   – Reducing miss penalty or miss rate via
     parallelism: Hardware prefetching, Compiler
     prefetching
         And in Conclusion [2/2] …
• VM Monitor presents a SW interface to guest
  software, isolates state of guests, and protects
  itself from guest software (including guest OSes)
• Virtual Machine Revival
   – Overcome security flaws of large
     OSes
   – Manage Software, Manage
     Hardware
   – Processor performance no longer
     highest priority
• Virtualization challenges for processor, virtual
  memory, and I/O
          Protection and Instruction Set
                   Architecture
•   Example Problem: 80x86 POPF instruction
    loads flag registers from top of stack in memory
    – One such flag is Interrupt Enable (IE)
    – In system mode, POPF changes IE
    – In user mode, POPF simply changes all flags
      except IE
    – Problem: a guest OS runs in user mode inside
      a VM, so it expects to see IE changed by
      POPF, but it won't be
•  Historically, IBM mainframe HW and VMM took 3
   steps:
1. Reduce cost of processor virtualization
    – Intel/AMD proposed ISA changes to reduce
      this cost
2. Reduce interrupt overhead cost due to virtualization
3. Reduce interrupt cost by steering interrupts to the
   proper VM directly, without invoking the VMM
           80x86 VM Challenges
•  18 instructions cause problems for
   virtualization:
1. Read control registers in user mode that reveal
   that the guest operating system is running in a
   virtual machine (such as POPF), and
2. Check protection as required by the segmented
   architecture but assume that the operating
   system is running at the highest privilege level
• Virtual memory: 80x86 TLBs do not support
  process ID tags ⇒ more expensive for the VMM and
  guest OSes to share the TLB
    – each address space change typically
      requires a TLB flush
    Intel/AMD address 80x86 VM Challenges
•   Goal is direct execution of VMs on 80x86
•   Intel's VT-x
    – A new execution mode for running
      VMs
    – An architected definition of the
      VM state
    – Instructions to swap VMs rapidly
    – Large set of parameters to select
      the circumstances where a VMM
      must be invoked
    – VT-x adds 11 new instructions to
      80x86
                   Outline


•   Virtual Machines
•   Xen VM: Design and Performance
•   Administrivia
•   AMD Opteron Memory Hierarchy
•   Opteron Memory Performance vs. Pentium 4
•   Fallacies and Pitfalls
•   Discuss “Virtual Machines” paper
•   Conclusion
     “Virtual Machine Monitors: Current
    Technology and Future Trends” [1/2]
• Mendel Rosenblum and Tal Garfinkel, IEEE
  Computer, May 2005
• How old are VMs? Why did they lie fallow so long?
• Why do authors say this technology got hot
  again?
   – What was the tie in to massively
     parallel processors?
• Why would VM re-invigorate technology transfer
  from OS researchers?
• Why is paravirtualization popular with academics?
  What is drawback to commercial vendors?
• How does VMware handle privileged mode
  instructions that are not virtualizable?
• How does VMware ESX Server take memory
  pages away from Guest OS?
    “Virtual Machine Monitors: Current
   Technology and Future Trends” [2/2]

• How does VMware avoid having lots of
  redundant copies of same code and data?
• How did IBM mainframes virtualize I/O?
• How did VMware Workstation handle the
  many I/O devices of a PC? Why did they do
  it? What were drawbacks?
• What does VMware ESX Server do for
  I/O? Why does it work well?
• What important role do you think VMs will
  play in the future of Information Technology?
  Why?
• What is implication of multiple
  processors/chip and VM?
                 And in Conclusion
• Virtual Machine Revival
   – Overcome security flaws of modern OSes
   – Processor performance no longer highest
     priority
   – Manage Software, Manage Hardware
• “… VMMs give OS developers another opportunity to develop
  functionality no longer practical in today’s complex and
  ossified operating systems, where innovation moves at
  geologic pace.”
                                     [Rosenblum and Garfinkel, 2005]
• Virtualization challenges for processor, virtual memory, I/O
   – Paravirtualization, ISA upgrades to cope with those
     difficulties
• Xen as example VMM using paravirtualization
   – 2005 performance on non-I/O bound, I/O
     intensive apps: 80% of native Linux without
     driver VM, 34% with driver VM
• Opteron memory hierarchy still critical to performance
Questions?

				