Pentium
by dsouzaankit


Features of Pentium

   •   Introduced in 1993 with clock frequencies ranging from 60 to 66 MHz

   •   The primary changes in the Pentium processor were:

          – 64-bit data path

          – Instruction Cache

          – Data Cache

          – Parallel Integer Execution units

          – Enhanced Floating point unit

Pentium Architecture

   •   It has a 64-bit data bus and a 32-bit address bus

   •   There are two separate 8 KB caches – one for code and one for data.

   •   Each cache has a separate address-translation TLB which translates linear addresses
       to physical addresses.

•   There are 256 lines between the code cache and the prefetch buffers, permitting
    prefetching of 32 bytes (256/8) of instructions at a time

•   Four prefetch buffers within the processor work as two independent pairs.

       •    When instructions are prefetched from the cache, they are placed into one set of
            prefetch buffers.

       •    The other set is used when a branch operation is predicted.

•   Pentium is a two-issue superscalar processor in which two instructions are fetched and
    decoded simultaneously.

•   Thus the decode unit contains two parallel decoders which decode and issue up to two
    sequential instructions into the execution pipelines.

•   The Control ROM contains the microcode which controls the sequence of operations
    performed by the processor. It has direct control over both pipelines.

•   The control unit handles exceptions, breakpoints and interrupts.

•   It controls the integer pipelines and floating point sequences.

•   There are two parallel integer instruction pipelines: u-pipeline and v-pipeline

•   The u-pipeline has a barrel shifter

•   There is also a separate FPU pipeline with individual floating point add, multiply and
    divide operational units.

•   The data cache is dual-ported, accessible by both the u and v pipelines simultaneously

•   There is a Branch Target Buffer (BTB) supplying jump target prefetch addresses to
    the code cache

•   Address generators are equivalent to segmentation unit

•   The Paging Unit is enabled by setting the PG bit in CR0. It supports the paging
    mechanism, handling two linear addresses at the same time to serve both pipelines.
    TLBs associated with each cache perform the translations

    Integer Pipeline

•   Pentium has a 5-stage integer pipeline, branching out into the two paths u and v in
    the last three stages.

•   The stages are as follows:
   •   P (Prefetch): The CPU prefetches code from code cache

   •   D1 (Instruction Decode):

          – The CPU decodes the instruction to generate a control word.

          – A single control word causes direct execution of an instruction.

          – Complex instructions require microcoded control sequencing

   •   D2 (Address Generate): The CPU decodes the control word and generates addresses
       for data reference

   •   EX (Execute):

          – The instruction is executed in the ALU

          – If needed, the barrel shifter is used

          – If needed, the data cache is accessed

   •   WB (Writeback): The CPU stores the result and updates the flags
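The stage flow above can be illustrated with a toy pipeline schedule. This is a minimal sketch assuming an ideal pipeline with no stalls or branches; `pipeline_schedule` is a hypothetical helper, not a model of real Pentium timing.

```python
# Minimal sketch: ideal 5-stage pipeline timing, one new instruction per clock.
STAGES = ["PF", "D1", "D2", "EX", "WB"]

def pipeline_schedule(n_instructions):
    """Return {instruction index: {stage name: clock number}}."""
    schedule = {}
    for i in range(n_instructions):
        # instruction i enters PF at clock i and advances one stage per clock
        schedule[i] = {stage: i + s for s, stage in enumerate(STAGES)}
    return schedule

# Once the pipeline is full, one instruction completes WB every clock:
sched = pipeline_schedule(3)
```

With this schedule, instruction 0 writes back at clock 4 and each following instruction one clock later, which is the throughput benefit pipelining provides.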

Superscalar Operation of Pentium

   •   To understand the superscalar operation of u and v pipeline, we have to distinguish
       between simple and complex instructions.

   •   Simple instructions are entirely hardwired, do not require any microcode control and,
       in general, execute in one clock cycle
   •   The exceptions are the ALU mem,reg and ALU reg,mem instructions, which are 3-
       and 2-clock operations respectively.

   •   Sequencing hardware is used to allow them to function as simple instructions.

   •   The following integer instructions are considered simple and may be paired:

   1. mov reg, reg/mem/imm

   2. mov mem, reg/imm

   3. alu reg, reg/mem/imm

   4. alu mem, reg/imm

   5. inc reg/mem

   6. dec reg/mem

   7. push reg/mem

   8. pop reg

   9. lea reg,mem

   10. jmp/call/jcc near

   11. nop

   12. test reg, reg/mem

   13. test acc, imm

Integer Instruction Pairing Rules

   •   In order to issue two instructions simultaneously, they must satisfy the following
       conditions:

          – Both instructions in the pair must be “simple”.

          – There must be no read-after-write or write-after-write register dependencies
            between them

          – Neither instruction may contain both a displacement and an immediate

          – Instructions with prefixes can only occur in the u-pipe (except for JCC
            instructions)
Instruction Issue Algorithm

   •   Decode the two consecutive instructions I1 and I2

   •   If all of the following are true:

           – I1 and I2 are simple instructions

           – I1 is not a jump instruction

           – The destination of I1 is not a source of I2

           – The destination of I1 is not a destination of I2

       then issue I1 to pipeline u and I2 to pipeline v

   •   Else issue only I1 to pipeline u
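The issue algorithm above can be sketched as a small decision function. This is a hedged model: `Instr` and its fields are hypothetical stand-ins for decoded instruction attributes, not a real Pentium data structure.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    simple: bool
    is_jump: bool = False
    dest: str = ""        # "" means no register destination
    sources: tuple = ()

def issue(i1, i2):
    """Return ('u', 'v') if the pair dual-issues, else ('u',)."""
    if (i1.simple and i2.simple
            and not i1.is_jump
            and i1.dest not in i2.sources                 # no read-after-write
            and (i1.dest == "" or i1.dest != i2.dest)):   # no write-after-write
        return ("u", "v")
    return ("u",)

# e.g. MOV AX,1 paired with ADD BX,CX (independent simple instructions):
pair = issue(Instr(True, dest="AX"), Instr(True, dest="BX", sources=("BX", "CX")))
```

A dependent pair such as MOV AX,1 followed by ADD BX,AX would fail the read-after-write check and issue only to the u pipeline.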

Floating Point Unit of Pentium

   •   The floating-point unit (FPU) of the Pentium processor is heavily pipelined.

   •   The FPU is designed to be able to accept one floating-point operation every clock.

   •   It can receive up to two floating-point instructions every clock, one of which must be
       an exchange instruction (FXCH).

   •   The 8 FP pipeline stages are summarized below:

   1. PF Prefetch

   2. D1 Instruction Decode

   3. D2 Address generation

   4. EX Memory and register read: This stage performs register reads or memory reads

   5. X1 Floating-Point Execute stage one: conversion of external memory format to
      internal FP data format

   6. X2 Floating-Point Execute stage two

   7. WF Perform rounding and write floating-point result to register file

   8. ER Error Reporting/Update Status Word.

   •   The rules for how floating-point (FP) instructions are issued on the Pentium
       processor are:

   1. FP instructions do not get paired with integer instructions.
   2. When a pair of FP instructions is issued to the FPU, only the FXCH instruction can be
      the second instruction of the pair. The first instruction of the pair must be one of a
      set F, where F = { FLD, FADD, FSUB, … }

   3. FP instructions other than FXCH and instructions belonging to set F always get
      issued singly to the FPU.

   4. FP instructions that are not directly followed by an FP exchange instruction are issued
      singly to the FPU.
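Rules 1–4 can be collected into a small sketch. This assumes a simplified model in which instructions are identified by mnemonic strings; the set F is shown truncated, as in the notes above, so `F_SET` is only a partial list.

```python
# Hedged sketch of the FP pairing rules: only FXCH may be the second
# instruction of a pair, and the first must belong to set F.
F_SET = {"FLD", "FADD", "FSUB"}   # partial: the full set is truncated in the notes

def fp_issue(first, second):
    """Return 2 if the two FP instructions pair, else 1 (issued singly)."""
    if second == "FXCH" and first in F_SET:
        return 2
    return 1
```

So FLD followed by FXCH dual-issues, while FADD followed by FMUL issues one at a time.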

Branch Prediction Logic

   •   Pentium Processor uses Branch Target Buffer (BTB) to predict the outcome of branch
       instructions which minimizes pipeline stalls due to prefetch delays

   •   When a branch is correctly predicted, no performance penalty is incurred.

   •   But if the prediction is incorrect, it causes a 3-cycle penalty in the u pipeline and
       a 4-cycle penalty in the v pipeline.

   •   When a call or conditional jump is mispredicted, a 3-clock penalty is incurred

   •   BTB is a cache with 256 entries.

   •   The directory entry for each line contains the following information:

   •   A Valid Bit that indicates whether or not the entry is in use.

   •   History Bits that track how often the branch is taken

   •   The source memory address that the branch instruction was fetched from

   •   The BTB sits off to the side of the D1 stages of the two pipelines and monitors them
       for branch instructions

   •   The first time a branch instruction enters a pipeline, the BTB uses its source
       address to perform a lookup in the cache; this results in a BTB miss.

   •   When instruction reaches the execution stage, the branch will be either taken or not

   •   If taken, the next instruction should be fetched from the branch target address

   •   When a branch is taken for the first time, the execution unit provides feedback to
       the branch prediction logic and the branch target address is recorded in the BTB

   •   A directory entry is made containing the source memory address and the history bits.

   •   The history bits indicate one of 4 possible states:
       •   Strongly Taken

       •   Weakly Taken

       •   Weakly Not Taken

       •   Strongly Not Taken

•   Strongly Taken:

       •   The history bits are initialized to this state when the entry is made first

       •   If a branch marked weakly taken is taken again, it is upgraded to strongly
           taken
       •   When a branch marked strongly taken is not taken the next time, it is
           downgraded to weakly taken

•   Weakly Taken

       •   If a branch marked weakly taken is taken again, it is upgraded to strongly taken

       •   When a branch marked weakly taken is not taken the next time, it is
           downgraded to weakly not taken

•   Weakly Not Taken

       •   If a branch marked weakly not taken is taken again, it is upgraded to weakly
           taken

       •   When a branch marked weakly not taken is not taken the next time, it is
           downgraded to strongly not taken

•   Strongly Not Taken

       •   If a branch marked strongly not taken is taken again, it is upgraded to weakly
           not taken

       •   When a branch marked strongly not taken is not taken the next time, it remains
           in strongly not taken state
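The four states above behave like a 2-bit saturating counter, a common way to model this kind of predictor. The numeric encoding 0–3 below is an assumption for illustration, not a documented Pentium detail.

```python
# Sketch of the four-state branch history as a 2-bit saturating counter
# (3 = strongly taken ... 0 = strongly not taken; encoding assumed).
STRONG_NT, WEAK_NT, WEAK_T, STRONG_T = 0, 1, 2, 3

def update(state, taken):
    """Move one step toward taken (up) or not-taken (down), saturating at the ends."""
    if taken:
        return min(state + 1, STRONG_T)
    return max(state - 1, STRONG_NT)

def predict_taken(state):
    """Predict taken from either of the two 'taken' states."""
    return state >= WEAK_T

# A new BTB entry starts strongly taken (as described above):
state = STRONG_T
state = update(state, taken=False)   # downgraded to weakly taken
```

Two consecutive mispredictions are needed to flip the prediction direction, which is what makes the scheme tolerant of a single anomalous branch outcome.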
•   During the D1 stage of decode, if the branch is predicted not taken, no action is taken
    at this point. If it is predicted taken, the BTB supplies the branch target address
    back to the prefetcher and indicates that a positive prediction is being made. In
    response, the prefetcher switches to the opposite prefetch queue and immediately
    begins to prefetch from the branch target address

•   During the execution stage, the branch will either be taken or not. The results of the
    branch are fed back to the BTB and the history bits are upgraded or downgraded

Cache Organization of Pentium

•   Pentium employs two separate internal cache memories: one for instructions and the
    other for data.

Cache Background

•   Cache is a special type of high-speed RAM that is used to:

       – help speed up access to memory and

       – reduce traffic on the processor’s busses.

•   An on-chip cache is used to feed instructions and data to the CPU’s pipeline

•   An external cache is used to speed up main memory access.

•   Two characteristics of a running program pave the way for performance improvement
    when cache is used:

    Temporal Locality: When we access a memory location, there is a good chance that
    we may access it again.

    Spatial Locality: When we access one location, there is a good chance that we access
    the next location.
•   Consider the following loop of instructions:

           MOV  CX,1000
           SUB  AX,AX
    NX:    ADD  AX,[SI]
           MOV  [SI],AX
           INC  SI
           LOOP NX

•   The loop will get executed 1000 times.

•   If the cache is initially empty, each instruction fetch generates a miss in the cache
    and the instruction is read from main memory.

•   The next 999 passes will generate hits for each instruction and the speed is improved.

•   When a miss occurs, the cache reads a copy of a group of locations; this group is
    called a line of data

•   During data accesses (ADD AX,[SI] and MOV [SI],AX), a miss causes a line of data to
    be read, resulting in faster subsequent data accesses.
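The hit/miss behaviour of the loop above can be illustrated with a toy cache model. The byte addresses below are hypothetical, chosen so the whole loop body fits in one 32-byte line, and the model assumes a cache large enough never to evict.

```python
# Toy model: count hits and misses for a stream of byte addresses,
# assuming the cache can hold every line it ever loads (no evictions).
def count_hits(addresses, line_size=32):
    cached_lines, hits, misses = set(), 0, 0
    for addr in addresses:
        line = addr // line_size          # which cache line this byte falls in
        if line in cached_lines:
            hits += 1
        else:
            misses += 1
            cached_lines.add(line)        # a miss loads the whole line
    return hits, misses

# 4 loop instructions at hypothetical addresses, fetched 1000 times:
loop_body = [0x100, 0x102, 0x104, 0x106]
hits, misses = count_hits(loop_body * 1000)
```

Only the very first fetch misses (loading the line), and the remaining 3999 fetches hit, which is the temporal-locality payoff described above.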

•   In data writes, the behaviour depends on the policy used by a particular system. There
    are 2 common policies:

1. Writeback:

       – Write results only to cache

       – It results in fast writes but leaves main memory data out of date

2. Writethrough:

       – Write results to cache and main memory

       – It maintains valid data in main memory but slows down the execution

•   When cache is full, a line must be replaced.

•   One algorithm used to replace the victim line is called LRU (Least Recently Used)

•   One or more bits are added to each cache entry to support LRU; these bits are updated
    during hits and examined when a victim must be chosen.
   Cache Organization

   •   Cache organization deals with how a cache with numerous entries can search them so
       quickly and report a hit if a match is found.

   •   Cache may be organized in different hardware configurations.

   •   The 3 main designs are: Direct Mapped, Fully Associative, and Set Associative

   •   A cache uses a portion of the incoming physical address to select an entry.

   •   A tag stored in the entry is compared with the remaining address bits; a match
       represents a hit

Direct Mapped Cache

Fully Associative Cache
Set Associative Cache

   •   It combines both the concepts.

   •   The entries are divided into sets containing 2, 4, 8, or more entries

   •   Two entries per set is called two-way set associative.

   •   Each entry has its own tag.

   •   A set is selected using its index

Cache Organization of Pentium

   •   The data and instruction caches are organized as two-way set-associative caches
       with 128 sets.

   •   This gives 256 entries per cache

   •   There are 32 bytes in a line resulting in 8KB of storage per cache.

   •   An LRU algorithm is used to select victims when cache is full
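Given 32-byte lines (5 offset bits) and 128 sets (7 index bits), a 32-bit address decomposes into tag, index, and offset fields. The sketch below shows this standard split; the example address is arbitrary.

```python
# Address breakdown for an 8 KB, 2-way, 128-set cache with 32-byte lines:
# 5 offset bits (32 = 2**5), 7 index bits (128 = 2**7), remaining 20 tag bits.
OFFSET_BITS, INDEX_BITS = 5, 7

def split_address(addr):
    offset = addr & 0x1F                       # low 5 bits: byte within the line
    index = (addr >> OFFSET_BITS) & 0x7F       # next 7 bits select one of 128 sets
    tag = addr >> (OFFSET_BITS + INDEX_BITS)   # top 20 bits compared against stored tags
    return tag, index, offset

tag, index, offset = split_address(0x12345678)
```

On a lookup, the index selects a set, both ways' tags in that set are compared against the address tag, and a match in either way is a hit.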
Internal Structure of Cache

   •   The tags in the data cache are triple ported (they can be accessed from 3 different
       places at the same time)

   •   Two of these are for u and v pipelines.

   •   The third port is for a special operation called bus snooping (used to maintain
       consistent data in a multiprocessor system)

   •   Each entry in the data cache can be configured for writethrough or writeback.

   •   The instruction cache is write protected to prevent self-modifying code from changing
       the executing program.

   •   Tags in the instruction cache are also triple ported, with two ports for split-line
       access (the upper half and lower half of each line are read simultaneously) and a
       third port for bus snooping

   •   Parity bits are used in each cache to maintain data integrity

   •   Each tag has its own parity bit and each byte has a parity bit.

Translation Lookaside Buffer

   •   The TLB translates a logical address to a physical address (the actual address in
       main memory)

   •   TLBs are caches themselves.

   •   The data cache has 2 TLBs. The first is 4-way set associative with 64 entries.

   •   The lower 12 bits of the address pass through unchanged.

   •   The upper 20 bits of the virtual address are checked against 4 tags and, on a hit,
       translated into the upper 20 bits of the physical address.

   •   The second TLB is 4 way set associative with 8 entries and handles 4MB pages

   •   Both TLBs are parity protected and dual ported

   •   The instruction cache uses a single 4-way set associative TLB with 32 entries. Both
       4KB and 4MB pages are supported. Parity bits are used to maintain data integrity.

   •   Entries in all TLBs use a 3-bit LRU counter
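The 20/12-bit split described above can be sketched for a 4 KB-page TLB. The dictionary below is a hypothetical stand-in for the TLB contents; a real miss would trigger a page-table walk.

```python
# Sketch of 4 KB-page TLB translation: the low 12 bits pass through
# unchanged, the upper 20 bits are looked up as a virtual page number.
def translate(vaddr, tlb):
    page = vaddr >> 12         # upper 20 bits: virtual page number
    offset = vaddr & 0xFFF     # lower 12 bits: offset within the page
    if page in tlb:            # TLB hit: substitute the physical page number
        return (tlb[page] << 12) | offset
    return None                # TLB miss: the page tables must be walked

# Hypothetical mapping of virtual page 0x00400 to physical page 0x12345:
paddr = translate(0x00400ABC, {0x00400: 0x12345})
```

The offset bits (0xABC here) survive translation untouched, which is why the 4 KB page size fixes the 12-bit split.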

Cache Coherency in Multiprocessor System

   •   When multiple processors are used in a single system, there needs to be a mechanism
       whereby all processors agree on the contents of shared cache information.

   •   For example, two or more processors may utilize data from the same memory
       location, X.

   •   Each processor may change the value of X; which value of X should then be
       considered valid?
                 A multiprocessor system with incoherent cache data

•   Intel’s mechanism for maintaining cache coherency in its data cache is called the
    MESI (Modified/Exclusive/Shared/Invalid) protocol.

•   This protocol uses two bits stored with each line of data to keep track of the state of
    cache line.

•   The four states are defined as follows:

•   Modified:

       •   The current line has been modified and is only available in a single cache.

•   Exclusive:

       •   The current line has not been modified and is only available in a single cache

       •   Writing to this line changes its state to modified

•   Shared:

       •   Copies of the current line may exist in more than one cache.

       •   A write to this line causes a writethrough to main memory and may invalidate
           the copies in the other caches

•   Invalid:

       •   The current line is empty
          •   A read from this line will generate a miss

          •   A write will cause a writethrough to main memory

   •   Only the shared and invalid states are used in code cache.

   •   MESI protocol requires Pentium to monitor all accesses to main memory in a
       multiprocessor system. This is called bus snooping.

   •   Consider the above example.

   •   If the Processor 3 writes its local copy of X(30) back to memory, the memory write
       cycle will be detected by the other 3 processors.

   •   Each processor will then run an internal inquire cycle to determine whether its data
       cache contains the address of X.

   •   Processors 1 and 2 then update their caches based on their individual MESI states.

   •   Inquire cycles examine the code cache as well (as code cache supports bus snooping)

   •   The Pentium’s address lines are used as inputs during an inquire cycle to accomplish
       bus snooping.
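A simplified subset of the MESI transitions described above can be sketched as follows. This is not a complete protocol model: only the write behaviours explicitly stated in the notes are covered, and read transitions are omitted.

```python
# Hedged sketch of MESI write transitions for one cache line,
# covering only the behaviours stated in the notes above.
def on_local_write(state):
    if state == "E":
        return "M"    # exclusive line can be modified without a bus cycle
    if state == "S":
        return "S"    # shared write goes through to main memory (writethrough)
    if state == "I":
        return "I"    # invalid line: the write goes through to main memory
    return state      # modified line stays modified

def on_snooped_write(state):
    """Another processor wrote this line: our copy is no longer valid."""
    return "I"
```

Snooping is what drives `on_snooped_write`: when a memory write by another processor is detected, the inquire cycle checks the local caches and invalidates any matching line.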

Cache Instructions

   •   Three instructions are provided to allow the programmer some control over the cache

   •   These instructions are:

          •   INVD(Invalidate Cache)

          •   INVLPG(Invalidate TLB entry)

          •   WBINVD(Write back and invalidate cache)

   •   INVD effectively erases all information in the data cache. Any values not previously
       written back will be lost when INVD executes.

   •   This problem can be avoided by using WBINVD which first writes back any updated
       cache entries and then invalidates them

   •   INVLPG invalidates the TLB entry associated with a supplied memory operand

All these cache operations are performed automatically by the Pentium.

No programming code is needed to make the cache work.

Bus Operations

   •   Some of the operations performed over its address and data busses are:
          – Data transfers(single or burst cycles)

          – Interrupt Acknowledge cycles

          – Inquire cycles

          – I/O operations

Decoding a Bus Cycle

   •   There are 6 possible states the Pentium bus may be in and they are TI, T1, T2, T12,
       T2P and TD

   •   TI is the idle state and indicates that no bus cycle is currently running. The bus begins
       in this state after reset.

   •   During the first state T1, a valid address is output on the address lines.

   •   During the second state T2, data is read or written.

   •   T12 state indicates that the processor is starting the second bus cycle at the same time
       that data is being transferred for the first.

   •   State T2P continues the bus cycle started in T12.

   •   TD is used to insert a dead state between two consecutive cycles to give time for the
       system bus to change states
•   The bus state controller follows a predefined set of transitions in the form of state
    diagram shown:

•   The transitions between states are defined as follows:

(0) No Request Pending

(1) New bus cycle started & ADS# is asserted

(2) Second clock cycle of the current bus cycle

(3) Stays in T2 until BRDY# is active or new bus cycle is requested

(4) Go back to T1 if a new request is pending

(5) Bus cycle complete: go back to the idle state

(6) Begin second bus cycle

(7) Current cycle is finished and no dead clock is needed

(8) A dead clock is needed after the current cycle is finished

(9) Go to T2P to transfer data
(10) Wait in T2P until data is transferred

(11) Current cycle is finished and no dead clock is needed

(12) A dead clock is needed after the current cycle is finished

(13) Begin a pipelined bus cycle if NA# is active

(14) No new bus cycle is pending
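The explicitly stated transitions can be collected into a small table-driven state-machine sketch. Edges not spelled out in the notes are omitted, and the assumption that transition (14) leaves TD for TI is marked in the code.

```python
# Partial sketch of the bus state machine: only transitions explicitly
# described in the notes are included; keys are (state, transition number).
TRANSITIONS = {
    ("TI", 0): "TI",    # no request pending: stay idle
    ("TI", 1): "T1",    # new bus cycle started, ADS# asserted
    ("T1", 2): "T2",    # second clock of the current bus cycle
    ("T2", 3): "T2",    # wait for BRDY# or a new bus cycle request
    ("T2", 4): "T1",    # new request pending: back to T1
    ("T2", 5): "TI",    # bus cycle complete: back to idle
    ("T2", 6): "T12",   # begin second (pipelined) bus cycle
    ("T12", 9): "T2P",  # go to T2P to transfer data
    ("T2P", 10): "T2P", # wait in T2P until data is transferred
    ("TD", 14): "TI",   # no new cycle pending after dead clock (source state assumed)
}

def step(state, transition):
    """Follow a labelled edge; unknown combinations leave the state unchanged."""
    return TRANSITIONS.get((state, transition), state)
```

For example, a single non-pipelined cycle traces TI → T1 → T2 → TI via transitions 1, 2, and 5.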

