
The Microarchitecture of the Intel Pentium 4 Processor on 90nm Technology
   Darrell Boggs, Aravindh Baktha, et al. Desktop Platform Group, Intel Corp.
                           Intel Technology Journal, Volume 8, Issue 1, 2004

                                             Presentation by Min-woo, Song
                     Intel® NetBurst™
•   Intel NetBurst
     – Performance to scale efficiently to high frequencies
     – The foundation for future IA-32 processors to deliver
       industry leading performance for the next several years
     – The design efforts
          Hyper-pipelining to enable higher processor frequencies
          Keeping the high-frequency execution units busy by increasing the system bandwidth
           and overlapping computations with memory access
          Enhanced OOO execution engine capable of finding more instructions to execute with
           deeper OOO resources
          Reducing the number of instructions needed to complete a task or program

     – Execution Trace cache
     – Out-of-Order Core
     – Rapid Execution Engine(REE)
• Innovative features

   – 400-MHz system bus
        A 64-bit system bus clocked at 100 MHz, with a buffering scheme
        Quad-pumped: 4 transfers per clock × 8 bytes = 3.2 GB/s
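
The quad-pumped figure can be checked with a few lines of arithmetic. This is a minimal sketch; the function name and its parameters are illustrative and do not come from the slides.

```python
def peak_bandwidth_bytes_per_s(clock_hz, transfers_per_clock, width_bits):
    """Peak (not sustained) bus bandwidth in bytes per second."""
    return clock_hz * transfers_per_clock * width_bits // 8

# 100 MHz base clock, four transfers per clock, 64-bit data path (slide values).
system_bus = peak_bandwidth_bytes_per_s(100_000_000, 4, 64)
print(f"{system_bus / 1e9:.1f} GB/s")
```

The same helper applies to any interface described as clock × transfers per clock × width.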
• Innovative features (cont’d)
   – Advanced Transfer Cache
          On-die L2 cache: 256 KB, 8-way set associative, 128-byte cache line
          64-byte interface to the system
          256-bit (32-byte) interface to the core (48 GB/s = 32 bytes × 1.5 GHz)
          7-clock read latency
   – Hyper-pipelined technology

        20-stage pipeline (Pentium 3 & Athlon 11 : 10-stage pipeline)
        Increased processor performance and frequency scalability
   – Rapid Execution Engine
        4 fast execution units (execution of certain instructions in ½ core clock)
          - 2 double-pumped ALUs and AGUs
        1 MMX and 1 SSE units
          - 2 MMX and 2 SSE units in Pentium 3
• Innovative features (cont’d)
   – Advanced Dynamic Execution
       Deep, OOO speculative execution engine
       Instruction window of up to 126 instructions (42 instructions in Pentium 3)
       Enhanced branch prediction capability
           33% fewer mispredictions than Pentium 3
           4-KB BTB (8 times as large as one in Pentium 3)
   – Execution Trace Cache
       Primary instruction cache
       Stores decoded IA-32 instructions (u-ops) in program execution order
       Holds up to 12K u-ops and delivers up to 3 u-ops per cycle
   – Streaming SIMD Extension 2 (SSE2)
       Adds 144 instructions for 128-bit integer arithmetic and double-precision FP
                            Pentium 4
• Additional Pentium 4 components
   – L1 cache
        Only 8-KB data L1 cache
             16-KB instruction L1 cache and 16-KB data L1 cache in Pentium 3
             1/8 of Athlon’s
        4-way set associative, 64-byte line
        LRU
   – Branch prediction
        predicts all near branches
        BTB and branch history table
        The static predictor – forward not-taken, backward taken – before BTB is
        Return Stack – 16 entries
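
The static rule and the return stack above can be sketched as follows. This is a toy model, not Intel's implementation: the class and function names are hypothetical, and dropping the oldest entry when the 16-entry stack overflows is an assumption.

```python
from collections import deque

def static_predict_taken(branch_addr, target_addr):
    """Static rule from the slide: backward branches are predicted taken,
    forward branches not taken. A branch to itself counts as backward."""
    return target_addr <= branch_addr

class ReturnStack:
    """Toy 16-entry return address predictor."""

    def __init__(self, entries=16):
        # A bounded deque silently drops the oldest entry on overflow
        # (assumed behavior; the slides do not specify it).
        self._stack = deque(maxlen=entries)

    def on_call(self, return_addr):
        self._stack.append(return_addr)

    def predict_return(self):
        # Most recent call returns first; None means "no prediction".
        return self._stack.pop() if self._stack else None

# A loop-closing branch jumps backward, so it is predicted taken.
print(static_predict_taken(branch_addr=0x4010, target_addr=0x4000))
```

In this model the static rule would only apply when the dynamic predictor has no history for the branch.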
• Block diagram
           Hyper-Threading (HT)
• Characteristics
   – First implemented in Intel® Xeon™ processors
   – System requirements for Pentium 4
          The Intel Pentium 4 processor at 3.06 GHz or higher
          An Intel® chipset that supports HT Technology
          System BIOS supports HT Technology and has it enabled
          An operating system that includes optimizations for HT Technology
              Microsoft Windows: Windows XP (Professional and Home Edition) is optimized
              Linux: working with the Linux community to get the necessary optimizations for HT
               Technology included in distributions
• Concept
   – Hyper-Threading Technology appears to software as multiple (2) logical
     processors in one physical processor package.
   – For each logical processor,
        Having a complete set of architectural registers
        Sharing a single physical processor's resources
        Responding to interrupts independently
   – Increased utilization of the execution resources within each physical
     processor package
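
A toy calculation illustrates why sharing execution resources between two logical processors can raise utilization. The per-cycle "ready uop" counts below are hypothetical numbers chosen for illustration, and the model ignores carry-over of unissued uops and any resource contention.

```python
def issue_slot_utilization(ready_per_thread, slots_per_cycle):
    """Fraction of issue slots used when the listed threads share one core.

    ready_per_thread: one list per thread giving the number of ready uops
    in each cycle (all lists must have the same length).
    """
    cycles = list(zip(*ready_per_thread))
    used = sum(min(sum(ready), slots_per_cycle) for ready in cycles)
    return used / (len(cycles) * slots_per_cycle)

# Hypothetical workloads: ready uops per cycle for two threads.
thread_a = [1, 0, 2, 1]
thread_b = [2, 1, 0, 3]

single = issue_slot_utilization([thread_a], slots_per_cycle=3)
shared = issue_slot_utilization([thread_a, thread_b], slots_per_cycle=3)
print(f"one thread: {single:.2f}, two threads: {shared:.2f}")
```

When one thread stalls, the other can fill the otherwise idle slots, which is the effect the slide describes.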
         Hyper-Threading (HT)
• Resource Utilization
                 i850 Chipset
• MCH and ICH2

                  Intel Pentium 4 system
              Out-of-Order Core
• Main responsibility is to extract parallelism
• Heart of OOO core is uop scheduling
• Scheduler
   – Determines when a uop is ready to execute by tracking its input register operands
   – Schedules ready uops when execution resources are available (in data-dependent order)
• 5 different schedulers
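
Data-dependent scheduling can be sketched with a small model. It is a simplification under stated assumptions: every uop takes one cycle, each destination register is written once (as after renaming), and a single scheduler issues a fixed number of uops per cycle. The `Uop` type and `schedule` function are hypothetical names, and the five separate schedulers are not modeled.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Uop:
    name: str
    srcs: tuple   # input register names
    dst: str      # output register name

def schedule(uops, ready_regs, units_per_cycle=2):
    """Return a list of cycles, each listing the uop names issued that cycle."""
    ready = set(ready_regs)
    waiting = list(uops)          # kept in program order
    cycles = []
    while waiting:
        # A uop may issue once all of its inputs have been produced.
        issued = [u for u in waiting
                  if all(s in ready for s in u.srcs)][:units_per_cycle]
        if not issued:
            raise ValueError("a uop depends on a register nobody produces")
        cycles.append([u.name for u in issued])
        for u in issued:
            waiting.remove(u)
        # Results become visible to dependents in the next cycle.
        ready.update(u.dst for u in issued)
    return cycles

program = [
    Uop("a", ("r0",), "r1"),
    Uop("b", ("r1",), "r2"),       # depends on a
    Uop("c", ("r0",), "r3"),       # independent of a and b
    Uop("d", ("r2", "r3"), "r4"),  # depends on b and c
]
print(schedule(program, ready_regs={"r0"}))
```

In this run uop `c` issues before `b` even though it comes later in program order, because its input is already available.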
 Rapid Execution Engine(REE)
• The REE executes up to six uops per main clock cycle.
• L1 data cache: 8-way set associative, write-through, 64-byte cache lines
• The cache uses unique access algorithms for low latency
   – Almost all accesses hit the L1 cache and DTLB
                   Front End
• Static prediction
• Dynamic prediction
• Set of instructions
               Execution Core
•   Shifter/rotator block
•   Integer multiply
•   L1 data cache
•   Scheduler
•   Load uop instruction
•   SSE/SSE2/SSE3 instructions
                Memory System
• Focus on
   – Trying to reduce amount of time spent waiting for data to be
     fetched from DRAM
   – Increasing the size of critical resources
   – Increase size of cache, 1MB L2 cache
   – 8-way set associative cache, 128-byte lines
• Software prefetch
   – Inserted by the programmer before the data is used
   – Special fault-handling logic
• Hardware prefetch scheme
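
The 1-MB, 8-way, 128-byte-line L2 geometry listed above fixes how an address splits into tag, set index, and line offset. The sketch below assumes plain power-of-two indexing on the address; the slides do not say how the hardware actually indexes the cache, and the function names are illustrative.

```python
def cache_sets(size_bytes, ways, line_bytes):
    """Number of sets in a set-associative cache."""
    return size_bytes // (ways * line_bytes)

def split_address(addr, size_bytes=1 << 20, ways=8, line_bytes=128):
    """Split an address into (tag, set_index, line_offset).

    Assumes power-of-two sizes and simple modulo indexing.
    """
    sets = cache_sets(size_bytes, ways, line_bytes)
    offset_bits = line_bytes.bit_length() - 1
    index_bits = sets.bit_length() - 1
    offset = addr & (line_bytes - 1)
    set_index = (addr >> offset_bits) & (sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, set_index, offset

print(cache_sets(1 << 20, 8, 128))   # sets in the 1-MB L2
print(split_address(0x12345678))
```

With these parameters the cache has 1024 sets, so 7 offset bits and 10 index bits are taken from the low end of the address.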
          Complex Arithmetic
• Acoustic Echo Canceller
• 5 instructions added
   – addsubps, addsubpd
   – movsldup, movshdup, movddup
• SSE3
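
To show how these instructions help complex arithmetic, the sketch below models their lane behavior on 4-element Python lists and uses them to multiply two pairs of complex numbers. It is a behavioral model only: the interleaved [re, im, re, im] layout, the helper names, and the use of a separate shuffle to swap real and imaginary parts are assumptions, and no real SIMD instructions are executed.

```python
def movsldup(a):
    """Duplicate the even-indexed lanes: [a0, a0, a2, a2]."""
    return [a[0], a[0], a[2], a[2]]

def movshdup(a):
    """Duplicate the odd-indexed lanes: [a1, a1, a3, a3]."""
    return [a[1], a[1], a[3], a[3]]

def addsubps(a, b):
    """Subtract in even lanes, add in odd lanes."""
    return [a[0] - b[0], a[1] + b[1], a[2] - b[2], a[3] + b[3]]

def mulps(a, b):
    """Lane-wise multiply (models an SSE packed multiply)."""
    return [x * y for x, y in zip(a, b)]

def swap_pairs(a):
    """Swap real and imaginary parts; real code would use a shuffle."""
    return [a[1], a[0], a[3], a[2]]

def complex_mul_packed(a, b):
    """Multiply two packed complex pairs stored as [re0, im0, re1, im1]."""
    t1 = mulps(movsldup(a), b)              # [ar*br, ar*bi, ...]
    t2 = mulps(movshdup(a), swap_pairs(b))  # [ai*bi, ai*br, ...]
    return addsubps(t1, t2)                 # [ar*br - ai*bi, ar*bi + ai*br, ...]

# (1+2j)*(3+4j) and (0+1j)*(0+1j), packed side by side.
print(complex_mul_packed([1, 2, 0, 1], [3, 4, 0, 1]))
```

The subtract-in-one-lane, add-in-the-next pattern of addsubps matches the real and imaginary parts of a complex product, which is why no sign-flipping step is needed.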
       Thread Synchronization
• HT processing instructions
   – Monitor
      • Sets up hardware to detect stores to an address range, generally a cache line
          – Arming sets a state called the monitor event pending flag
      • Its value is not visible to software except through mwait
   – Mwait
      • Puts the processor into a special low-power/optimized state

   – Monitor & Mwait instructions must be coded in the same loop
      • Because executing mwait clears the monitor address state
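
The interaction between the two instructions can be illustrated with a toy state machine. This is a single-threaded behavioral model under assumptions, not the hardware definition: the 64-byte monitored range, the class name, and the rule that a store between monitor and mwait makes mwait fall through are modeling choices.

```python
CACHE_LINE = 64  # assumed size of the monitored range, in bytes

class MonitorModel:
    def __init__(self):
        self.line = None      # monitored line number, None when not armed
        self.pending = False  # stands in for the monitor event pending flag

    def monitor(self, addr):
        """Arm the monitor on the line containing addr."""
        self.line = addr // CACHE_LINE
        self.pending = True

    def store(self, addr):
        """A store to the monitored line resets the pending flag."""
        if self.line is not None and addr // CACHE_LINE == self.line:
            self.pending = False

    def mwait(self):
        """Return True if the processor would enter the wait state.

        Executing mwait clears the monitor state, so the caller must
        re-arm with monitor() before waiting again.
        """
        would_wait = self.pending
        self.line = None
        self.pending = False
        return would_wait

m = MonitorModel()
m.monitor(0x1000)
print(m.mwait())   # armed and no store seen
print(m.mwait())   # not re-armed
```

The second call shows why the pair belongs in one loop: without a fresh monitor, mwait has nothing to wait on.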
    Desktop performance Expectation
• Clock frequency and Performance
   – Raising the clock frequency did not yield a proportional increase in
     performance

   – Greater performance scaling in Pentium 4
        Recompilation/linking using the latest NetBurst MA-optimized compilers
         and libraries
